linux-kernel.vger.kernel.org archive mirror
* [RFC] Optimization for use-once pages
@ 2001-07-24  3:47 Daniel Phillips
  2001-07-24 12:38 ` jlnance
                   ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-24  3:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rik van Riel, Marcelo Tosatti, Ben LaHaise, Mike Galbraith

Today's patch tackles the use-once problem, that is, the problem of
how to identify and discard pages containing data likely to be used 
only once in a long time, while retaining pages that are used more 
often.

I'll try to explain not only what I did, but the process I went
through to arrive at this particular approach.  This requires some 
background.

Recent History of Linux Memory Management
-----------------------------------------

Rik van Riel developed our current memory management regime early last 
fall, more or less on a dare by Linus.[1]  At that time the big 
question was, could the poorly-performing 2.3 memory manager be fixed 
up in some way or should it be rewritten from scratch?  Rik favored a 
complete rewrite in the 2.5 timeframe whereas Linus wanted to see 
incremental improvements to the existing code base.  After a little 
flam^H^H^H^H encouragement from Linus, Rik did manage to produce a 
patch that improved the performance of 2.3 considerably.

Rik's patch was loosely inspired by Matt Dillon's FreeBSD work.[2]   At 
that time, FreeBSD enjoyed a clear lead in memory management 
performance over Linux, so it made sense to use his work as a model.  
What Rik came up with though is really his own creation[3] and differs 
in a number of fundamental respects from Matt's work.  After a few 
months of working out minor problems this is essentially what we have 
now.  It works well under typical loads but has tended to suffer from 
variable or even horrible performance under certain rarer loads and 
machine configurations.  Over time some of the problems have been 
tracked down and fixed, the most recent being Marcelo's identification 
and eradication of the zone-imbalance performance bug.[4]

During this whole period, I've been studying Rik's memory manager with 
a view to understanding how it does what it does, and thus, how to go 
about improving it.  I'll try to describe the essential details now, as I 
see them.

What Rik's Memory Manager Does
------------------------------

Every physical page of memory in the system can be on at most one of 
three "page_lru" lists.  They are:

   1) active
   2) inactive_dirty (misnomer - has both clean and dirty pages)
   3) inactive_clean (misnomer - really is 'inactive_bufferless')

List (1) is an unordered ring while list (3) is a fifo queue.  List 
(2) is somewhere in between, a fifo for clean pages and a ring for 
dirty pages.  The transitions are:

   active -> inactive_dirty, if a page is an eviction candidate
   inactive_dirty -> inactive_clean, if a page is not dirty
   inactive_clean -> active, if a page is referenced
   inactive_dirty -> active, if a page is referenced
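As a concrete (purely illustrative) model, the lists and transitions 
above can be sketched in ordinary user-space C.  All names and fields 
here are invented for the sketch; none of this is the kernel's own 
code:

```c
#include <assert.h>

/* Illustrative user-space model of the three page lists and the four
 * transitions listed above.  Names are invented for this sketch. */

enum page_list { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN };

struct model_page {
	enum page_list list;
	int referenced;		/* touched since the last scan? */
	int dirty;		/* needs writeout before eviction? */
	int age;		/* eviction candidate when this hits zero */
};

/* Apply one transition to a page, following the four rules above.
 * Returns 1 if the page moved, 0 if it stayed put. */
static int transition(struct model_page *p)
{
	switch (p->list) {
	case ACTIVE:
		if (p->age == 0) {		/* eviction candidate */
			p->list = INACTIVE_DIRTY; /* even if clean */
			return 1;
		}
		return 0;
	case INACTIVE_DIRTY:
		if (p->referenced) {		/* rescued */
			p->list = ACTIVE;
			return 1;
		}
		if (!p->dirty) {
			p->list = INACTIVE_CLEAN;
			return 1;
		}
		return 0;			/* dirty: writeout first */
	case INACTIVE_CLEAN:
		if (p->referenced) {		/* rescued */
			p->list = ACTIVE;
			return 1;
		}
		return 0;			/* stays until evicted */
	}
	return 0;
}
```

The only point of the model is that a page moves between the three 
lists along exactly the four arrows listed above, and nothing else.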

The memory manager's job is to scan each of these lists from time to 
time and effect the appropriate transitions.  To decide which pages 
should be considered for eviction, the memory manager walks around the 
active ring "aging" pages, "up" if they have been referenced since the 
last scan and "down" otherwise.  A page whose age has reached zero is 
considered to be a good candidate for eviction (a "victim") and is 
moved to the inactive_dirty queue (note: regardless of whether it is 
actually dirty).  The memory manager also scans the "old" end of the 
inactive_dirty queue.  Any page that has been used since being put on 
the queue is moved to the active ring.  Unreferenced clean pages have 
their buffers (if any) removed and are moved to the inactive_clean 
queue.  Unreferenced dirty pages are scheduled for writeout and moved 
back to the "young" end of the queue, to be rescanned later.  

All of this scanning is driven by demand for pages.  To anticipate 
future demand, the memory manager tries to keep the inactive_clean list 
longer than a certain minimum, and this in turn determines the rate at 
which the other two lists are scanned.
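The aging walk itself can be sketched the same way; the age-up and 
age-down arithmetic and the constant below are invented stand-ins, not 
the kernel's actual values:

```c
#include <assert.h>

/* Hypothetical aging constant; the real kernel uses its own values. */
#define AGE_MAX 64

struct aged_page {
	int referenced;		/* touched since the last scan? */
	int age;
};

/* One aging step for a page on the active ring: age up when the page
 * was referenced since the last pass, down otherwise.  Returns 1 when
 * the age has reached zero, i.e. the page is a victim to be moved to
 * the inactive_dirty queue (regardless of whether it is dirty). */
static int age_page_step(struct aged_page *p)
{
	if (p->referenced) {
		p->age += 3;		/* "age up" */
		if (p->age > AGE_MAX)
			p->age = AGE_MAX;
		p->referenced = 0;	/* consume the reference bit */
	} else if (p->age > 0) {
		p->age /= 2;		/* "age down" */
	}
	return p->age == 0;
}
```

Note how the mixing happens: a frequently referenced page accumulates 
age, so it survives more unreferenced scans than a page touched only 
recently.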

In this short tour of Linux's memory manager I've glossed over many 
subtle details and complexities.  Sorry.  This is just to provide some 
context for discussing the problem at hand, and my approach to solving 
it.  Now on to the theoretical part.

Why it works
------------

The two fifo queues can be thought of as a single, two-stage queue.  So 
what we have is an unordered ring and a queue.  As described above, the 
ring is used to find eviction candidates via the aging process, which 
gives us information about both how frequently and how recently a given 
page was used, mixed together inseparably.  Ultimately, being recently 
used is more important than being frequently used, so to find out 
whether a page really isn't being used we put it on the inactive queue. 
If it goes all the way from one end of the inactive queue to the other 
without being referenced we then know quite a lot more about it, and 
can evict it with a reasonable expectation that it will not need to be 
re-read immediately.  (Notice how this is all kind of approximate - 
memory management is like that.)

Why it doesn't always work
--------------------------

This is the tough question.  If we knew the whole answer, it would 
already be working perfectly under all conditions, or we would have 
moved on to a different design.  Recently though, some of the more 
obscure problems have begun to yield to analysis.  In particular, 
Marcelo demonstrated how important it is to know what the memory manager 
is actually doing, and developed a statistical reporting interface 
(soon to be included in the standard kernel) which allowed him to 
isolate and fix the obscure zone-balancing problem.  There are also 
more of us coming up to speed on understanding the memory manager, both 
the part inherited from 2.2 and the new part designed by Rik.

Use-once Pages
--------------

One of the things we want the memory manager to do is to assign very 
low priority to data that tends to be used once and then not again for 
a long time.  Pages containing such data should be identified and 
evicted as soon as possible.  Typically, these "use-once" pages are 
generated by file IO, either reads or writes, and reside in the page 
cache.  The question is, how do we distinguish between "use-once" file 
data and file data that we should keep in memory, expecting it to be 
used more than once in a short period?  Up till now we've used a simple 
strategy: always expect that file data is "use-once" and queue it for 
eviction immediately after being used.  If it really is going to be 
used more than once, then hopefully it will be used again before it 
reaches the other end of the inactive queue and thus be "rescued" from 
eviction.

Rik added code to the memory manager to implement this idea, which he 
called "drop-behind", and system performance did improve.  (I verified 
this by disabling the code, and sure enough, it got worse again.)  But 
there are several problems with the drop-behind strategy:

  - Deactivates even popular file pages - every time a popular
    file page is touched by generic_file_read or generic_file_write
    it gets deactivated, losing historical information and putting
    the page at risk of being evicted, should its next access come
    a little too late.

  - Busy work - readahead pages circulate on the active list before
    being used and deactivated, which does nothing other than
    consume CPU cycles.

  - Not general - drop-behind is a heuristic specific to file I/O and
    shows a bias towards one particular access pattern, perhaps at
    the expense of, say, a database-like access pattern.  If for
    some reason, less than an entire file is read, the readahead
    pages will not be deactivated.

  - Doesn't do anything for memory-mapped files.

LRU/2 and 2Q
------------

Enter the LRU/2 Algorithm[5] which provides a specific prescription 
aimed at optimizing for database loads with a mix of use-once pages and 
use-often pages.  The idea is to choose for eviction the page whose 
second-most-recent access was furthest in the past, and not pay any 
attention at all to the time of the most recent access.  This algorithm 
has been shown to be more efficient than classic LRU for real-world 
loads but is difficult to implement efficiently.  The 2Q algorithm, 
discussed on lkml recently, approximates the behavior of LRU/2, 
essentially by ignoring the first use of a page then managing all pages 
referenced at least twice using a normal LRU.  Both LRU/2 and 2Q show 
significant improvements over LRU, on the order of 5-15%, in testing 
against captured real-world database access patterns.
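For illustration, here is a toy user-space sketch of the "simplified 
2Q" idea described above: a first use parks a page on a FIFO, and only 
a second use promotes it to the main LRU.  The capacities are invented 
and the linear scans are for brevity; a real implementation would use 
linked lists and hashing:

```c
#include <assert.h>
#include <string.h>

#define A1_CAP 2	/* hypothetical capacities */
#define AM_CAP 3

/* Index 0 is the oldest (LRU/FIFO head) end of each array. */
static int a1[A1_CAP], a1_n;	/* FIFO of once-used pages */
static int am[AM_CAP], am_n;	/* LRU of pages used twice or more */

static int find(int *q, int n, int id)
{
	for (int i = 0; i < n; i++)
		if (q[i] == id)
			return i;
	return -1;
}

static void remove_at(int *q, int *n, int i)
{
	memmove(q + i, q + i + 1, (size_t)(*n - i - 1) * sizeof(int));
	(*n)--;
}

/* Reference page 'id'.  Returns the id of an evicted page, or -1. */
static int twoq_access(int id)
{
	int evicted = -1, i;

	i = find(am, am_n, id);
	if (i >= 0) {
		remove_at(am, &am_n, i);   /* hit in Am: refresh position */
	} else {
		i = find(a1, a1_n, id);
		if (i >= 0) {
			remove_at(a1, &a1_n, i); /* 2nd use: promote */
		} else {
			if (a1_n == A1_CAP) {	/* 1st use: FIFO enqueue, */
				evicted = a1[0]; /* dropping the oldest */
				remove_at(a1, &a1_n, 0);
			}
			a1[a1_n++] = id;
			return evicted;
		}
	}
	if (am_n == AM_CAP) {		/* make room at the MRU end */
		evicted = am[0];
		remove_at(am, &am_n, 0);
	}
	am[am_n++] = id;
	return evicted;
}
```

A page read exactly once is pushed off the small FIFO without ever 
touching the main LRU - that is precisely the use-once behavior 2Q is 
after.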

My Approach
-----------

The 2Q algorithm has a major disadvantage from my point of view: it 
uses a LRU list.[6]  We don't, we use the aging strategy described
above.  There are several reasons to keep doing it this way:

  - It's more efficient to update an age than to relink a LRU item
  - Aging captures both frequency and temporal information
  - Aging offers more scope for fine tuning
  - It's what we're doing now - changing would be a lot of work

So I decided to look for a new way of approaching the use-once problem 
that would integrate easily with our current approach.  What I came 
up with is pretty simple: instead of starting a newly allocated page on 
the active ring, I start it on the inactive queue with an age of zero. 
Then, any time generic_file_read or write references a page, if its 
age is zero, set its age to START_AGE and mark it as unreferenced.

This has the effect of scheduling all file pages read or written the
first time for early eviction, but introduces a delay before the 
eviction actually occurs.  If the file page is referenced relatively 
soon thereafter it will be "rescued" from eviction as described above, 
and thereafter aged normally on the active ring.  Subsequent reads and 
writes of the same cached page age it normally and do not cause it to 
be "unreferenced" again, because its age is not zero.  (Just noticed 
while composing this: the attached patch does not provide properly for 
the case where a page deactivated by aging is subsequently 
read/written, and it should.)
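To make the gap just noted concrete: under the age-is-zero test alone, 
a brand-new page and a page whose age decayed to zero through normal 
aging are indistinguishable.  One hypothetical way to separate the two 
cases - not what the attached patch does - is an explicit "never used" 
flag, sketched here in user-space C with invented names:

```c
#include <assert.h>

#define PAGE_AGE_START 2	/* illustrative value */

struct uo_page {
	int age;
	int referenced;
	int never_used;		/* set when the page enters the cache */
};

/* Hypothetical variant of the check_used_once idea that keys off a
 * "never used" flag instead of age == 0, so a page deactivated by
 * aging keeps its reference history. */
static void check_used_once(struct uo_page *p)
{
	if (p->never_used) {		/* genuine first use */
		p->never_used = 0;
		p->age = PAGE_AGE_START;
		p->referenced = 0;	/* ignore this first reference */
	} else {
		p->referenced = 1;	/* ordinary reference */
	}
}
```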

Implementing this idea was not as easy as it sounds.  At first, I hoped 
for a one-line implementation:

-	add_page_to_active_list(page);
+	add_page_to_inactive_dirty_list(page);

but alas this was not to be.  The actual implementation required three 
days of chasing after odd effects caused by redundant tests in the page 
scanning code, and learning a lot more about the life cycles of page 
cache pages.  During this process I managed to disable aging of 
read/write pages completely, resulting in a "fifo" eviction policy.  
This degraded performance by about 60%.  (Truly remarkable not only 
because page io is far from my test case's only activity, but because
it shows just how well we are already doing versus a truly naive 
eviction policy.)  This was especially discouraging because I had 
imagined I was quite close to actually achieving some optimization.  
However, changing just one line took me from that disaster to a version 
that magically turned in a performance 12% better than I'd seen with
the existing code.

In retrospect, if I'd been smarter about this I would have applied 
Marcelo's statistical patch at the beginning of the process.  That 
would have saved me a lot of time spent hunting around in the dark, 
using only the test case running time for feedback.[7]

Test Load
---------

For a test load, I used the following:

    time -o /tmp/time1 make clean bzImage &
    time -o /tmp/time2 grep -r foo /usr/src >/dev/null 2>/dev/null

In other words, a clean make of the kernel concurrent with a grep of my 
(rather large) source directory.  By themselves, the make requires a 
little more than 12 minutes and the grep about 15 minutes.  The stock 
2.4.5 kernel turned in a performance of slightly better than 20 minutes 
for the two simultaneously: very respectable.  All of my "improved" 
kernels did worse than this, up until the last one, which did 
considerably better.  (I think the following principle applies: you 
always find what you are looking for in the last place you look.)

I chose to combine the make and grep in order to create an I/O-bound 
situation, but not without some complexity and with real work being 
done.  With optimal performance, I imagined the disk light would be on 
all, or nearly all the time.  This turned out to be the case.  
Unfortunately, it also turned out to be the case with my disastrous 
"fifo" kernel.  In the latter case, the light was always on because 
pages frequently needed to be read back in immediately after being 
evicted.

This is probably true though: any time the disk light goes out during 
this test it's trying to tell us we're not finished optimizing yet.

Timings
-------

The grep always takes longer than the make, so the numbers for the grep 
are the total elapsed time for both processes to complete and are thus 
more important.  It's nice to see that the make always runs faster with 
the patch though - this shows that the system is treating the two 
processes fairly and the grep is not starving the make too badly for 
disk bandwidth.  (However, I did note with interest that the 
disk-intensive make step involving 'tmppiggy' takes 8 times longer when 
running in parallel with the grep than by itself - there is some poor 
IO scheduling behavior to investigate there.)

Here is the executive summary:

	  2.4.5, vanilla	2.4.5+use.once.patch

make	    16:29.87		     15:46.67
grep	    20:09.15		     17:37.33

So, with the patch, the two processes run to completion 2 1/2 minutes 
faster, a savings of about 13%.  These results are quite repeatable and 
considerably better than what I had hoped for when I began this work.

Detailed timings are given below.  Interestingly, the timings for the
two processes running separately do not vary nearly as much: the make 
runs at virtually the same speed with or without the patch, and the 
grep runs 54 seconds faster, much less than the 2 1/2 minute difference 
seen when the processes run in parallel.

vanilla 2.2.14

make
475.38user 59.71system 16:27.28elapsed 54%CPU
0inputs+0outputs (461932major+513611minor)pagefaults 268swaps

grep
22.14user 71.81system 20:09.15elapsed 7%CPU
0inputs+0outputs (366064major+7236minor)pagefaults 110swaps

vanilla 2.4.5

make
305.70user 229.08system 16:29.87elapsed 54%CPU
0inputs+0outputs (460034major+533055minor)pagefaults 0swaps

grep
13.48user 85.18system 19:49.90elapsed 8%CPU
0inputs+0outputs (1839major+20261minor)pagefaults 0swaps

vanilla 2.4.5 (again)

make
306.05user 229.74system 16:34.83elapsed 53%CPU
0inputs+0outputs (457841major+531251minor)pagefaults 0swaps

grep
13.64user 84.12system 19:52.66elapsed 8%CPU
0inputs+0outputs (2104major+16537minor)pagefaults 0swaps

use.once.patch

make
306.34user 253.36system 15:46.67elapsed 59%CPU
0inputs+0outputs (451569major+522397minor)pagefaults 0swaps

grep
13.20user 81.60system 17:37.33elapsed 8%CPU
0inputs+0outputs (369major+11632minor)pagefaults 0swaps

use.once.patch (again)

make
306.30user 244.88system 15:31.69elapsed 59%CPU
0inputs+0outputs (451352major+522484minor)pagefaults 0swaps

grep
13.50user 82.88system 17:36.84elapsed 9%CPU
0inputs+0outputs (379major+14643minor)pagefaults 0swaps

vanilla 2.4.5, just the make
638.42user 27.96system 12:03.33elapsed 92%CPU
0inputs+0outputs (450134major+520691minor)pagefaults 0swaps

use.once.patch, just the make
632.39user 28.37system 11:58.59elapsed 91%CPU
0inputs+0outputs (450135major+520725minor)pagefaults 0swaps

vanilla 2.4.5, just the grep
13.08user 62.54system 13:43.16elapsed 9%CPU
0inputs+0outputs (1316major+21906minor)pagefaults 0swaps

use.once.patch, just the grep
13.00user 61.05system 12:49.37elapsed 9%CPU
0inputs+0outputs (259major+7586minor)pagefaults 0swaps

[1] http://mail.nl.linux.org/linux-mm/2000-08/msg00023.html
   (Linus goads Rik on to great things)

[2] http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
   (Rik's discussion with Matt Dillon)

[3] http://mail.nl.linux.org/linux-mm/2000-05/msg00418.html
   (Rik's RFC for his 2.3/4 memory manager design)

[4] http://marc.theaimsgroup.com/?l=linux-kernel&m=99542192302579&w=2
   (Marcelo's fix to the zone balancing problem)

[5] http://www.informatik.uni-trier.de/~ley/db/conf/vldb/vldb94-439.html
   (Original paper on the 2Q algorithm)

[6] A further disadvantage of the 2Q Algorithm is that it is oriented
towards database files with a relatively limited number of blocks in
them, as opposed to files of effectively unlimited size as are common
in a general-purpose operating system.

[7] While doing this work I like to imagine how it must feel to work 
on a tokomak prototype fusion reactor or z-pinch machine.  In these 
projects, scientists and engineers try to put theory into practice by 
squeezing ever more power out of the same, weird and wonderful piece of 
hardware.  Sometimes new ideas work as expected, and sometimes the 
plasma just goes unstable.

The Patch
---------

This is an experimental patch, for comment.  There is a small amount
of raciness in the check_used_once function, on some architectures,
but this would at worst lead to suboptimal performance.  The patch has
been completely stable for me, and it may just be my imagination, but
interactive performance seems to be somewhat improved.

To apply:

    cd /usr/src/your-2.4.5-tree
    patch <this.patch -p0

--- ../2.4.5.clean/mm/filemap.c	Thu May 31 17:30:45 2001
+++ ./mm/filemap.c	Mon Jul 23 17:42:40 2001
@@ -770,51 +770,6 @@
 #endif
 
 /*
- * We combine this with read-ahead to deactivate pages when we
- * think there's sequential IO going on. Note that this is
- * harmless since we don't actually evict the pages from memory
- * but just move them to the inactive list.
- *
- * TODO:
- * - make the readahead code smarter
- * - move readahead to the VMA level so we can do the same
- *   trick with mmap()
- *
- * Rik van Riel, 2000
- */
-static void drop_behind(struct file * file, unsigned long index)
-{
-	struct inode *inode = file->f_dentry->d_inode;
-	struct address_space *mapping = inode->i_mapping;
-	struct page *page;
-	unsigned long start;
-
-	/* Nothing to drop-behind if we're on the first page. */
-	if (!index)
-		return;
-
-	if (index > file->f_rawin)
-		start = index - file->f_rawin;
-	else
-		start = 0;
-
-	/*
-	 * Go backwards from index-1 and drop all pages in the
-	 * readahead window. Since the readahead window may have
-	 * been increased since the last time we were called, we
-	 * stop when the page isn't there.
-	 */
-	spin_lock(&pagecache_lock);
-	while (--index >= start) {
-		page = __find_page_simple(mapping, index);
-		if (!page)
-			break;
-		deactivate_page(page);
-	}
-	spin_unlock(&pagecache_lock);
-}
-
-/*
  * Read-ahead profiling information
  * --------------------------------
  * Every PROFILE_MAXREADCOUNT, the following information is written 
@@ -1033,12 +988,6 @@
 		if (filp->f_ramax > max_readahead)
 			filp->f_ramax = max_readahead;
 
-		/*
-		 * Move the pages that have already been passed
-		 * to the inactive list.
-		 */
-		drop_behind(filp, index);
-
 #ifdef PROFILE_READAHEAD
 		profile_readahead((reada_ok == 2), filp);
 #endif
@@ -1048,6 +997,15 @@
 }
 
 
+inline void check_used_once (struct page *page)
+{
+	if (!page->age)
+	{
+		page->age = PAGE_AGE_START;
+		ClearPageReferenced(page);
+	}
+}
+
 /*
  * This is a generic file read routine, and uses the
  * inode->i_op->readpage() function for the actual low-level
@@ -1163,7 +1121,8 @@
 		offset += ret;
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
-	
+
+		check_used_once (page);
 		page_cache_release(page);
 		if (ret == nr && desc->count)
 			continue;
@@ -2597,7 +2556,6 @@
 	while (count) {
 		unsigned long index, offset;
 		char *kaddr;
-		int deactivate = 1;
 
 		/*
 		 * Try to find the page in the cache. If it isn't there,
@@ -2606,10 +2564,8 @@
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-		if (bytes > count) {
+		if (bytes > count)
 			bytes = count;
-			deactivate = 0;
-		}
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -2653,8 +2609,7 @@
 unlock:
 		/* Mark it unlocked again and drop the page.. */
 		UnlockPage(page);
-		if (deactivate)
-			deactivate_page(page);
+		check_used_once(page);
 		page_cache_release(page);
 
 		if (status < 0)
--- ../2.4.5.clean/mm/swap.c	Mon Jan 22 22:30:21 2001
+++ ./mm/swap.c	Mon Jul 23 17:38:25 2001
@@ -231,11 +231,8 @@
 	spin_lock(&pagemap_lru_lock);
 	if (!PageLocked(page))
 		BUG();
-	DEBUG_ADD_PAGE
-	add_page_to_active_list(page);
-	/* This should be relatively rare */
-	if (!page->age)
-		deactivate_page_nolock(page);
+	add_page_to_inactive_dirty_list(page);
+	page->age = 0;
 	spin_unlock(&pagemap_lru_lock);
 }
 
--- ../2.4.5.clean/mm/vmscan.c	Sat May 26 02:00:18 2001
+++ ./mm/vmscan.c	Mon Jul 23 17:25:27 2001
@@ -359,10 +359,10 @@
 		}
 
 		/* Page is or was in use?  Move it to the active list. */
-		if (PageReferenced(page) || page->age > 0 ||
-				(!page->buffers && page_count(page) > 1)) {
+		if (PageReferenced(page) || (!page->buffers && page_count(page) > 1)) {
 			del_page_from_inactive_clean_list(page);
 			add_page_to_active_list(page);
+			page->age = PAGE_AGE_START;
 			continue;
 		}
 
@@ -462,11 +462,11 @@
 		}
 
 		/* Page is or was in use?  Move it to the active list. */
-		if (PageReferenced(page) || page->age > 0 ||
-				(!page->buffers && page_count(page) > 1) ||
+		if (PageReferenced(page) || (!page->buffers && page_count(page) > 1) ||
 				page_ramdisk(page)) {
 			del_page_from_inactive_dirty_list(page);
 			add_page_to_active_list(page);
+			page->age = PAGE_AGE_START;
 			continue;
 		}
 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24  3:47 [RFC] Optimization for use-once pages Daniel Phillips
@ 2001-07-24 12:38 ` jlnance
  2001-07-24 16:56   ` Rik van Riel
  2001-07-24 16:48 ` Linus Torvalds
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 58+ messages in thread
From: jlnance @ 2001-07-24 12:38 UTC (permalink / raw)
  To: linux-kernel

On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:

> So I decided to look for a new way of approaching the use-once problem 
> that would easily integrated with our current approach.   What I came 
> up with is pretty simple: instead of starting a newly allocated page on 
> the active ring, I start it on the inactive queue with an age of zero. 
> Then, any time generic_file_read or write references a page, if its 
> age is zero, set its age to START_AGE and mark it as unreferenced.

This is wonderfully simple and elegant.

Jim


* Re: [RFC] Optimization for use-once pages
  2001-07-24  3:47 [RFC] Optimization for use-once pages Daniel Phillips
  2001-07-24 12:38 ` jlnance
@ 2001-07-24 16:48 ` Linus Torvalds
  2001-07-24 17:04   ` Rik van Riel
  2001-07-24 20:24   ` [RFC] Optimization for use-once pages Patrick Dreker
  2001-07-24 20:27 ` Marcelo Tosatti
  2001-07-25  2:31 ` Marcelo Tosatti
  3 siblings, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2001-07-24 16:48 UTC (permalink / raw)
  To: phillips, linux-kernel

Hey, this looks _really_ nice. I never liked the special-cases that you
removed (drop_behind in particular), and I have to say that the new code
looks a lot saner, even without your extensive description and timing
analysis.

Please people, test this out extensively - I'd love to integrate it, but
while it looks very sane I'd really like to hear of different peoples
reactions to it under different loads.

		Linus



* Re: [RFC] Optimization for use-once pages
  2001-07-24 12:38 ` jlnance
@ 2001-07-24 16:56   ` Rik van Riel
  2001-07-25  0:04     ` Daniel Phillips
  2001-07-25  0:43     ` Anton Altaparmakov
  0 siblings, 2 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 16:56 UTC (permalink / raw)
  To: jlnance; +Cc: linux-kernel

On Tue, 24 Jul 2001 jlnance@intrex.net wrote:
> On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:
>
> > So I decided to look for a new way of approaching the use-once problem
> > that would easily integrated with our current approach.   What I came
> > up with is pretty simple: instead of starting a newly allocated page on
> > the active ring, I start it on the inactive queue with an age of zero.
> > Then, any time generic_file_read or write references a page, if its
> > age is zero, set its age to START_AGE and mark it as unreferenced.
>
> This is wonderfully simple and ellegant.

It's a tad incorrect, though.

If a page gets 2 1-byte reads in a microsecond, with this
patch it would get promoted to the active list, even though
it's really only used once.

Preferably we'd want to have a "new" list of, say, 5% maximum
of RAM size, where all references to a page are ignored. Any
references made after the page falls off that list would be
counted for page replacement purposes.
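Rik's suggested refinement might look, very roughly, like the 
following user-space sketch, where the tiny capacity stands in for the 
"5% of RAM" figure and all names are invented:

```c
#include <assert.h>
#include <stddef.h>

#define NEW_LIST_CAP 2	/* stand-in for "5% maximum of RAM size" */

struct np_page {
	int on_new_list;
	int counted_refs;	/* references that count for replacement */
};

static struct np_page *new_fifo[NEW_LIST_CAP];
static int new_n;

/* Add a page to the "new" FIFO.  Returns the page that falls off the
 * far end, if any; from then on its references start counting. */
static struct np_page *new_list_add(struct np_page *p)
{
	struct np_page *dropped = NULL;

	if (new_n == NEW_LIST_CAP) {
		dropped = new_fifo[0];
		dropped->on_new_list = 0;
		for (int i = 1; i < new_n; i++)
			new_fifo[i - 1] = new_fifo[i];
		new_n--;
	}
	p->on_new_list = 1;
	new_fifo[new_n++] = p;
	return dropped;
}

static void reference(struct np_page *p)
{
	if (!p->on_new_list)	/* a burst of "first" references */
		p->counted_refs++; /* while still new is ignored */
}
```

Under this model the two 1-byte reads in a microsecond land while the 
page is still on the new list, so they are not counted as a second use.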

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [RFC] Optimization for use-once pages
  2001-07-24 16:48 ` Linus Torvalds
@ 2001-07-24 17:04   ` Rik van Riel
  2001-07-24 18:14     ` Daniel Phillips
  2001-07-24 20:24   ` [RFC] Optimization for use-once pages Patrick Dreker
  1 sibling, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 17:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: phillips, linux-kernel

On Tue, 24 Jul 2001, Linus Torvalds wrote:

> Hey, this looks _really_ nice. I never liked the special-cases
> that you removed (drop_behind in particular), and I have to say
> that the new code looks a lot saner, even without your extensive
> description and timing analysis.

Fully agreed, drop_behind is an ugly hack.  The sooner
it dies the happier I am ;)

> Please people, test this out extensively - I'd love to integrate
> it, but while it looks very sane I'd really like to hear of
> different peoples reactions to it under different loads.

The one thing which has always come up in LRU/k and 2Q
papers is that the "first reference" can really be a
series of references in a very short time.

Counting only the very first reference will fail if we
do eg. sequential IO with non-page aligned read() calls,
which doesn't look like it's too uncommon.

In order to prevent this from happening, either the system
counts all first references in a short timespan (complex to
implement) or it has the new pages on a special - small fixed
size - page list and all references to the page while on that
list are ignored.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [RFC] Optimization for use-once pages
  2001-07-24 17:04   ` Rik van Riel
@ 2001-07-24 18:14     ` Daniel Phillips
  2001-07-24 18:15       ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel Phillips @ 2001-07-24 18:14 UTC (permalink / raw)
  To: Rik van Riel, Linus Torvalds; +Cc: phillips, linux-kernel

On Tuesday 24 July 2001 19:04, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Linus Torvalds wrote:
> > Hey, this looks _really_ nice. I never liked the special-cases
> > that you removed (drop_behind in particular), and I have to say
> > that the new code looks a lot saner, even without your extensive
> > description and timing analysis.
>
> Fully agreed, drop_behind is an ugly hack.  The sooner
> it dies the happier I am ;)
>
> > Please people, test this out extensively - I'd love to integrate
> > it, but while it looks very sane I'd really like to hear of
> > different peoples reactions to it under different loads.
>
> The one thing which has always come up in LRU/k and 2Q
> papers is that the "first reference" can really be a
> series of references in a very short time.

Yes, I thought about that but decided to try to demonstrate the
concept in its simplest form, and if things worked out, go ahead
and try to refine it.

Memory-mapped files have to be handled too.  One possible way to
go at it is to do the test not against the current page being
handled by generic_* but against the page already on the head of
the inactive_dirty list, at the time the *next* page is queued.
This introduces a slight delay, time enough for several programmed
IO operations to complete.  It will also work for mmap.  As a
bonus, the code might even get cleaner because all the use_once
tests are gathered together into a single place.

> Counting only the very first reference will fail if we
> do eg. sequential IO with non-page aligned read() calls,
> which doesn't look like it's too uncommon.

Yep.  We should also look at some statistics.  So far I've just
used a single number: execution time.

> In order to prevent this from happening, either the system
> counts all first references in a short timespan (complex to
> implement) or it has the new pages on a special - small fixed
> size - page list and all references to the page while on that
> list are ignored.

Yes, those are both possibilities.

--
Daniel


* Re: [RFC] Optimization for use-once pages
  2001-07-24 18:14     ` Daniel Phillips
@ 2001-07-24 18:15       ` Rik van Riel
  2001-07-24 19:41         ` Is /dev/epoll scheduled for official inclusion anytime soon? David E. Weekly
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 18:15 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

On Tue, 24 Jul 2001, Daniel Phillips wrote:
> On Tuesday 24 July 2001 19:04, Rik van Riel wrote:

> Memory-mapped files have to be handled too.

> > In order to prevent this from happening, either the system
> > counts all first references in a short timespan (complex to
> > implement) or it has the new pages on a special - small fixed
> > size - page list and all references to the page while on that
> > list are ignored.
>
> Yes, those are both possibilities.

The special "new pages" list handles the mmap() situation
automatically.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [RFC] Optimization for use-once pages
  2001-07-24 23:09       ` Daniel Phillips
@ 2001-07-24 19:35         ` Rob Landley
  2001-07-25  6:10           ` Marcelo Tosatti
                             ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Rob Landley @ 2001-07-24 19:35 UTC (permalink / raw)
  To: Daniel Phillips, Marcelo Tosatti, Rik van Riel
  Cc: linux-kernel, Ben LaHaise, Mike Galbraith

On Tuesday 24 July 2001 19:09, Daniel Phillips wrote:
> On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Rik van Riel wrote:
> > > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > > Today's patch tackles the use-once problem, that is, the
> > > > > problem of
> > > >
> > > > Well, as I see it, the patch should remove the problem where
> > > > drop_behind() deactivates pages of a readahead window even if
> > > > some of those pages are not "used-once" pages, right ?
> > > >
> > > > I just want to make sure the performance improvements you're
> > > > seeing are caused by the fix of this _particular_ problem.
> > >
> > > Fully agreed.
> > >
> > > Especially since it was a one-liner change from worse
> > > performance to better performance (IIRC) it would be
> > > nice to see exactly WHY the system behaves the way it
> > > does.  ;)
> >
> > Yes.
> >
> > Daniel's patch adds "drop behind" behaviour (that is, adding
> > pages to the inactive dirty list) to swapcache pages.
> >
> > This is a _new_ thing, and I would like to know how that is changing
> > the whole VM behaviour..
>
> Yes, absolutely.  I knew I was doing that but I also thought it wouldn't
> hurt.  Rather it's part of a transition towards a full unification of
> the file and swap paths.

Stupid question time:

So basically (lemme get this straight):

All VM allocation in 2.4 has become some variant of mmap.  Either we're 
mmaping in the executable and libraries when we exec, we're mmaping in a 
file, or we're mmaping in the swap file/block device which is how we do 
anonymous page allocations.

And this is why 2.4 keeps wanting to allocate swap space for pages that are 
still in core.  And why the 2.4 VM keeps going ballistic on boxes that have 
more physical DRAM than they have swap space.  (I.E. the 2X swap actually 
being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram 
box actually hurting matters by confusing the VM if swap is enabled AT ALL, 
since it clashes conceptually.)

So getting swap to exclude in-core pages (so they don't take up space in the 
swap file) would therefore be conceptually similar to implementing "files 
with holes in them" support like EXT2 has for these swap file mmaps.

And the argument that doing so might "fragment the swap file", while true, is 
actually a bit of a red herring, because the real objection is that it 
unnecessarily complicates an otherwise clean and straightforward concept.

Am I even close here?  (That was the stupid question I mentioned earlier, 
knew I'd get around to it...)  Reading the code's a bit harder when you don't 
have any sort of conceptual understanding of what the heck it's actually 
trying to do.

> Basically, I just left that part of it hanging.  If you check my
> detailed timings you'll see all my test runs have swaps=0, basically
> because I didn't really want to hear about it just then ;-)
>
> I was pretty sure it could be fixed if it broke.

Just thought I'd say that personally, I think your greatest contribution in 
this whole thread is that somebody finally clearly explained how the new VM 
works, using small words.  The actual improvements are just gravy.  Yes I'm 
biased.  :)

I don't suppose we could get some variant of your initial post into 
/Documentation/vm/HowItActuallyWorks.txt?  (I take it the biggest "detail" 
you glossed over was the separation of memory into zones?)

> --
> Daniel

Rob

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Is /dev/epoll scheduled for official inclusion anytime soon?
  2001-07-24 18:15       ` Rik van Riel
@ 2001-07-24 19:41         ` David E. Weekly
  2001-07-24 20:05           ` Rik van Riel
  2001-07-24 20:26           ` linux partitioning in IA64 hiufungeric.tse
  0 siblings, 2 replies; 58+ messages in thread
From: David E. Weekly @ 2001-07-24 19:41 UTC (permalink / raw)
  To: linux-kernel

Hello all.

I've been playing around with Davide Libenzi's "/dev/epoll" patch and have
been very impressed by the performance figures (see his paper at
http://www.xmailserver.org/linux-patches/nio-improve.html for some
numbers) - it seems to exhibit really excellent scaling.

Given the sheer utility of using /dev/epoll in a large-scale server, are
there any plans to roll it into the mainline kernel at any point? If not,
why not / is there an equivalent scheduled for inclusion?

Clearly, IMHO, Linux needs something that can scale as well as BSD's kqueue
and Solaris's /dev/poll. /dev/epoll seems to be an excellent answer.


Sincerely,
 David E. Weekly

PS: Cheers to Anton Altaparmakov's work on NTFS: in 2.4.7, we now have
rather robust read/write support for NT filesystems!



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Is /dev/epoll scheduled for official inclusion anytime soon?
  2001-07-24 19:41         ` Is /dev/epoll scheduled for official inclusion anytime soon? David E. Weekly
@ 2001-07-24 20:05           ` Rik van Riel
  2001-07-24 20:26           ` linux partitioning in IA64 hiufungeric.tse
  1 sibling, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 20:05 UTC (permalink / raw)
  To: David E. Weekly; +Cc: linux-kernel

On Tue, 24 Jul 2001, David E. Weekly wrote:

> I've been playing around with Davide Libenzi's "/dev/epoll"

> Given the sheer utility of using /dev/epoll in a largescale
> server, are there any plans to roll it into the mainline kernel
> at any point?

So why don't you ask Davide if he has any plans to submit
it for inclusion into the kernel? ;)

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 16:48 ` Linus Torvalds
  2001-07-24 17:04   ` Rik van Riel
@ 2001-07-24 20:24   ` Patrick Dreker
  2001-07-24 20:32     ` Rik van Riel
                       ` (2 more replies)
  1 sibling, 3 replies; 58+ messages in thread
From: Patrick Dreker @ 2001-07-24 20:24 UTC (permalink / raw)
  To: Linus Torvalds, phillips, linux-kernel

Hello...

On Tuesday, 24 July 2001 18:48, Linus Torvalds wrote:
> Please people, test this out extensively - I'd love to integrate it, but
> while it looks very sane I'd really like to hear of different peoples
> reactions to it under different loads.
I just decided to give this patch a try, as I have written a little 
application which does some statistics over traces dumped by another program 
by mmap()ing a large file and reading it sequentially. The trace file to be 
analyzed is about 240 megs in size and consists of records each 249 bytes 
long. The analysis program opens and then mmap()s the trace file, doing some 
small calculations on the data (basically it adds up fields from the records 
to get overall values).

I have tested this on my Athlon 600 with 128 Megs of RAM, and it does not 
make any difference whether I use plain 2.4.7 or 2.4.5-use-once. The program 
always takes about 1 minute 6 seconds (+- 2 seconds) to complete, and the 
machine starts swapping out stuff (thus I have omitted further stats like 
vmstat output). I have just taken another look into my program to verify it 
does not do something silly, like keeping old data around, but the program 
cycle is always the same: copy a record from the mmap into a struct, perform 
analysis, and copy the next record. The struct is always reused for the 
next record (so there is only one struct live at any time).

I can do further tests, if someone asks me to. I could even modify the 
analysis program to check changes in behaviour...

> 		Linus

Patrick
-- 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* linux partitioning in IA64
  2001-07-24 19:41         ` Is /dev/epoll scheduled for official inclusion anytime soon? David E. Weekly
  2001-07-24 20:05           ` Rik van Riel
@ 2001-07-24 20:26           ` hiufungeric.tse
  2001-07-25  9:29             ` Mike A. Harris
  1 sibling, 1 reply; 58+ messages in thread
From: hiufungeric.tse @ 2001-07-24 20:26 UTC (permalink / raw)
  To: linux-kernel

Hi,

I am trying to install Seawolf (Linux IA-64) on an IA-64 machine. For 
partitioning I went through several trials without success (total disk space 
is 10 GB). After all the initial settings, I restarted the machine into EFI 
(the boot manager). It can't boot into Linux, and EFI can't recognize the 
files the installation CD initially saved on the hard disk. Are there special 
ways or tricks to partition the disk and make it work?

thanks in advance
hiu fung

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24  3:47 [RFC] Optimization for use-once pages Daniel Phillips
  2001-07-24 12:38 ` jlnance
  2001-07-24 16:48 ` Linus Torvalds
@ 2001-07-24 20:27 ` Marcelo Tosatti
  2001-07-24 22:05   ` Rik van Riel
  2001-07-24 22:22   ` Daniel Phillips
  2001-07-25  2:31 ` Marcelo Tosatti
  3 siblings, 2 replies; 58+ messages in thread
From: Marcelo Tosatti @ 2001-07-24 20:27 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Rik van Riel, Ben LaHaise, Mike Galbraith



On Tue, 24 Jul 2001, Daniel Phillips wrote:

> Today's patch tackles the use-once problem, that is, the problem of
> how to identify and discard pages containing data likely to be used 
> only once in a long time, while retaining pages that are used more 
> often.
> 
> I'll try to explain not only what I did, but the process I went
> through to arrive at this particular approach.  This requires some 
> background.

Well, as I see it, the patch should remove the problem where drop_behind()
deactivates pages of a readahead window even if some of those pages are
not "used-once" pages, right ? 

I just want to make sure the performance improvements you're seeing are
caused by the fix of this _particular_ problem.

If we knew the amount of non-used-once pages which drop_behind() is
deactivating under _your_ tests, we could be absolutely sure about the
problem. 




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:24   ` [RFC] Optimization for use-once pages Patrick Dreker
@ 2001-07-24 20:32     ` Rik van Riel
  2001-07-24 22:16       ` Patrick Dreker
  2001-07-24 20:33     ` Linus Torvalds
  2001-07-25  0:18     ` Daniel Phillips
  2 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 20:32 UTC (permalink / raw)
  To: Patrick Dreker; +Cc: Linus Torvalds, phillips, linux-kernel

On Tue, 24 Jul 2001, Patrick Dreker wrote:

	[snip program with mmap()]

> I have tested this on my Athlon 600 with 128 Megs of RAM, and it
> does not make any difference whether I use plain 2.4.7 or
> 2.4.5-use-once.

As expected. Only programs using generic_file_{read,write}()
will be impacted at the moment.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:24   ` [RFC] Optimization for use-once pages Patrick Dreker
  2001-07-24 20:32     ` Rik van Riel
@ 2001-07-24 20:33     ` Linus Torvalds
  2001-07-25  1:25       ` Daniel Phillips
  2001-07-25  0:18     ` Daniel Phillips
  2 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2001-07-24 20:33 UTC (permalink / raw)
  To: Patrick Dreker; +Cc: phillips, linux-kernel


On Tue, 24 Jul 2001, Patrick Dreker wrote:
>
> I just decided to give this patch a try, as I have written a little
> application which does some statistics over traces dumped by another program
> by mmap()ing a large file and reading it sequentially. The trace file to be
> analyzed is about 240 megs in size and consists of records each 249 bytes
> long. The analysis program opens and then mmap()s the trace file, doing some
> small calculations on the data (basically it adds up fields from the records
> to get overall values).

Note that the patch won't do much anything for the mmap() case - the VM
doesn't consider that "use once" anyway, and trying to make it do so would
be hard, I suspect. It's just very nasty to try to figure out whether the
user has touched the page a single time or millions of times...

We do have the "accessed" bit, but in order to get any access patterns
we'd have to scan the page tables a _lot_ more than we do now. Right now
we do it only under memory pressure, and only about once per memory cycle,
which is not really even remotely enough to get anything more than "yes,
the user did touch the page"..
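Linus's point can be seen in a toy model (the names here are illustrative, not the real pte helpers): the hardware accessed bit only records "touched since last cleared", so a scan done roughly once per memory cycle cannot distinguish a single access from millions.

```c
/* Toy model of the accessed bit: the hardware sets a single bit on
 * any access, so an infrequent test-and-clear scan yields only a
 * yes/no answer, never a frequency.  Illustrative names only. */

#define NPAGES 4

static int accessed[NPAGES];	/* stand-in for the pte accessed bits */

/* The "hardware" sets the bit on any access, once or a million times. */
static void touch(int page)
{
	accessed[page] = 1;
}

/* The infrequent VM scan: test-and-clear, yielding only a yes/no. */
static int test_and_clear_accessed(int page)
{
	int was = accessed[page];

	accessed[page] = 0;
	return was;
}
```

A page touched once and a page touched a million times look identical to the scan, which is exactly why recovering access patterns would require scanning the page tables far more often.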

In order for mmap() to make a difference with the new code, we could do
something like look at the adjacent VM pages on page-in. That, together
with VM hints, might well be able to do similar things for mmap. But at a
noticeably higher cost than what the current code has.

		Linus


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 22:05   ` Rik van Riel
@ 2001-07-24 20:53     ` Marcelo Tosatti
  2001-07-24 22:27       ` Rik van Riel
                         ` (2 more replies)
  2001-07-24 22:41     ` Daniel Phillips
  1 sibling, 3 replies; 58+ messages in thread
From: Marcelo Tosatti @ 2001-07-24 20:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel, Ben LaHaise, Mike Galbraith



On Tue, 24 Jul 2001, Rik van Riel wrote:

> On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> >
> > > Today's patch tackles the use-once problem, that is, the problem of
> >
> > Well, as I see it, the patch should remove the problem where
> > drop_behind() deactivates pages of a readahead window even if
> > some of those pages are not "used-once" pages, right ?
> >
> > I just want to make sure the performance improvements you're
> > seeing are caused by the fix of this _particular_ problem.
> 
> Fully agreed.
> 
> Especially since it was a one-liner change from worse
> performance to better performance (IIRC) it would be
> nice to see exactly WHY the system behaves the way it
> does.  ;)

Yes. 

Daniel's patch adds "drop behind" behaviour (that is, adding pages
to the inactive dirty list) to swapcache pages. 

This is a _new_ thing, and I would like to know how that is changing the
whole VM behaviour..


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:27 ` Marcelo Tosatti
@ 2001-07-24 22:05   ` Rik van Riel
  2001-07-24 20:53     ` Marcelo Tosatti
  2001-07-24 22:41     ` Daniel Phillips
  2001-07-24 22:22   ` Daniel Phillips
  1 sibling, 2 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 22:05 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Daniel Phillips, linux-kernel, Ben LaHaise, Mike Galbraith

On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Daniel Phillips wrote:
>
> > Today's patch tackles the use-once problem, that is, the problem of
>
> Well, as I see it, the patch should remove the problem where
> drop_behind() deactivates pages of a readahead window even if
> some of those pages are not "used-once" pages, right ?
>
> I just want to make sure the performance improvements you're
> seeing are caused by the fix of this _particular_ problem.

Fully agreed.

Especially since it was a one-liner change from worse
performance to better performance (IIRC) it would be
nice to see exactly WHY the system behaves the way it
does.  ;)

Reading a bunch of 2Q, LRU/k, ... papers and thinking
about the problem very carefully should help us a bit
in this.  Lots of researchers have already looked into
this particular problem in quite a lot of detail.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:32     ` Rik van Riel
@ 2001-07-24 22:16       ` Patrick Dreker
  2001-07-24 22:25         ` Rik van Riel
  0 siblings, 1 reply; 58+ messages in thread
From: Patrick Dreker @ 2001-07-24 22:16 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, phillips, linux-kernel

On Tuesday, 24 July 2001 22:32, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Patrick Dreker wrote:
>
> 	[snip program with mmap()]
>
> > I have tested this on my Athlon 600 with 128 Megs of RAM, and it
> > does not make any difference whether I use plain 2.4.7 or
> > 2.4.5-use-once.
>
> As expected. Only programs using generic_file_{read,write}()
> will be impacted at the moment.
D'oh... got something wrong there it seems :-(
Anyways, I still have some old routines in the program doing the same via 
read(). I have tried that (although the program is pretty silly, as it reads 
the 240 megs in 4-byte chunks.... the call overhead is probably the bigger 
problem there...) and it improved the runtime by approximately 20% using the 
use_once patch.

linux-2.4.7:
22.300u 135.310s 2:41.38 97.6%  0+0k 0+0io 110pf+0w

linux-2.4.5-use_once:
14.980u 108.870s 2:09.79 95.4%  0+0k 0+0io 200pf+0w

Both measurements taken after reboot and while running KDE2.2
As stated: I am still willing to do further experiments... (read()ing larger 
chunks at once?)

-- 
Patrick Dreker

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:27 ` Marcelo Tosatti
  2001-07-24 22:05   ` Rik van Riel
@ 2001-07-24 22:22   ` Daniel Phillips
  1 sibling, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-24 22:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel, Rik van Riel, Ben LaHaise, Mike Galbraith

On Tuesday 24 July 2001 22:27, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > Today's patch tackles the use-once problem, that is, the problem of
> > how to identify and discard pages containing data likely to be used
> > only once in a long time, while retaining pages that are used more
> > often.
> >
> > I'll try to explain not only what I did, but the process I went
> > through to arrive at this particular approach.  This requires some
> > background.
>
> Well, as I see it, the patch should remove the problem where
> drop_behind() deactivates pages of a readahead window even if some of
> those pages are not "used-once" pages, right ?

Yes, that was a specific goal.

> I just want to make sure the performance improvements you're seeing
> are caused by the fix of this _particular_ problem.

Of course I chose the test load to show the benefit clearly, but I
also tried to avoid disturbing any of the existing cases that are
working well.  I did the verification and audit in a hurry, so I
wouldn't be surprised at all if I missed something, but equally I feel
that picking those up is more a matter of adjustment than of a change
in viewpoint.

> If we knew the amount of non-used-once pages which drop_behind() is
> deactivating under _your_ tests, we could be absolutely sure about
> the problem.

Yes, for sure, it's time to use the statistics weapon.  I guess we
need to add in a couple more sensor points.  An obvious place is in
check_used_once.

--
Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 22:16       ` Patrick Dreker
@ 2001-07-24 22:25         ` Rik van Riel
  2001-07-25  0:27           ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 22:25 UTC (permalink / raw)
  To: Patrick Dreker; +Cc: Linus Torvalds, phillips, linux-kernel

On Wed, 25 Jul 2001, Patrick Dreker wrote:

> Anyways, I still have some old routines in the program doing the
> same via read(). I have tried that (although the program is
> pretty silly, as it reads the 240 megs in 4-byte chunks.... the
> call overhead is probably the bigger problem there...) and it
> improved the runtime by approximately 20% using the use_once
> patch.
>
> linux-2.4.7:
> 22.300u 135.310s 2:41.38 97.6%  0+0k 0+0io 110pf+0w
>
> linux-2.4.5-use_once:
> 14.980u 108.870s 2:09.79 95.4%  0+0k 0+0io 200pf+0w

COOL ....

> Both measurements taken after reboot and while running KDE2.2 As
> stated: I am still willing to do further experiments...
> (read()ing larger chunks at once?)

Could you try reading 4kB chunks at once ?

That way you should truly only touch each page once.

(using smaller chunks, or chunks which aren't a
multiple of 4kB should break the current code)
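For reference, the read() loop Rik is asking for might look like this — a minimal userspace sketch; the 4 kB buffer size matches i386 PAGE_SIZE, and the function name is made up for illustration:

```c
#include <fcntl.h>
#include <unistd.h>

/* Minimal sketch of Rik's suggestion: read the file in page-sized
 * (4 kB) chunks so each page-cache page is touched by exactly one
 * read() call.  Returns total bytes read, or -1 on error. */
static long read_in_page_chunks(const char *path)
{
	char buf[4096];		/* one PAGE_SIZE chunk on i386 */
	long total = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof buf)) > 0)
		total += n;
	close(fd);
	return n < 0 ? -1 : total;
}
```

With 4-byte reads, by contrast, each page is referenced about a thousand times, which is what could trip up a heuristic that counts repeat references.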

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:53     ` Marcelo Tosatti
@ 2001-07-24 22:27       ` Rik van Riel
  2001-07-24 23:09       ` Daniel Phillips
  2001-07-25  5:12       ` Andrew Morton
  2 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 22:27 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Daniel Phillips, linux-kernel, Ben LaHaise, Mike Galbraith

On Tue, 24 Jul 2001, Marcelo Tosatti wrote:

> Daniel's patch adds "drop behind" behaviour (that is, adding
> pages to the inactive dirty list) to swapcache pages.
>
> This is a _new_ thing, and I would like to know how that is
> changing the whole VM behaviour..

It means that swap cache pages get about 1 second
to be used (in theory, under load) and after that
they can get evicted.

Should show up nicely with your VM stats patch...

Let's try to get the VM statistics patch merged;
after that we can talk about this stuff without
having to rely on our imagination as much as we
do now ;))

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 22:05   ` Rik van Riel
  2001-07-24 20:53     ` Marcelo Tosatti
@ 2001-07-24 22:41     ` Daniel Phillips
  1 sibling, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-24 22:41 UTC (permalink / raw)
  To: Rik van Riel, Marcelo Tosatti; +Cc: linux-kernel, Ben LaHaise, Mike Galbraith

On Wednesday 25 July 2001 00:05, Rik van Riel wrote:
> On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > Today's patch tackles the use-once problem, that is, the problem
> > > of
> >
> > Well, as I see it, the patch should remove the problem where
> > drop_behind() deactivates pages of a readahead window even if
> > some of those pages are not "used-once" pages, right ?
> >
> > I just want to make sure the performance improvements you're
> > seeing are caused by the fix of this _particular_ problem.
>
> Fully agreed.
>
> Especially since it was a one-liner change from worse
> performance to better performance (IIRC) it would be
> nice to see exactly WHY the system behaves the way it
> does.  ;)

Oh, it wasn't an accident, I knew what I was trying to achieve.
It's just that I didn't immediately understand all the curlicues in
the page life cycles just from staring at the code.  I had to see
the machine moving first.  It was very instructive to see what
happened on reverting to a fifo strategy.  It might even be a good
idea to put a proc hook in there to allow on-the-fly dumbing down
of the lru machinery.  That way we can measure system behaviour
against a baseline without having to go through the
compile-reboot-bench cycle every time, which eats a major amount
of time.

> Reading a bunch of 2Q, LRU/k, ... papers and thinking
> about the problem very carefully should help us a bit
> in this.  Lots of researchers have already looked into
> this particular problem in quite a lot of detail.

Yep, unfortunately the subject of memory management in operating
systems seems to have received a lot less attention than memory
management in databases.

--
Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:53     ` Marcelo Tosatti
  2001-07-24 22:27       ` Rik van Riel
@ 2001-07-24 23:09       ` Daniel Phillips
  2001-07-24 19:35         ` Rob Landley
  2001-07-25  5:12       ` Andrew Morton
  2 siblings, 1 reply; 58+ messages in thread
From: Daniel Phillips @ 2001-07-24 23:09 UTC (permalink / raw)
  To: Marcelo Tosatti, Rik van Riel; +Cc: linux-kernel, Ben LaHaise, Mike Galbraith

On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> On Tue, 24 Jul 2001, Rik van Riel wrote:
> > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > Today's patch tackles the use-once problem, that is, the
> > > > problem of
> > >
> > > Well, as I see it, the patch should remove the problem where
> > > drop_behind() deactivates pages of a readahead window even if
> > > some of those pages are not "used-once" pages, right ?
> > >
> > > I just want to make sure the performance improvements you're
> > > seeing are caused by the fix of this _particular_ problem.
> >
> > Fully agreed.
> >
> > Especially since it was a one-liner change from worse
> > performance to better performance (IIRC) it would be
> > nice to see exactly WHY the system behaves the way it
> > does.  ;)
>
> Yes.
>
> Daniel's patch adds "drop behind" behaviour (that is, adding
> pages to the inactive dirty list) to swapcache pages.
>
> This is a _new_ thing, and I would like to know how that is changing
> the whole VM behaviour..

Yes, absolutely.  I knew I was doing that but I also thought it wouldn't
hurt.  Rather it's part of a transition towards a full unification of 
the file and swap paths.

Basically, I just left that part of it hanging.  If you check my 
detailed timings you'll see all my test runs have swaps=0, basically
because I didn't really want to hear about it just then ;-)

I was pretty sure it could be fixed if it broke.

--
Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 16:56   ` Rik van Riel
@ 2001-07-25  0:04     ` Daniel Phillips
  2001-07-25  0:43     ` Anton Altaparmakov
  1 sibling, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25  0:04 UTC (permalink / raw)
  To: Rik van Riel, jlnance; +Cc: linux-kernel

On Tuesday 24 July 2001 18:56, Rik van Riel wrote:
> On Tue, 24 Jul 2001 jlnance@intrex.net wrote:
> > On Tue, Jul 24, 2001 at 05:47:30AM +0200, Daniel Phillips wrote:
> > > So I decided to look for a new way of approaching the use-once
> > > problem that would easily integrated with our current approach.  
> > > What I came up with is pretty simple: instead of starting a newly
> > > allocated page on the active ring, I start it on the inactive
> > > queue with an age of zero. Then, any time generic_file_read or
> > > write references a page, if its age is zero, set its age to
> > > START_AGE and mark it as unreferenced.
> >
> > This is wonderfully simple and elegant.
>
> It's a tad incorrect, though.
>
> If a page gets 2 1-byte reads in a microsecond, with this
> patch it would get promoted to the active list, even though
> it's really only used once.

Yes, I expected that to be a relatively rare case however - block 
aligned transfers are much more common, and when we have multiple 
blocks per page we may well want the page to go onto the active list 
because there may be quite a delay between successive block accesses.  
From there it will be aged normally, not such a horrible thing.

To fix this properly I suspect that just a few microseconds is enough 
delay to pick up the temporal groupings you're talking about.  That's 
not hard to achieve.
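For concreteness, the rule quoted above can be sketched like this (START_AGE's value and the struct fields are illustrative, not the actual patch code): a new page starts on the inactive queue with age 0; the first reference gives it a real age but leaves it "unreferenced", so a page read exactly once simply ages back off the queue, while a later second reference marks it for promotion.

```c
/* Sketch of the use-once rule with made-up names; not the real code.
 * Pages read exactly once drift back off the inactive queue; pages
 * referenced again get promoted to the active list. */

#define START_AGE 2

struct page {
	int age;
	int referenced;
	int active;
};

static void add_page_to_inactive(struct page *page)
{
	page->age = 0;
	page->referenced = 0;
	page->active = 0;
}

/* Called on each generic_file_{read,write} reference to the page. */
static void touch_page(struct page *page)
{
	if (page->age == 0) {
		page->age = START_AGE;	/* first use: age it, but... */
		page->referenced = 0;	/* ...don't count it as a reuse */
	} else {
		page->referenced = 1;	/* genuine reuse */
	}
}

/* Page-aging pass: promote reused pages, age the rest toward eviction. */
static void age_page(struct page *page)
{
	if (page->referenced) {
		page->active = 1;
		page->referenced = 0;
	} else if (page->age > 0) {
		page->age--;
	}
}
```

The two 1-byte reads Rik describes both land in touch_page() before any aging pass runs, so the second one sets the referenced flag and the page gets promoted despite being logically used once.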

> Preferably we'd want to have a "new" list of, say, 5% maximum
> of RAM size, where all references to a page are ignored. Any
> references made after the page falls off that list would be
> counted for page replacement purposes.

At that size you'd run a real risk of missing the tell-tale multiple 
references that mark a page as frequently used.  Think about metadata 
here (right now, that usually just means directory pages, but later... 
who knows).  Once somebody has looked at two files in a directory - 
while the page sits on the "ignore" list - chances are very good 
they'll look at a third, but perhaps not before it drops off the south 
end of the inactive queue.

Well, theorizing can only get us so far before we need to take actual 
measurements.

--
Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:24   ` [RFC] Optimization for use-once pages Patrick Dreker
  2001-07-24 20:32     ` Rik van Riel
  2001-07-24 20:33     ` Linus Torvalds
@ 2001-07-25  0:18     ` Daniel Phillips
  2 siblings, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25  0:18 UTC (permalink / raw)
  To: Patrick Dreker, Linus Torvalds, phillips, linux-kernel

On Tuesday 24 July 2001 22:24, Patrick Dreker wrote:
> I just decided to give this patch a try, as I have written a little
> application which does some statistics over traces dumped by another
> program by mmap()ing a large file and reading it sequentially. The
> trace file to be analyzed is about 240 megs in size and consists of
> records each 249 bytes long. The analysis program opens and the
> mmap()s the trace file doing some small calculations on the data
> (basically it adds up fields from the records to get overall values).
>
> I have tested this on my Athlon 600 with 128 Megs of RAM, and it does
> not make any difference whether I use plain 2.4.7 or 2.4.5-use-once.
> The program always takes about 1 minute 6 seconds (+- 2 seconds) to
> complete, and the machine starts swapping out stuff

In this case that's an excellent result:

  - The optimization doesn't include mmap's (yet)
  - It doesn't break swap.  (Good, I didn't check that myself)

This is a case of "no news is good news".

> (thus I have
> omitted further stats like vmstat output). I have just taken another
> look into my program to verify it does not do something silly, like
> keeping old data around, but the program cycle is always the same:
> copy a record from the mmap into a struct, perform analysis, and copy
> next record. The struct is always reused for the next record (so
> there is only one struct live at any time).
>
> I can do further tests, if someone asks me to. I could even modify
> the analysis program to check changes in behaviour...

(Already read your mail where you picked up the 20% improvement, 
thanks, it warms my heart:-)

--
Daniel


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 22:25         ` Rik van Riel
@ 2001-07-25  0:27           ` Linus Torvalds
  2001-07-25  8:20             ` Patrick Dreker
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2001-07-25  0:27 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Patrick Dreker, phillips, linux-kernel


On Tue, 24 Jul 2001, Rik van Riel wrote:
>
> (using smaller chunks, or chunks which aren't a
> multiple of 4kB should break the current code)

Maybe Patrick is using stdio? In that case, the small chunks will be
coalesced in the library layer anyway, which might explain the lack of
breakage.

Of course, if it improved performance even when "broken", that would be
even better. I like those kinds of algorithms.

		Linus


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 16:56   ` Rik van Riel
  2001-07-25  0:04     ` Daniel Phillips
@ 2001-07-25  0:43     ` Anton Altaparmakov
  2001-07-25  1:30       ` Daniel Phillips
  1 sibling, 1 reply; 58+ messages in thread
From: Anton Altaparmakov @ 2001-07-25  0:43 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, jlnance, linux-kernel

At 01:04 25/07/2001, Daniel Phillips wrote:
>At that size you'd run a real risk of missing the tell-tale multiple
>references that mark a page as frequently used.  Think about metadata
>here (right now, that usually just means directory pages, but later...
>who knows).

This is not actually implemented yet, but NTFS TNG will use the page cache 
to hold both the LCN (physical clusters) and MFT (on disk inodes) 
allocation bitmaps in addition to file and directory pages. (Admittedly the 
LCN case folds into the file pages in page cache one as the LCN bitmap is 
just stored inside the usual data of a file called $Bitmap, but the MFT 
case is more complex as it is in an additional attribute inside the file 
$MFT so the normal file read/write functions definitely cannot be used. The 
usual data here is the actual on disk inodes...)

Just FYI.

Anton


-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:33     ` Linus Torvalds
@ 2001-07-25  1:25       ` Daniel Phillips
  0 siblings, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25  1:25 UTC (permalink / raw)
  To: Linus Torvalds, Patrick Dreker; +Cc: phillips, linux-kernel

On Tuesday 24 July 2001 22:33, Linus Torvalds wrote:
> On Tue, 24 Jul 2001, Patrick Dreker wrote:
> > I just decided to give this patch a try, as I have written a little
> > application which does some statistics over traces dumped by
> > another program by mmap()ing a large file and reading it
> > sequentially. The trace file to be analyzed is about 240 megs in
> > size and consists of records each 249 bytes long. The analysis
> > program opens and then mmap()s the trace file, doing some small
> > calculations on the data (basically it adds up fields from the
> > records to get overall values).
>
> Note that the patch won't do much anything for the mmap() case - the
> VM doesn't consider that "use once" anyway, and trying to make it do
> so would be hard, I suspect. It's just very nasty to try to figure
> out whether the user has touched the page a single time or millions
> of times...
>
> We do have the "accessed" bit, but in order to get any access
> patterns we'd have to scan the page tables a _lot_ more than we do
> now. Right now we do it only under memory pressure, and only about
> once per memory cycle, which is not really even remotely enough to
> get anything more than "yes, the user did touch the page"..
>
> In order for mmap() to make a difference with the new code, we could
> do something like look at the adjacent VM pages on page-in. That,
> together with VM hints, might well be able to do similar things for
> mmap. But at a noticeably higher cost than what the current code has.

OK, the truth is, I decided to punt that problem on the assumption I'd 
be able to come up with something workable when the time came.  Now 
having thought about it a little longer I see it's not so easy.  It's 
time for some free-form thinking.

Maybe there's a relatively simple way.  For either a memory read or a 
write we'll see the fault.  At that point the page starts its death 
march down the inactive queue, and shortly after that the actual 
memory operation takes place.  Another short time later we clear the 
hardware referenced bit by some as-yet-undetermined means.  At the 
south end of the queue we check the hardware referenced bit and 
either reactivate the page or continue with the eviction process, as 
in the current patch.

Now I sense there are gotchas here that I'm not entirely on top of - 
the swap side of the memory manager being a little less than crystal 
clear to me at this point.  Let me start by guessing that it's not so 
easy to know which pte to look in for the referenced bit.  However, we 
could cheat and store the pte address in the struct page, or perhaps in 
the buffers, which happen to be still attached at the time.

Hmm, just a first idea, and admittedly ugly.  I think what I really 
have to do is some more code study.

--
Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-25  0:43     ` Anton Altaparmakov
@ 2001-07-25  1:30       ` Daniel Phillips
  2001-07-25 21:18         ` Steve Lord
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25  1:30 UTC (permalink / raw)
  To: Anton Altaparmakov; +Cc: Rik van Riel, jlnance, linux-kernel

On Wednesday 25 July 2001 02:43, Anton Altaparmakov wrote:
> At 01:04 25/07/2001, Daniel Phillips wrote:
> >At that size you'd run a real risk of missing the tell-tale multiple
> >references that mark a page as frequently used.  Think about
> > metadata here (right now, that usually just means directory pages,
> > but later... who knows).
>
> This is not actually implemented yet, but NTFS TNG will use the page
> cache to hold both the LCN (physical clusters) and MFT (on disk
> inodes) allocation bitmaps in addition to file and directory pages.
> (Admittedly the LCN case folds into the file pages in page cache one
> as the LCN bitmap is just stored inside the usual data of a file
> called $Bitmap, but the MFT case is more complex as it is in an
> additional attribute inside the file $MFT so the normal file
> read/write functions definitely cannot be used. The usual data here
> is the actual on disk inodes...)
>
> Just FYI.

Yes, not a surprise.  Plus, I was working on an experimental patch to 
put Ext2 index blocks into the page cache just before I got off on this 
use-once tangent.  Time to go back to that...

--
Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24  3:47 [RFC] Optimization for use-once pages Daniel Phillips
                   ` (2 preceding siblings ...)
  2001-07-24 20:27 ` Marcelo Tosatti
@ 2001-07-25  2:31 ` Marcelo Tosatti
  3 siblings, 0 replies; 58+ messages in thread
From: Marcelo Tosatti @ 2001-07-25  2:31 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel, Rik van Riel, Ben LaHaise, Mike Galbraith



On Tue, 24 Jul 2001, Daniel Phillips wrote:

> --- ../2.4.5.clean/mm/vmscan.c	Sat May 26 02:00:18 2001
> +++ ./mm/vmscan.c	Mon Jul 23 17:25:27 2001
> @@ -359,10 +359,10 @@
>  		}
>  
>  		/* Page is or was in use?  Move it to the active list. */
> -		if (PageReferenced(page) || page->age > 0 ||
> -				(!page->buffers && page_count(page) > 1)) {
> +		if (PageReferenced(page) || (!page->buffers && page_count(page) > 1)) {
>  			del_page_from_inactive_clean_list(page);
>  			add_page_to_active_list(page);
> +			page->age = PAGE_AGE_START;
>  			continue;
>  		}
>  
> @@ -462,11 +462,11 @@
>  		}
>  
>  		/* Page is or was in use?  Move it to the active list. */
> -		if (PageReferenced(page) || page->age > 0 ||
> -				(!page->buffers && page_count(page) > 1) ||
> +		if (PageReferenced(page) || (!page->buffers && page_count(page) > 1) ||
>  				page_ramdisk(page)) {
>  			del_page_from_inactive_dirty_list(page);
>  			add_page_to_active_list(page);
> +			page->age = PAGE_AGE_START;
>  			continue;
>  		}

That change will cause problems: we indicate that a page is young due to a
young mapped pte by increasing its age (at swap_out()).

With the current way of doing things, if you don't move higher-aged 
pages from the inactive lists to the active list you'll have severe 
problems because of that (the inactive list full of unfreeable pages 
due to mapped pte's).

I suggest you change

	if (!page->buffers && page_count(page) > 1) 
to 
	if ((page_count(page) - !!page->buffers) > 1)

at page_launder/reclaim_page to fix that problem. 

This way we end up moving all pages with additional references 
(ignoring the buffers) to the active list. 

We have to unmap all references to a page before being able to make it
reclaimable clean cache anyway, so there is no real point in keeping
those pages on the inactive list.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 20:53     ` Marcelo Tosatti
  2001-07-24 22:27       ` Rik van Riel
  2001-07-24 23:09       ` Daniel Phillips
@ 2001-07-25  5:12       ` Andrew Morton
  2001-07-25  6:33         ` Marcelo Tosatti
  2001-07-30 18:09         ` Marcelo Tosatti
  2 siblings, 2 replies; 58+ messages in thread
From: Andrew Morton @ 2001-07-25  5:12 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Rik van Riel, Daniel Phillips, linux-kernel, Ben LaHaise, Mike Galbraith

Marcelo Tosatti wrote:
> 
> Daniel's patch adds "drop behind" (that is, adding swapcache
> pages to the inactive dirty) behaviour to swapcache pages.

In some *brief* testing here, it appears that the use-once changes
make an improvement for light-medium workloads.  With swap-intensive
workloads, the (possibly accidental) changes to swapcache aging
in fact improve things a lot, and use-once makes things a little worse.

This is a modified Galbraith test: 64 megs of RAM, `make -j12
bzImage', dual CPU:

2.4.7:			6:54
2.4.7+Daniel's patch	6:06
2.4.7+the below patch	5:56

--- mm/swap.c	2001/01/23 08:37:48	1.3
+++ mm/swap.c	2001/07/25 04:08:59
@@ -234,8 +234,8 @@
 	DEBUG_ADD_PAGE
 	add_page_to_active_list(page);
 	/* This should be relatively rare */
-	if (!page->age)
-		deactivate_page_nolock(page);
+	deactivate_page_nolock(page);
+	page->age = 0;
 	spin_unlock(&pagemap_lru_lock);
 }

This change to lru_cache_add() is the only change made to 2.4.7,
and it provides the 17% speedup for this swap-intensive load.

With the same setup, running a `grep -r /usr/src' in parallel
with a `make -j3 bzImage', the `make -j3' takes:

2.4.7:			5:13
2.4.7+Daniel:		5:03
2.4.7+the above patch:	5:16

With the same setup, running a `grep -r /usr/src' in parallel
with a `make -j1 bzImage', the `make -j1' takes:

2.4.7:			9:25
2.4.7+Daniel:		8:55
2.4.7+the above patch:	9:35

So with lighter loads, use-once is starting to provide benefit, and the
deactivation is too aggressive.

> This is a _new_ thing, and I would like to know how that is changing the
> whole VM behaviour..

Sure.  Daniel's patch radically changes the aging of swapcache
and other pages, and with some workloads it appears that it is
this change which brings about the performance increase, rather
than the intended use-once stuff.

I suspect the right balance here is to take use-once, but *not*
take its changes to lru_cache_add().  That's a separate thing. 

Seems that lru_cache_add() is making decisions at a too-low level, and
they are sometimes wrong.  The decision as to what age to give the page
and whether it should be activated needs to be made at a higher level.


-

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 19:35         ` Rob Landley
@ 2001-07-25  6:10           ` Marcelo Tosatti
  2001-07-25  8:32           ` Eric W. Biederman
  2001-07-25 12:57           ` Daniel Phillips
  2 siblings, 0 replies; 58+ messages in thread
From: Marcelo Tosatti @ 2001-07-25  6:10 UTC (permalink / raw)
  To: Rob Landley
  Cc: Daniel Phillips, Rik van Riel, linux-kernel, Ben LaHaise, Mike Galbraith



On Tue, 24 Jul 2001, Rob Landley wrote:

> On Tuesday 24 July 2001 19:09, Daniel Phillips wrote:
> > On Tuesday 24 July 2001 22:53, Marcelo Tosatti wrote:
> > > On Tue, 24 Jul 2001, Rik van Riel wrote:
> > > > On Tue, 24 Jul 2001, Marcelo Tosatti wrote:
> > > > > On Tue, 24 Jul 2001, Daniel Phillips wrote:
> > > > > > Today's patch tackles the use-once problem, that is, the
> > > > > > problem of
> > > > >
> > > > > Well, as I see the patch should remove the problem where
> > > > > drop_behind() deactivates pages of a readahead window even if
> > > > > some of those pages are not "used-once" pages, right ?
> > > > >
> > > > > I just want to make sure the performance improvements you're
> > > > > seeing caused by the fix of this _particular_ problem.
> > > >
> > > > Fully agreed.
> > > >
> > > > Especially since it was a one-liner change from worse
> > > > performance to better performance (IIRC) it would be
> > > > nice to see exactly WHY the system behaves the way it
> > > > does.  ;)
> > >
> > > Yes.
> > >
> > > Daniel's patch adds "drop behind" (that is, adding swapcache
> > > pages to the inactive dirty) behaviour to swapcache pages.
> > >
> > > This is a _new_ thing, and I would like to know how that is changing
> > > the whole VM behaviour..
> >
> > Yes, absolutely.  I knew I was doing that but I also thought it wouldn't
> > hurt.  Rather it's part of a transition towards a full unification of
> > the file and swap paths.
> 
> Stupid question time:
> 
> So basically (lemme get this straight):
> 
> All VM allocation in 2.4 has become some variant of mmap.  
>
> Either we're mmaping in the executable and libraries when we exec,
> we're mmaping in a file, or we're mmaping in the swap file/block
> device which is how we do anonymous page allocations.
>
> 
> And this is why 2.4 keeps wanting to allocate swap space for pages that are 
> still in core.  And why the 2.4 VM keeps going ballistic on boxes that have 
> more physical DRAM than they have swap space.  (I.E. the 2X swap actually 
> being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram 
> box actually hurting matters by confusing the VM if swap is enabled AT ALL, 
> since it clashes conceptually.)

2.4 keeps wanting to allocate swap space for pages that are in core to be
able to add them to the LRU lists. That way we can get more detailed
information about the VM usage on the system.

When we run out of memory, we start looking linearly through all
processes' ptes to find ptes which have not been used recently.  When we
find one pte which has not been used recently, we unmap it from the
physical page.

For anonymous data, we have to unmap the pte and point it to some
"identifier" which can be used later (at page fault time) to find the
data (whether it's still in memory or already swapped out to disk).

Currently this "identifier" is a value in the swap map. The swap map is a
_direct_ mapping to backing storage (swap space).

That means we cannot unmap a pte which points to anon data without
allocating swap space for the page first. 

We have added an aging scheme which gets detailed information about the
page usage, great! The problem is that we cannot get detailed anon page
usage info without having to allocate swap space. Result: 2xRAM swap rule
created.

Have you got the idea here ? 

That is not going to change on 2.4, I think. Sorry. 

> So getting swap to exclude in-core pages (so they don't take up space in the 
> swap file) would therefore be conceptually similar to implementing "files 
> with holes in them" support like EXT2 has for these swap file mmaps.

Not really. To exclude in-core pages (in-core meaning all pte's which
point to the in-mem page are mapped and valid) from swap we just need to
remove the page from swapcache. It's as simple as that, really.

However that will cause swap fragmentation. And lots of it.

> And the argument that doing so might "fragment the swap file", while true, is 
actually a bit of a red herring because the real objection is that it 
> unnecessarily complicates an otherwise clean and straightforward concept.

Nope. The "objections" for not deallocating in-core pages from swap (on
2.4) are:

- The code would need to avoid fragmentation. It would have to "unmap"
contiguous chunks of swap mappings as soon as swap started to get full.

Take into account that the swap clustering code we are using now was
written at the time we used to allocate swap space only when actually
writing out swap data: there was no need for the code to be fast.

- 2.4 is out. Fixing the problem in a performance-decent way would
require changing quite a lot of code.


> Am I even close here?  (That was the stupid question I mentioned earlier, 
> knew I'd get around to it...)  Reading the code's a bit harder when you don't 
> have any sort of conceptual understanding of what the heck it's actually 
> trying to do.

Well, I think so.

> 
> > Basically, I just left that part of it hanging.  If you check my
> > detailed timings you'll see all my test runs have swaps=0, basically
> > because I didn't really want to hear about it just then ;-)
> >
> > I was pretty sure it could be fixed if it broke.
> 
> Just thought I'd say that personally, I think your greatest contribution in 
> this whole thread is that somebody finally clearly explained how the new VM 
> works, using small words.  The actual improvements are just gravy.  Yes I'm 
> biased.  :)
> 
> I don't suppose we could get some variant of your initial post into 
> /Documentation/vm/HowItActuallyWorks.txt?  (I take it the biggest "detail" 
> you glossed over was the separation of memory into zones?)


Last comment:

Who the heck cares about good swapping performance when tasks are
getting killed because there is no more swap space available due to
this swap preallocation?  No one.

It makes sense to fuck up swap clustering to avoid the OOM killer.

In case someone is having this problem with the OOM killer in practice
(real workloads), please tell us.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-25  5:12       ` Andrew Morton
@ 2001-07-25  6:33         ` Marcelo Tosatti
  2001-07-30 18:09         ` Marcelo Tosatti
  1 sibling, 0 replies; 58+ messages in thread
From: Marcelo Tosatti @ 2001-07-25  6:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Daniel Phillips, linux-kernel, Ben LaHaise, Mike Galbraith



On Wed, 25 Jul 2001, Andrew Morton wrote:

> Marcelo Tosatti wrote:
> > 
> > Daniel's patch adds "drop behind" (that is, adding swapcache
> > pages to the inactive dirty) behaviour to swapcache pages.
> 
> In some *brief* testing here, it appears that the use-once changes
> make an improvement for light-medium workloads.  With swap-intensive
> workloads, the (possibly accidental) changes to swapcache aging
> in fact improve things a lot, and use-once makes things a little worse.
> 
> This is a modified Galbraith test: 64 megs of RAM, `make -j12
> bzImage', dual CPU:
> 
> 2.4.7:			6:54
> 2.4.7+Daniel's patch	6:06
> 2.4.7+the below patch	5:56
> 
> --- mm/swap.c	2001/01/23 08:37:48	1.3
> +++ mm/swap.c	2001/07/25 04:08:59
> @@ -234,8 +234,8 @@
>  	DEBUG_ADD_PAGE
>  	add_page_to_active_list(page);
>  	/* This should be relatively rare */
> -	if (!page->age)
> -		deactivate_page_nolock(page);
> +	deactivate_page_nolock(page);
> +	page->age = 0;
>  	spin_unlock(&pagemap_lru_lock);
>  }
> 
> This change to lru_cache_add() is the only change made to 2.4.7,
> and it provides the 17% speedup for this swap-intensive load.
> 
> With the same setup, running a `grep -r /usr/src' in parallel
> with a `make -j3 bzImage', the `make -j3' takes:
> 
> 2.4.7:			5:13
> 2.4.7+Daniel:		5:03
> 2.4.7+the above patch:	5:16
> 
> With the same setup, running a `grep -r /usr/src' in parallel
> with a `make -j1 bzImage', the `make -j1' takes:
> 
> 2.4.7:			9:25
> 2.4.7+Daniel:		8:55
> 2.4.7+the above patch:	9:35

Ok, now we know what improves the swapping performance. 

I wonder _why_ it improves swapping performance.

> So with lighter loads, use-once is starting to provide benefit, and the
> deactivation is too aggressive.
> 
> > This is a _new_ thing, and I would like to know how that is changing the
> > whole VM behaviour..
> 
> Sure.  Daniel's patch radically changes the aging of swapcache
> and other pages, and with some workloads it appears that it is
> this change which brings about the performance increase, rather
> than the intended use-once stuff.
> 
> I suspect the right balance here is to take use-once, but *not*
> take its changes to lru_cache_add().  That's a separate thing. 

Yep. 

I perfectly agree with Daniel's use-once idea for _file cache_ pages. Its
really nice. 

Now I'm not sure why directly adding swapcache pages to the inactive dirty
lits with 0 zero age improves things. 

It _sounds_ like we avoid moving those swapcache pages to the active list,
then later move them to the inactive list and then writing them out.

That can result in less effective writeout clustering (its just a
guess) and also CPU wastage, as Daniel mentioned previously. 

I'll hopefully going to profile his code tomorrow with the VM stats stuff,
if I have time.

> Seems that lru_cache_add() is making decisions at a too-low level, and
> they are sometimes wrong.  The decision as to what age to give the page
> and whether it should be activated needs to be made at a higher level.

Agreed.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-25  0:27           ` Linus Torvalds
@ 2001-07-25  8:20             ` Patrick Dreker
  2001-07-25 12:57               ` Martin Devera
  2001-07-25 14:16               ` Daniel Phillips
  0 siblings, 2 replies; 58+ messages in thread
From: Patrick Dreker @ 2001-07-25  8:20 UTC (permalink / raw)
  To: Linus Torvalds, Rik van Riel; +Cc: phillips, linux-kernel

Am Mittwoch, 25. Juli 2001 02:27 schrieb Linus Torvalds:
> On Tue, 24 Jul 2001, Rik van Riel wrote:
> > (using smaller chunks, or chunks which aren't a
> > multiple of 4kB should break the current code)
>
> Maybe Patrick is using stdio? In that case, the small chunks will be
> coalesced in the library layer anyway, which might explain the lack of
> breakage.
I am using normal read() calls to read the file after open()ing it on
program start, and each call reads only 4 bytes. So if I am reading this
right, I am seeing an improvement where I should not see one, or at
least not such a big one, right?

I did a few more test runs on each of the kernels to check if the results are 
reproducible:
2.4.7-plain:
17.320u 115.100s 2:17.36 96.4%	0+0k 0+0io 110pf+0w
17.200u 94.170s 1:53.14 98.4%	0+0k 0+0io 110pf+0w
17.490u 113.730s 2:13.48 98.3%	0+0k 0+0io 110pf+0w

2.4.5-use_once:
14.730u 108.310s 2:09.57 94.9%	0+0k 0+0io 203pf+0w
13.880u 79.410s 1:38.64 94.5%	0+0k 0+0io 226pf+0w
14.840u 78.680s 1:37.86 95.5%	0+0k 0+0io 238pf+0w

The time under 2.4.5-use_once stays constant from the second run on (I 
tried 3 more times), while 2.4.7 shows pretty variable performance, but 
I have never seen it get better than the 1:53.14 from the second run 
above. I had stopped all services which I knew to cause periodic 
activity (exim, cron/anacron) which could disturb the tests.

I also noticed that under 2.4.5, after the 3 test runs, the KDE Taskbar 
got swapped out, while under 2.4.7 this was not the case.

> Of course, if it improved performance even when "broken", that would be
> even better. I like those kinds of algorithms.
Who doesn't? :)

>
> 		Linus

-- 
Patrick Dreker
---------------------------------------------------------------------
> Is there anything else I can contribute?
The latitude and longtitude of the bios writers current position, and
a ballistic missile.        
                         Alan Cox on linux-kernel@vger.kernel.org

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 19:35         ` Rob Landley
  2001-07-25  6:10           ` Marcelo Tosatti
@ 2001-07-25  8:32           ` Eric W. Biederman
  2001-07-25 12:53             ` Daniel Phillips
  2001-07-25 12:57           ` Daniel Phillips
  2 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2001-07-25  8:32 UTC (permalink / raw)
  To: landley
  Cc: Daniel Phillips, Marcelo Tosatti, Rik van Riel, linux-kernel,
	Ben LaHaise, Mike Galbraith

Rob Landley <landley@webofficenow.com> writes:

> Stupid question time:
> 
> So basically (lemme get this straight):
> 
> All VM allocation in 2.4 has become some variant of mmap.  Either we're 
> mmaping in the executable and libraries when we exec, we're mmaping in a 
> file, or we're mmaping in the swap file/block device which is how we do 
> anonymous page allocations.

You are getting warm.  In 2.4 most caching is moving exclusively to
the page cache.  Page cache write support was added between 2.2 and
2.4, allowing things to stop using the block cache.  However, everything
you are mentioning used the page cache in 2.2.  The page cache is
superior to the block cache as it caches the data we want to use, not
some intermediate form of it.

Having one general purpose cache layer increases the reliability
of the system, as we only have to have that layer debugged to know
everything is working.  It is by no means the entire VM, but it is
the majority, and it is where most of the pages that can be freed on
demand live, so the page cache receives most of the attention.

> And this is why 2.4 keeps wanting to allocate swap space for pages that are 
> still in core.  And why the 2.4 VM keeps going ballistic on boxes that have 
> more physical DRAM than they have swap space.  (I.E. the 2X swap actually 
> being NECESSARY now, and 256 megs of "overflow" swap files for a 2 gig ram 
> box actually hurting matters by confusing the VM if swap is enabled AT ALL, 
> since it clashes conceptually.)

Nope.  There are several candidates for this, but none of them is
simply using the page cache.

The description and characterization of the 2.4 VM with respect to
swapping is wrong.  I have seen too much overgeneralization on this
topic already, and refuse to treat it in the abstract.  If you have a
reproducible problem case, report it specifically. 

The biggest 2.4 swapping bug was the failure to free pages in the swap
cache that were not connected to any process.  This should now be fixed.

The biggest issue is that the 2.4 VM has been heavily modified since
2.2 and the bugs are still being worked out.  If you have an
application that needs noticeably more swap than 2.2 did, report it as
a bug.  Any other treatment at this point would be overconfidence in
the correctness of the VM in corner cases.

> So getting swap to exclude in-core pages (so they don't take up space in the 
> swap file) would therefore be conceptually similar to implementing "files 
> with holes in them" support like EXT2 has for these swap file mmaps.

Not really.  There are some similarities.
 
> And the argument that doing so might "fragment the swap file", while true, is 
> actually a bit of a red herring because the real objection is that it 
> unnecessarily complicates an otherwise clean and straightforward concept.

Fragmenting the swap file is a red herring.  

I have seen no real objections, just no real code.  The thing to watch
out for is that you need to make certain that you actually have the
resources to swap a file out to.  Consider having a swapfile on tmpfs.
Having a deadlock in the swapping code would be bad.

> Am I even close here?  (That was the stupid question I mentioned earlier, 
> knew I'd get around to it...)  Reading the code's a bit harder when you don't 
> have any sort of conceptual understanding of what the heck it's actually 
> trying to do.

See my earlier answers for closeness.
 
> I don't suppose we could get some variant of your initial post into 
> /Documentation/vm/HowItActuallyWorks.txt?  (I take it the biggest "detail" 
> you glossed over was the separation of memory into zones?)

Be careful here: what is maintainable is the concepts behind how it
works.  The page cache, multiple memory zones, etc.

You have the code for a detailed how-it-works document, and it is
continually changing.  The driving forces, tools, and ideas behind
what is happening should be static enough to document, however.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: linux partitioning in IA64
  2001-07-24 20:26           ` linux partitioning in IA64 hiufungeric.tse
@ 2001-07-25  9:29             ` Mike A. Harris
  0 siblings, 0 replies; 58+ messages in thread
From: Mike A. Harris @ 2001-07-25  9:29 UTC (permalink / raw)
  To: hiufungeric.tse; +Cc: linux-kernel

On Tue, 24 Jul 2001 hiufungeric.tse@utoronto.ca wrote:

>Date: Tue, 24 Jul 2001 16:26:12 -0400 (EDT)
>From: hiufungeric.tse@utoronto.ca
>To: linux-kernel@vger.kernel.org
>Content-Type: 	text/plain; charset=US-ASCII
>Subject: linux partitioning in IA64
>
>Hei,
>
>I am trying to install seawolf(linux IA64) on an IA 64 machine.
>For partitioning I went through several trials without success.
>(total # disk space is 10G). After all the initial settings, I
>restarted the machine to EFI (boot manager). Can't boot to
>linux and the EFI can't recognize files the installation CD
>initially saved on the hard disk. Are there special ways or
>tricks to partition the disk and to make it work?

Much better to send this type of question to ia64-list@redhat.com,
as it is mostly OT here...

Hope this helps.


----------------------------------------------------------------------
    Mike A. Harris  -  Linux advocate  -  Open Source advocate
       Opinions and viewpoints expressed are solely my own.
----------------------------------------------------------------------
Definition: MCSE - Must Consult Someone Experienced


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-25  8:32           ` Eric W. Biederman
@ 2001-07-25 12:53             ` Daniel Phillips
  2001-07-25 16:05               ` Eric W. Biederman
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25 12:53 UTC (permalink / raw)
  To: Eric W. Biederman, landley
  Cc: Marcelo Tosatti, Rik van Riel, linux-kernel, Ben LaHaise, Mike Galbraith

On Wednesday 25 July 2001 10:32, Eric W. Biederman wrote:
> Consider having a swapfile on tmpfs.

Ooh, a truly twisted thought.

We'd know we're making progress when we can prove it doesn't deadlock.

--
Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-25  8:20             ` Patrick Dreker
@ 2001-07-25 12:57               ` Martin Devera
  2001-07-25 14:16               ` Daniel Phillips
  1 sibling, 0 replies; 58+ messages in thread
From: Martin Devera @ 2001-07-25 12:57 UTC (permalink / raw)
  To: linux-kernel

Hello,

I read this whole thread and it started to interest me.

If I understood the code correctly, the used bit on ptes
is tested once per 64 seconds (for a particular page) from
kswapd, or more often under memory pressure, and only for
pages on the active list.  Correct?

Would it be possible to do cost-based eviction of pages?
Like computing a cost for each page periodically (with a variable
period - re-sort the page to a finer-period list when the used bit was
set for too long) and re-sorting the page into a hash bucket with a key
equal to the cost (the cost would be in a moderate integer range).

Then you can simply evict pages from the lowest hash bucket list
and move them to the free list (after cleaning them).

It would allow developers to modify and test various cost
schemes, including LRU, LRU/2, aging ... with a simple change
of the cost function.

I tried to create a statistical model of page eviction, and after
some simplification it seems that the cost c = (Tw-t)^2 + (Tr-t)^2,
where Tr/Tw is the expected time of the next page reference/dirtying
based on EWMA averaging, could minimize disk activity during
page eviction.

I wanted to try it but I'd like to know opinions of others.
Probably someone have got similar idea ..
Maybe it would be so expensive ..
fill other "maybe" here .. :-)

devik


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC] Optimization for use-once pages
  2001-07-24 19:35         ` Rob Landley
  2001-07-25  6:10           ` Marcelo Tosatti
  2001-07-25  8:32           ` Eric W. Biederman
@ 2001-07-25 12:57           ` Daniel Phillips
  2 siblings, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25 12:57 UTC (permalink / raw)
  To: Rob Landley, Marcelo Tosatti, Rik van Riel
  Cc: linux-kernel, Ben LaHaise, Mike Galbraith

On Tuesday 24 July 2001 21:35, Rob Landley wrote:
> I don't suppose we could get some variant of your initial post into
> /Documentation/vm/HowItActuallyWorks.txt?  (I take it the biggest
> "detail" you glossed over was the seperation of memory into zones?)

I glossed over a lot of big details:

  - zones
  - types of pages: anonymous, swap cache, file, high, buffer, ramdisk
  - interaction with page cache
  - various flavors of swap-in and swap-out paths
  - shared memory and the swap cache
  - locking strategy
  - aging strategy
  - scanning policy
  - deadlock and livelock avoidance measures
  - unloaded vs loaded behaviour
  - effect of load changes
  - out of memory handling
  - clustering (or lack of it)
  - IO throttling

Each of these is a topic all by itself.  You'll find all of them 
discussed extensively here on lkml.

--
Daniel
  


* Re: [RFC] Optimization for use-once pages
  2001-07-25  8:20             ` Patrick Dreker
  2001-07-25 12:57               ` Martin Devera
@ 2001-07-25 14:16               ` Daniel Phillips
  1 sibling, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25 14:16 UTC (permalink / raw)
  To: Patrick Dreker, Linus Torvalds, Rik van Riel; +Cc: phillips, linux-kernel

On Wednesday 25 July 2001 10:20, Patrick Dreker wrote:
> I did a few more test runs on each of the kernels to check if the
> results are reproducible:
> 2.4.7-plain:
> 17.320u 115.100s 2:17.36 96.4%	0+0k 0+0io 110pf+0w
> 17.200u 94.170s 1:53.14 98.4%	0+0k 0+0io 110pf+0w
> 17.490u 113.730s 2:13.48 98.3%	0+0k 0+0io 110pf+0w
>
> 2.4.5-use_once:
> 14.730u 108.310s 2:09.57 94.9%	0+0k 0+0io 203pf+0w
> 13.880u 79.410s 1:38.64 94.5%	0+0k 0+0io 226pf+0w
> 14.840u 78.680s 1:37.86 95.5%	0+0k 0+0io 238pf+0w

Look at the CPU percentage dropping along with the times.  Usually it 
goes the other way.

> The time under 2.4.5-use_once stays constant from the second run on
> (I tried 3 more times), while 2.4.7 shows pretty varying performance
> but I have never seen it getting better than the 1:53.14 from the
> second run above. I had stopped all services which I knew to cause
> periodic activity (exim, cron/anacron) which could disturb the tests.
>
> I also noticed, that under 2.4.5 after the 3 test runs the KDE
> Taskbasr got swapped out, while under 2.4.7 this was not the case.

Not swapping out the task bar is a different problem, only loosely 
related.  The use-once change is a step in the right direction because 
it makes relatively more file IO pages available for deactivation 
versus swap pages, so the task bar has a better chance of surviving.  
However, its connection to that problem is not really firm.

An additional line of attack is to look at the aging policy.  I have a 
strong sense we can do better there.  Right now we're aging everything 
down to a uniform zero, and that quite obviously throws away a lot of 
information.

In the 2.5 timeframe, better unification of the page cache and swap 
paths will make it much easier to focus on the taskbar problem.

--
Daniel


* Re: [RFC] Optimization for use-once pages
  2001-07-25 12:53             ` Daniel Phillips
@ 2001-07-25 16:05               ` Eric W. Biederman
  0 siblings, 0 replies; 58+ messages in thread
From: Eric W. Biederman @ 2001-07-25 16:05 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: landley, Marcelo Tosatti, Rik van Riel, linux-kernel,
	Ben LaHaise, Mike Galbraith

Daniel Phillips <phillips@bonn-fries.net> writes:

> On Wednesday 25 July 2001 10:32, Eric W. Biederman wrote:
> > Consider having a swapfile on tmpfs.
> 
> Ooh, a truly twisted thought.
> 
> We'd know we're making progress when we can prove it doesn't deadlock.

Given all of that I think I'm going to try to hack up a version that
works.

Ensuring we don't deadlock sounds like it will require some
mechanisms shared with journaling filesystems: the whole pinned-RAM
problem.  So to some extent it looks like a 2.4 problem, though I'd
rather solve it in 2.5.x first.

Eric


* Re: [RFC] Optimization for use-once pages
  2001-07-25  1:30       ` Daniel Phillips
@ 2001-07-25 21:18         ` Steve Lord
  0 siblings, 0 replies; 58+ messages in thread
From: Steve Lord @ 2001-07-25 21:18 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Anton Altaparmakov, Rik van Riel, jlnance, linux-kernel

> On Wednesday 25 July 2001 02:43, Anton Altaparmakov wrote:
> > At 01:04 25/07/2001, Daniel Phillips wrote:
> > >At that size you'd run a real risk of missing the tell-tale multiple
> > >references that mark a page as frequently used.  Think about
> > > metadata here (right now, that usually just means directory pages,
> > > but later... who knows).
> >
> > This is not actually implemented yet, but NTFS TNG will use the page
> > cache to hold both the LCN (physical clusters) and MFT (on disk
> > inodes) allocation bitmaps in addition to file and directory pages.
> > (Admittedly the LCN case folds into the file pages in page cache one
> > as the LCN bitmap is just stored inside the usual data of a file
> > called $Bitmap, but the MFT case is more complex as it is in an
> > additional attribute inside the file $MFT so the normal file
> > read/write functions definitely cannot be used. The usual data here
> > is the actual on disk inodes...)
> >
> > Just FYI.
> 
> Yes, not a surprise.  Plus, I was working on an experimental patch to 
> put Ext2 index blocks into the page cache just before I got off on this 
> use-once tangent.  Time to go back to that...
> 
> --
> Daniel
> -

So here is a datapoint from a filesystem which uses pages for all of its
data and metadata: XFS.  These are for 2.4.7 with and without the patch.
XFS metadata is held in a special address space; when not in use by XFS,
the VM is free to reclaim it via the usual page reclaim methods, so XFS
is probably fairly heavily influenced by a change like this.

All runs on a 2 cpu 450MHz PIII with 128M of memory, single scsi partition.

1. dbench runs at various loads back to back, starting with
   the highest load (3 runs of each load):

Vanilla 2.4.7 xfs

64
=====
Throughput 6.77363 MB/sec (NB=8.46704 MB/sec  67.7363 MBit/sec)
Throughput 7.03715 MB/sec (NB=8.79644 MB/sec  70.3715 MBit/sec)
Throughput 7.11706 MB/sec (NB=8.89632 MB/sec  71.1706 MBit/sec)
48
=====
Throughput 8.50251 MB/sec (NB=10.6281 MB/sec  85.0251 MBit/sec)
Throughput 8.38426 MB/sec (NB=10.4803 MB/sec  83.8426 MBit/sec)
Throughput 8.58152 MB/sec (NB=10.7269 MB/sec  85.8152 MBit/sec)
32
=====
Throughput 11.6378 MB/sec (NB=14.5473 MB/sec  116.378 MBit/sec)
Throughput 11.7586 MB/sec (NB=14.6983 MB/sec  117.586 MBit/sec)
Throughput 12.0686 MB/sec (NB=15.0858 MB/sec  120.686 MBit/sec)
16
=====
Throughput 16.9877 MB/sec (NB=21.2346 MB/sec  169.877 MBit/sec)
Throughput 16.9502 MB/sec (NB=21.1877 MB/sec  169.502 MBit/sec)
Throughput 16.8703 MB/sec (NB=21.0878 MB/sec  168.703 MBit/sec)
8
=====
Throughput 36.085 MB/sec (NB=45.1062 MB/sec  360.85 MBit/sec)
Throughput 31.8114 MB/sec (NB=39.7642 MB/sec  318.114 MBit/sec)
Throughput 36.0801 MB/sec (NB=45.1001 MB/sec  360.801 MBit/sec)
4
=====
Throughput 42.7741 MB/sec (NB=53.4677 MB/sec  427.741 MBit/sec)
Throughput 46.2591 MB/sec (NB=57.8239 MB/sec  462.591 MBit/sec)
Throughput 42.8487 MB/sec (NB=53.5608 MB/sec  428.487 MBit/sec)
2
=====
Throughput 48.776 MB/sec (NB=60.97 MB/sec  487.76 MBit/sec)
Throughput 48.4417 MB/sec (NB=60.5522 MB/sec  484.417 MBit/sec)
Throughput 47.0938 MB/sec (NB=58.8673 MB/sec  470.938 MBit/sec)
1
=====
Throughput 32.0957 MB/sec (NB=40.1196 MB/sec  320.957 MBit/sec)
Throughput 32.1901 MB/sec (NB=40.2376 MB/sec  321.901 MBit/sec)
Throughput 31.3972 MB/sec (NB=39.2465 MB/sec  313.972 MBit/sec)

2.4.7 xfs + patch

64
=====
Throughput 8.99448 MB/sec (NB=11.2431 MB/sec  89.9448 MBit/sec)
Throughput 9.07137 MB/sec (NB=11.3392 MB/sec  90.7137 MBit/sec)
Throughput 9.09924 MB/sec (NB=11.3741 MB/sec  90.9924 MBit/sec)
48
=====
Throughput 10.7328 MB/sec (NB=13.416 MB/sec  107.328 MBit/sec)
Throughput 10.791 MB/sec (NB=13.4887 MB/sec  107.91 MBit/sec)
Throughput 10.9185 MB/sec (NB=13.6482 MB/sec  109.185 MBit/sec)
32
=====
Throughput 12.6625 MB/sec (NB=15.8282 MB/sec  126.625 MBit/sec)
Throughput 12.7601 MB/sec (NB=15.9501 MB/sec  127.601 MBit/sec)
Throughput 12.6732 MB/sec (NB=15.8415 MB/sec  126.732 MBit/sec)
16
=====
Throughput 19.5692 MB/sec (NB=24.4614 MB/sec  195.692 MBit/sec)
Throughput 19.6163 MB/sec (NB=24.5204 MB/sec  196.163 MBit/sec)
Throughput 19.6773 MB/sec (NB=24.5966 MB/sec  196.773 MBit/sec)
8
=====
Throughput 36.8323 MB/sec (NB=46.0404 MB/sec  368.323 MBit/sec)
Throughput 36.9346 MB/sec (NB=46.1683 MB/sec  369.346 MBit/sec)
Throughput 36.4622 MB/sec (NB=45.5778 MB/sec  364.622 MBit/sec)
4
=====
Throughput 42.883 MB/sec (NB=53.6037 MB/sec  428.83 MBit/sec)
Throughput 42.1596 MB/sec (NB=52.6995 MB/sec  421.596 MBit/sec)
Throughput 42.9602 MB/sec (NB=53.7002 MB/sec  429.602 MBit/sec)
2
=====
Throughput 44.2798 MB/sec (NB=55.3498 MB/sec  442.798 MBit/sec)
Throughput 39.5441 MB/sec (NB=49.4301 MB/sec  395.441 MBit/sec)
Throughput 46.8569 MB/sec (NB=58.5711 MB/sec  468.569 MBit/sec)
1
=====
Throughput 32.0682 MB/sec (NB=40.0852 MB/sec  320.682 MBit/sec)
Throughput 32.1923 MB/sec (NB=40.2403 MB/sec  321.923 MBit/sec)
Throughput 32.1677 MB/sec (NB=40.2097 MB/sec  321.677 MBit/sec)

A definite improvement under high loads, and the lower loads appear
about the same.  I reran the dbench 2 numbers and got results in line
with the unpatched kernel.

2. Kernel builds

Unmodified kernel:

time make -j3 bzImage
948.880u 70.660s 8:45.88 193.8% 0+0k 0+0io 721072pf+0w
948.360u 69.950s 8:43.68 194.4% 0+0k 0+0io 721072pf+0w

Patched kernel:
942.660u 54.320s 8:29.49 195.6% 0+0k 0+0io 721044pf+0w
942.150u 54.900s 8:29.35 195.7% 0+0k 0+0io 721041pf+0w

Looks good to me: 22% less system time for the kernel build
is pretty good.  Also looks like my CPUs are slower than most
people's nowadays :-(

Steve




* Re: [RFC] Optimization for use-once pages
  2001-07-25  5:12       ` Andrew Morton
  2001-07-25  6:33         ` Marcelo Tosatti
@ 2001-07-30 18:09         ` Marcelo Tosatti
  1 sibling, 0 replies; 58+ messages in thread
From: Marcelo Tosatti @ 2001-07-30 18:09 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Rik van Riel, Daniel Phillips, lkml, Ben LaHaise, Mike Galbraith



On Wed, 25 Jul 2001, Andrew Morton wrote:

> Marcelo Tosatti wrote:
> > 
> > Daniel's patch adds "drop behind" (that is, adding swapcache
> > pages to the inactive dirty) behaviour to swapcache pages.
> 
> In some *brief* testing here, it appears that the use-once changes
> make an improvement for light-medium workloads.  With swap-intensive
> workloads, the (possibly accidental) changes to swapcache aging
> in fact improve things a lot, and use-once makes things a little worse.
> 
> This is a modified Galbraith test: 64 megs of RAM, `make -j12
> bzImage', dual CPU:
> 
> 2.4.7:			6:54
> 2.4.7+Daniel's patch	6:06
> 2.4.7+the below patch	5:56
> 
> --- mm/swap.c	2001/01/23 08:37:48	1.3
> +++ mm/swap.c	2001/07/25 04:08:59
> @@ -234,8 +234,8 @@
>  	DEBUG_ADD_PAGE
>  	add_page_to_active_list(page);
>  	/* This should be relatively rare */
> -	if (!page->age)
> -		deactivate_page_nolock(page);
> +	deactivate_page_nolock(page);
> +	page->age = 0;
>  	spin_unlock(&pagemap_lru_lock);
>  }
> 
> This change to lru_cache_add() is the only change made to 2.4.7,
> and it provides the 17% speedup for this swap-intensive load.

After some thought I think this is due to swapin readahead.

We add "drop behind" behaviour to swap readahead, so we have less impact
on the system due to swap readahead "misses". 

That is nice but dangerous under swap-IO-intensive loads:

We start swapin readahead, bring the first page of the "cluster" into
memory, block on the next swap page's IO (the ones we are doing
readahead on), and in the meantime the first page we read is reclaimed
by someone else.

Then we'll have to reread the first page at the direct
read_swap_cache_async() called by do_swap_page().

I really want to confirm whether this is happening right now.  I hope
to do it soon.



* Re: [RFC] Optimization for use-once pages
  2001-07-26 10:38       ` Marcelo Tosatti
@ 2001-07-26 12:17         ` Daniel Phillips
  0 siblings, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-26 12:17 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Eric W. Biederman, Rik van Riel, Andrew Morton, linux-kernel,
	Ben LaHaise, Mike Galbraith

On Thursday 26 July 2001 12:38, Marcelo Tosatti wrote:
> On Thu, 26 Jul 2001, Daniel Phillips wrote:
> > There does seem to be a dangling thread here though - when a
> > process page is unmapped and added to swap cache in try_to_swap_out
> > then later faulted back in, I don't see where we "rescue" the page
> > from the inactive queue.  Maybe I'm just not looking hard enough.
>
> do_swap_page()->lookup_swap_cache()->__find_page_nolock()->SetPageRef
>erenced().
>
> The referenced bit will make page_launder/reclaim_page move the page
> to the active list.

Yes.  And there I set ->age to a known state (START_AGE), so everything 
seems to be in order.

--
Daniel


* Re: [RFC] Optimization for use-once pages
  2001-07-26  8:36   ` Eric W. Biederman
@ 2001-07-26 12:06     ` Daniel Phillips
  2001-07-26 10:38       ` Marcelo Tosatti
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel Phillips @ 2001-07-26 12:06 UTC (permalink / raw)
  To: Eric W. Biederman, Rik van Riel
  Cc: Marcelo Tosatti, Andrew Morton, linux-kernel, Ben LaHaise,
	Mike Galbraith

On Thursday 26 July 2001 10:36, Eric W. Biederman wrote:
> Rik van Riel <riel@conectiva.com.br> writes:
> > On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > > Now I'm not sure why directly adding swapcache pages to the
> > > > inactive dirty list with zero age improves things.
> > >
> > > Because it moves the page rapidly down the inactive queue towards
> > > the ->writepage instead of leaving it floating around on the
> > > active ring waiting to be noticed.  We already know we want to
> > > evict that page,
> >
> > We don't.
>
> Agreed.  The kinds of ``aging'' don't match up so we can't tell if
> it meets our usual criteria for aging.

Well, in the absence of benchmark evidence it's hard to tell how 
valuable the usual criteria really are.  That's another topic though, 
because the question here is not how the aging matches up but how the 
inactive queue handling matches up, see below.

> > The page gets unmapped and added to the swap cache the first
> > time it wasn't referenced by the process.
> >
> > This is before any page aging is done.
>
> Actually there has been aging done.  Unless you completely disable
> testing for pte_young.  It is a different kind of aging but it is
> aging.

In the case of a process page and in the case of a file IO page the 
situation is the same: we have decided to put the page on trial.  Once 
we have arrived at that decision its previous history doesn't matter,
so it makes sense to set its age to a known state.  In this case it's 
0, meaning "on trial".

Consider that the process page will only ever be down-aged after being 
unmapped.  So if we put it on the active queue, it just acts like a 
big, slow, inefficient timer.  Why not put it on a fifo queue instead?  
It's the same thing, just more efficient.  But we already have a fifo 
queue, the inactive_dirty_queue, which we are using for everything 
else, so why not use it for this too; then at least we know it's fair.

In other words, there is no obvious need to treat a process page 
differently from a file IO page once we decide to put it on trial.

There does seem to be a dangling thread here though - when a process 
page is unmapped and added to swap cache in try_to_swap_out then later 
faulted back in, I don't see where we "rescue" the page from the 
inactive queue.  Maybe I'm just not looking hard enough.

--
Daniel


* Re: [RFC] Optimization for use-once pages
  2001-07-26 12:06     ` Daniel Phillips
@ 2001-07-26 10:38       ` Marcelo Tosatti
  2001-07-26 12:17         ` Daniel Phillips
  0 siblings, 1 reply; 58+ messages in thread
From: Marcelo Tosatti @ 2001-07-26 10:38 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Eric W. Biederman, Rik van Riel, Andrew Morton, linux-kernel,
	Ben LaHaise, Mike Galbraith



On Thu, 26 Jul 2001, Daniel Phillips wrote:

> There does seem to be a dangling thread here though - when a process 
> page is unmapped and added to swap cache in try_to_swap_out then later 
> faulted back in, I don't see where we "rescue" the page from the 
> inactive queue.  Maybe I'm just not looking hard enough.

do_swap_page()->lookup_swap_cache()->__find_page_nolock()->SetPageReferenced().

The referenced bit will make page_launder/reclaim_page move the page to
the active list.





* Re: [RFC] Optimization for use-once pages
  2001-07-25 16:41 ` Rik van Riel
  2001-07-25 17:46   ` Daniel Phillips
@ 2001-07-26  8:36   ` Eric W. Biederman
  2001-07-26 12:06     ` Daniel Phillips
  1 sibling, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2001-07-26  8:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Daniel Phillips, Marcelo Tosatti, Andrew Morton, linux-kernel,
	Ben LaHaise, Mike Galbraith

Rik van Riel <riel@conectiva.com.br> writes:

> On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > Now I'm not sure why directly adding swapcache pages to the inactive
> > > dirty list with zero age improves things.
> >
> > Because it moves the page rapidly down the inactive queue towards the
> > ->writepage instead of leaving it floating around on the active ring
> > waiting to be noticed.  We already know we want to evict that page,
> 
> We don't.

Agreed.  The kinds of ``aging'' don't match up so we can't tell if
it meets our usual criteria for aging.
 
> The page gets unmapped and added to the swap cache the first
> time it wasn't referenced by the process.
> 
> This is before any page aging is done.

Actually there has been aging done.  Unless you completely disable
testing for pte_young.  It is a different kind of aging but it is
aging.

Eric


* Re: [RFC] Optimization for use-once pages
@ 2001-07-26  3:27 Ed Tomlinson
  0 siblings, 0 replies; 58+ messages in thread
From: Ed Tomlinson @ 2001-07-26  3:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Daniel Phillips

Hi,

Here are some more figures.  

2.4.7 + ide patch + ide dma timeout fix + lvm beta8 + reiserfs fs

Throughput 19.3959 MB/sec (NB=24.2449 MB/sec  193.959 MBit/sec)
dbench 20  24.88s user 70.89s system 69% cpu 2:17.23 total

Throughput 19.0447 MB/sec (NB=23.8059 MB/sec  190.447 MBit/sec)
dbench 20  24.37s user 69.24s system 67% cpu 2:19.63 total

tob -f - -full misc > misc-0725.tob  235.80s user 321.73s system 42% cpu 21:52.49 total


2.4.7 + ide patch + ide dma timeout fix + lvm beta8 + reiserfs fs

Throughput 19.3395 MB/sec (NB=24.1744 MB/sec  193.395 MBit/sec)
dbench 20  23.70s user 70.68s system 68% cpu 2:17.52 total

Throughput 19.1872 MB/sec (NB=23.9841 MB/sec  191.872 MBit/sec)
dbench 20  24.30s user 69.42s system 67% cpu 2:18.60 total

tob -f - -full misc > /back/misc-0725.tob  229.64s user 326.26s system 43% cpu 21:11.77 total

Basically no big changes here, which is good news.

Ed Tomlinson


* Re: [RFC] Optimization for use-once pages
  2001-07-25 16:41 ` Rik van Riel
@ 2001-07-25 17:46   ` Daniel Phillips
  2001-07-26  8:36   ` Eric W. Biederman
  1 sibling, 0 replies; 58+ messages in thread
From: Daniel Phillips @ 2001-07-25 17:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Marcelo Tosatti, Andrew Morton, linux-kernel, Ben LaHaise,
	Mike Galbraith

On Wednesday 25 July 2001 18:41, Rik van Riel wrote:
> On Wed, 25 Jul 2001, Daniel Phillips wrote:
> > On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > > Now I'm not sure why directly adding swapcache pages to the
> > > inactive dirty list with zero age improves things.
> >
> > Because it moves the page rapidly down the inactive queue towards
> > the ->writepage instead of leaving it floating around on the active
> > ring waiting to be noticed.  We already know we want to evict that
> > page,
>
> We don't.
>
> The page gets unmapped and added to the swap cache the first
> time it wasn't referenced by the process.
>
> This is before any page aging is done.

True, it's more accurate to say that we already know we want to *try* 
evicting that page.  A wrong guess should not make it all the way down 
the inactive queue.

--
Daniel


* Re: [RFC] Optimization for use-once pages
       [not found] <0107251802300B.00907@starship>
@ 2001-07-25 16:41 ` Rik van Riel
  2001-07-25 17:46   ` Daniel Phillips
  2001-07-26  8:36   ` Eric W. Biederman
  0 siblings, 2 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-25 16:41 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Marcelo Tosatti, Andrew Morton, linux-kernel, Ben LaHaise,
	Mike Galbraith

On Wed, 25 Jul 2001, Daniel Phillips wrote:
> On Wednesday 25 July 2001 08:33, Marcelo Tosatti wrote:
> > Now I'm not sure why directly adding swapcache pages to the inactive
> > dirty list with zero age improves things.
>
> Because it moves the page rapidly down the inactive queue towards the
> ->writepage instead of leaving it floating around on the active ring
> waiting to be noticed.  We already know we want to evict that page,

We don't.

The page gets unmapped and added to the swap cache the first
time it wasn't referenced by the process.

This is before any page aging is done.

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [RFC] Optimization for use-once pages
  2001-07-24 18:15       ` Mike Castle
@ 2001-07-24 18:21         ` Rik van Riel
  0 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 18:21 UTC (permalink / raw)
  To: Mike Castle; +Cc: linux-kernel

On Tue, 24 Jul 2001, Mike Castle wrote:
> On Tue, Jul 24, 2001 at 02:51:03PM -0300, Rik van Riel wrote:
> > Let me give you an example:
>
> Of course, such a program is poorly written anyway.  That many
> individual system calls!

That's not the point I was trying to make ;))

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [RFC] Optimization for use-once pages
  2001-07-24 17:51     ` Rik van Riel
  2001-07-24 18:09       ` Stuart MacDonald
@ 2001-07-24 18:15       ` Mike Castle
  2001-07-24 18:21         ` Rik van Riel
  1 sibling, 1 reply; 58+ messages in thread
From: Mike Castle @ 2001-07-24 18:15 UTC (permalink / raw)
  To: linux-kernel

On Tue, Jul 24, 2001 at 02:51:03PM -0300, Rik van Riel wrote:
> Let me give you an example:
> 
> - sequential access of a file
> - script reads the file in 80-byte segments
>   (parsing some arcane data structure)
> - these segments are accessed in rapid succession
> - each 80-byte segment is accessed ONCE
> 
> In this case, even though the data is accessed only
> once, each page is touched PAGE_SIZE/80 times, with
> one 80-byte read() each time.

Of course, such a program is poorly written anyway.  That many individual
system calls!  At least rewrite it to use fread().  ;->

Change the example to use mmap(), and one feels less likely to throttle
the programmer.

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc


* Re: [RFC] Optimization for use-once pages
  2001-07-24 17:51     ` Rik van Riel
@ 2001-07-24 18:09       ` Stuart MacDonald
  2001-07-24 18:15       ` Mike Castle
  1 sibling, 0 replies; 58+ messages in thread
From: Stuart MacDonald @ 2001-07-24 18:09 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

From: "Rik van Riel" <riel@conectiva.com.br>
> Actually, the length of this interval could be even smaller
> and is often a point of furious debating.

Which is why I was avoiding flames earlier; I sorta
figured this might be a hot issue.

> Let me give you an example:
>
> - sequential access of a file
> - script reads the file in 80-byte segments
>   (parsing some arcane data structure)
> - these segments are accessed in rapid succession
> - each 80-byte segment is accessed ONCE
>
> In this case, even though the data is accessed only
> once, each page is touched PAGE_SIZE/80 times, with
> one 80-byte read() each time.

I'd figured that sort of scenario was the basis for counting
them all as one.  What about:

- random access of the same file
- script reads one arcane 80-byte struct
- script updates that struct, say, PAGE_SIZE/80
  times, with one 80-byte write() each time

If they're all counted as one, would the page age
correctly if the script happens to take a few seconds'
break between one flurry of all-as-one updating and the next?

..Stu




* Re: [RFC] Optimization for use-once pages
  2001-07-24 17:44   ` Mike Castle
@ 2001-07-24 17:52     ` Rik van Riel
  0 siblings, 0 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 17:52 UTC (permalink / raw)
  To: Mike Castle; +Cc: linux-kernel

On Tue, 24 Jul 2001, Mike Castle wrote:
> On Tue, Jul 24, 2001 at 02:07:32PM -0300, Rik van Riel wrote:
> > > I hate to bother you directly, but I don't wish to start a flame
> > > war on lkml. How exactly would you explain two accesses as
> > > being "used once"?
> >
> > Because they occur in a very short interval, an interval MUCH
> > shorter than the time scale in which the VM subsystem looks at
> > referenced bits, etc...
>
> Would mmap() then a for(;;) loop over the range be an example of
> such a use?

Yes. The problem here, however, would be that we'll only find
the referenced bit in the page table some time later.

This is another reason for having a "new" queue like 2Q's A1in
queue. It means we can both count all accesses in the first short
period as one access AND we can avoid actually doing anything
special as these accesses happen....

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [RFC] Optimization for use-once pages
  2001-07-24 17:42   ` Stuart MacDonald
@ 2001-07-24 17:51     ` Rik van Riel
  2001-07-24 18:09       ` Stuart MacDonald
  2001-07-24 18:15       ` Mike Castle
  0 siblings, 2 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 17:51 UTC (permalink / raw)
  To: Stuart MacDonald; +Cc: linux-kernel

On Tue, 24 Jul 2001, Stuart MacDonald wrote:
> From: "Rik van Riel" <riel@conectiva.com.br>
> > Because they occur in a very short interval, an interval MUCH
> > shorter than the time scale in which the VM subsystem looks at
> > referenced bits, etc...
>
So what you're saying is that any number of accesses, as long
as they all occur within an interval shorter than the VM
subsystem's scanning interval, are all counted as one?

Actually, the length of this interval could be even smaller
and is often a point of furious debating.

The 2Q algorithm seems to have solved this problem by not
using an interval, but a FIFO queue of small, fixed length.

> > This seems to be generally accepted theory and practice in the
> > field of page replacement.
>
> Okay, just seems odd to me, but IANAVMGuru.

Seems odd at first glance, true.

Let me give you an example:

- sequential access of a file
- script reads the file in 80-byte segments
  (parsing some arcane data structure)
- these segments are accessed in rapid succession
- each 80-byte segment is accessed ONCE

In this case, even though the data is accessed only
once, each page is touched PAGE_SIZE/80 times, with
one 80-byte read() each time.

I hope this example gives you something to hold on to and makes this
- somewhat counter-intuitive - concept easier to grasp.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [RFC] Optimization for use-once pages
  2001-07-24 17:07 ` Rik van Riel
  2001-07-24 17:42   ` Stuart MacDonald
@ 2001-07-24 17:44   ` Mike Castle
  2001-07-24 17:52     ` Rik van Riel
  1 sibling, 1 reply; 58+ messages in thread
From: Mike Castle @ 2001-07-24 17:44 UTC (permalink / raw)
  To: linux-kernel

On Tue, Jul 24, 2001 at 02:07:32PM -0300, Rik van Riel wrote:
> > I hate to bother you directly, but I don't wish to start a flame
> > war on lkml. How exactly would you explain two accesses as
> > being "used once"?
> 
> Because they occur in a very short interval, an interval MUCH
> shorter than the time scale in which the VM subsystem looks at
> referenced bits, etc...


Would mmap() then a for(;;) loop over the range be an example of such a
use?

mrc
-- 
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc


* Re: [RFC] Optimization for use-once pages
  2001-07-24 17:07 ` Rik van Riel
@ 2001-07-24 17:42   ` Stuart MacDonald
  2001-07-24 17:51     ` Rik van Riel
  2001-07-24 17:44   ` Mike Castle
  1 sibling, 1 reply; 58+ messages in thread
From: Stuart MacDonald @ 2001-07-24 17:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

From: "Rik van Riel" <riel@conectiva.com.br>
> Because they occur in a very short interval, an interval MUCH
> shorter than the time scale in which the VM subsystem looks at
> referenced bits, etc...

So what you're saying is that any number of accesses, as long
as they all occur within an interval shorter than the VM
subsystem's scanning interval, are all counted as one?

> This seems to be generally accepted theory and practice in the
> field of page replacement.

Okay, just seems odd to me, but IANAVMGuru.

..Stu




* Re: [RFC] Optimization for use-once pages
       [not found] <010301c11463$1ee00440$294b82ce@connecttech.com>
@ 2001-07-24 17:07 ` Rik van Riel
  2001-07-24 17:42   ` Stuart MacDonald
  2001-07-24 17:44   ` Mike Castle
  0 siblings, 2 replies; 58+ messages in thread
From: Rik van Riel @ 2001-07-24 17:07 UTC (permalink / raw)
  To: Stuart MacDonald; +Cc: linux-kernel

[ CC: to linux-kernel added back since the topic isn't flammable ;) ]

On Tue, 24 Jul 2001, Stuart MacDonald wrote:
> From: "Rik van Riel" <riel@conectiva.com.br>
> > If a page gets 2 1-byte reads in a microsecond, with this
> > patch it would get promoted to the active list, even though
> > it's really only used once.
>
> I hate to bother you directly, but I don't wish to start a flame
> war on lkml. How exactly would you explain two accesses as
> being "used once"?

Because they occur in a very short interval, an interval MUCH
shorter than the time scale in which the VM subsystem looks at
referenced bits, etc...

Generally a CPU doesn't read more than one cache line at a time,
so I guess all "single references" are in reality multiple
accesses very shortly following each other.
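[Editorial note: the coalescing described above can be sketched as follows. This is a hypothetical illustration, not the actual 2.4 VM code: any number of references within one scan interval collapse into a single "use", and a page is promoted to the active list only when it is referenced again in a *later* interval.]

```c
#include <assert.h>
#include <stdbool.h>

struct page_sketch {
	long last_seen_interval;  /* scan interval of the last reference */
	bool used_before;         /* referenced in some earlier interval? */
};

/* Record a reference in the given scan interval; returns true when the
 * page should be promoted to the active list (second distinct use). */
static bool reference_page(struct page_sketch *p, long interval)
{
	if (p->used_before && interval > p->last_seen_interval)
		return true;  /* referenced again later: a real second use */

	/* First use, or another access inside the same interval:
	 * all of these collapse into a single "use once". */
	p->used_before = true;
	p->last_seen_interval = interval;
	return false;
}
```

So two 1-byte reads a microsecond apart land in the same scan interval and still count as one use, while a reference in a later interval promotes the page.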

This seems to be generally accepted theory and practice in the
field of page replacement.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2001-07-30 19:39 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-07-24  3:47 [RFC] Optimization for use-once pages Daniel Phillips
2001-07-24 12:38 ` jlnance
2001-07-24 16:56   ` Rik van Riel
2001-07-25  0:04     ` Daniel Phillips
2001-07-25  0:43     ` Anton Altaparmakov
2001-07-25  1:30       ` Daniel Phillips
2001-07-25 21:18         ` Steve Lord
2001-07-24 16:48 ` Linus Torvalds
2001-07-24 17:04   ` Rik van Riel
2001-07-24 18:14     ` Daniel Phillips
2001-07-24 18:15       ` Rik van Riel
2001-07-24 19:41         ` Is /dev/epoll scheduled for official inclusion anytime soon? David E. Weekly
2001-07-24 20:05           ` Rik van Riel
2001-07-24 20:26           ` linux partitioning in IA64 hiufungeric.tse
2001-07-25  9:29             ` Mike A. Harris
2001-07-24 20:24   ` [RFC] Optimization for use-once pages Patrick Dreker
2001-07-24 20:32     ` Rik van Riel
2001-07-24 22:16       ` Patrick Dreker
2001-07-24 22:25         ` Rik van Riel
2001-07-25  0:27           ` Linus Torvalds
2001-07-25  8:20             ` Patrick Dreker
2001-07-25 12:57               ` Martin Devera
2001-07-25 14:16               ` Daniel Phillips
2001-07-24 20:33     ` Linus Torvalds
2001-07-25  1:25       ` Daniel Phillips
2001-07-25  0:18     ` Daniel Phillips
2001-07-24 20:27 ` Marcelo Tosatti
2001-07-24 22:05   ` Rik van Riel
2001-07-24 20:53     ` Marcelo Tosatti
2001-07-24 22:27       ` Rik van Riel
2001-07-24 23:09       ` Daniel Phillips
2001-07-24 19:35         ` Rob Landley
2001-07-25  6:10           ` Marcelo Tosatti
2001-07-25  8:32           ` Eric W. Biederman
2001-07-25 12:53             ` Daniel Phillips
2001-07-25 16:05               ` Eric W. Biederman
2001-07-25 12:57           ` Daniel Phillips
2001-07-25  5:12       ` Andrew Morton
2001-07-25  6:33         ` Marcelo Tosatti
2001-07-30 18:09         ` Marcelo Tosatti
2001-07-24 22:41     ` Daniel Phillips
2001-07-24 22:22   ` Daniel Phillips
2001-07-25  2:31 ` Marcelo Tosatti
     [not found] <010301c11463$1ee00440$294b82ce@connecttech.com>
2001-07-24 17:07 ` Rik van Riel
2001-07-24 17:42   ` Stuart MacDonald
2001-07-24 17:51     ` Rik van Riel
2001-07-24 18:09       ` Stuart MacDonald
2001-07-24 18:15       ` Mike Castle
2001-07-24 18:21         ` Rik van Riel
2001-07-24 17:44   ` Mike Castle
2001-07-24 17:52     ` Rik van Riel
     [not found] <0107251802300B.00907@starship>
2001-07-25 16:41 ` Rik van Riel
2001-07-25 17:46   ` Daniel Phillips
2001-07-26  8:36   ` Eric W. Biederman
2001-07-26 12:06     ` Daniel Phillips
2001-07-26 10:38       ` Marcelo Tosatti
2001-07-26 12:17         ` Daniel Phillips
2001-07-26  3:27 Ed Tomlinson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).