* [2.4] heavy-load under swap space shortage
@ 2004-02-02 10:12 j-nomura
  2004-02-02 13:29 ` Hugh Dickins
  0 siblings, 1 reply; 42+ messages in thread
From: j-nomura @ 2004-02-02 10:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: j-nomura

Hello,

swap_out() always seems to keep scanning even when no swap space is available.
This keeps the CPU(s) busy with rarely successful page-out and may cause lock
contention on big SMP systems.

  swap_out()
    ..
    try_to_swap_out()
      ..
      entry = get_swap_page()
      /* find no swap page available */

How about checking nr_swap_pages first and giving up if it's 0?
Applying the patch below drastically reduced the system time consumed by
swap_out under swap space shortage.

Systems without swap also suffer from the same problem.

Any comments?

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

--- linux-2.4.24/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -326,8 +326,11 @@ out_unlock:
 static int swap_out(zone_t * classzone)
 {
 	int counter, nr_pages = SWAP_CLUSTER_MAX;
 	struct mm_struct *mm;
+
+	if (nr_swap_pages <= 0)
+		return 0;
 
 	counter = mmlist_nr << 1;
 	do {
		if (unlikely(current->need_resched)) {



* Re: [2.4] heavy-load under swap space shortage
  2004-02-02 10:12 [2.4] heavy-load under swap space shortage j-nomura
@ 2004-02-02 13:29 ` Hugh Dickins
  2004-02-03  7:53   ` j-nomura
  0 siblings, 1 reply; 42+ messages in thread
From: Hugh Dickins @ 2004-02-02 13:29 UTC (permalink / raw)
  To: j-nomura; +Cc: linux-kernel

On Mon, 2 Feb 2004 j-nomura@ce.jp.nec.com wrote:
> 
> swap_out() always seems to keep scanning even when no swap space is available.
> This keeps the CPU(s) busy with rarely successful page-out and may cause lock
> contention on big SMP systems.
>... 
> How about checking nr_swap_pages first and giving up if it's 0?

Sorry, no.  Don't be misled by the name, swap_out() is used to free
all kinds of mapped pages, not just those which would end up on swap.
Your patch just disables freeing mapped pages under memory pressure.

You could try the untested patch below to swap_out_vma(), but I don't
really recommend it: it still skips freeing up a less common category
of clean pages, just when you'd most like to free them.

Hugh

--- 2.4.25-pre8/mm/vmscan.c	2004-01-30 13:41:14.000000000 +0000
+++ linux/mm/vmscan.c	2004-02-02 13:01:28.067918544 +0000
@@ -263,6 +263,14 @@
 	if (vma->vm_flags & VM_RESERVED)
 		return count;
 
+	/* If no swap, don't waste time on areas which need it */
+	if (nr_swap_pages <= 0) {
+		if (!vma->vm_ops ||
+		    !vma->vm_ops->nopage ||
+		    vma->vm_ops->nopage == shmem_nopage)
+			return count;
+	}
+
 	pgdir = pgd_offset(mm, address);
 
 	end = vma->vm_end;



* Re: [2.4] heavy-load under swap space shortage
  2004-02-02 13:29 ` Hugh Dickins
@ 2004-02-03  7:53   ` j-nomura
  2004-02-03 17:19     ` Hugh Dickins
  0 siblings, 1 reply; 42+ messages in thread
From: j-nomura @ 2004-02-03  7:53 UTC (permalink / raw)
  To: hugh, linux-kernel; +Cc: j-nomura

Thanks for your comment.

> Your patch just disables freeing mapped pages under memory pressure.

Right.

> You could try the untested patch below to swap_out_vma(), but I don't
> really recommend it: it still skips freeing up a less common category
> of clean pages, just when you'd most like to free them.

Hmm, your patch to swap_out_vma didn't solve the problem.

The main cause of the heavy load seems to be hard contention on the page_table_lock.
The CPUs are effectively serialized in swap_out_mm, each doing unfruitful
scans.

The contention could be avoided by changing the spinlock to trylock.
How about the patch below?
With this change, the scanning becomes more efficient because it can be
done in parallel.

I'm not sure in this case whether the test for nr_swap_pages is necessary.

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

--- linux-2.4.24/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -292,7 +292,11 @@ static inline int swap_out_mm(struct mm_
 	 * Find the proper vm-area after freezing the vma chain 
 	 * and ptes.
 	 */
-	spin_lock(&mm->page_table_lock);
+	if (nr_swap_pages <= 0) {
+		if (!spin_trylock(&mm->page_table_lock))
+			return count; /* avoid contention */
+	} else
+		spin_lock(&mm->page_table_lock);
 	address = mm->swap_address;
 	if (address == TASK_SIZE || swap_mm != mm) {
 		/* We raced: don't count this mm but try again */


* Re: [2.4] heavy-load under swap space shortage
  2004-02-03  7:53   ` j-nomura
@ 2004-02-03 17:19     ` Hugh Dickins
  2004-02-04 11:40       ` j-nomura
  0 siblings, 1 reply; 42+ messages in thread
From: Hugh Dickins @ 2004-02-03 17:19 UTC (permalink / raw)
  To: j-nomura; +Cc: linux-kernel

On Tue, 3 Feb 2004 j-nomura@ce.jp.nec.com wrote:
> 
> The main cause of the heavy load seems to be hard contention on the page_table_lock.
> The CPUs are effectively serialized in swap_out_mm, each doing unfruitful
> scans.

Ah yes, now I remember observing the same, back in 2.4.13 days.
I haven't given it much thought between then and now.

> The contention could be avoided by changing the spinlock to trylock.
> How about the patch below?

It looks plausible and fooled me at first.  But since you return from
swap_out_mm without adjusting swap_address or *mmcounter, I think
you'll find that tasks encountering that contention just spin around
the swap_out loop, retrying the same mm, decrementing counter until
it's zero.  May save CPU, but won't do the freeing expected of it.
Or am I reading your patch wrongly?

> I'm not sure in this case whether the test for nr_swap_pages is necessary.

Unnecessary, nr_swap_pages is irrelevant, better spin_trylock in all cases.

Let me dig out my patch for 2.4.13, it appears to apply much the
same to 2.4.24 or 2.4.25-pre8, though untested recently.  I think I
got involved in something else, and never pushed it out back then.
Do you find it helpful?

It's a little more complicated than yours because although it's good
to let the tasks encountering page_table_lock contention go on to
try another mm, it would not be good to update swap_mm too readily:
the contentious mm is likely to be the one which most needs freeing.

Think carefully about swap_address here: I seem to recall that
it is racy, but benign - won't go wrong often enough to matter.

Hugh

--- 2.4.25-pre8/mm/vmscan.c	2004-01-30 13:41:14.000000000 +0000
+++ linux/mm/vmscan.c	2004-02-03 16:33:43.212770936 +0000
@@ -292,12 +292,11 @@
 	 * Find the proper vm-area after freezing the vma chain 
 	 * and ptes.
 	 */
-	spin_lock(&mm->page_table_lock);
 	address = mm->swap_address;
 	if (address == TASK_SIZE || swap_mm != mm) {
 		/* We raced: don't count this mm but try again */
 		++*mmcounter;
-		goto out_unlock;
+		goto out;
 	}
 	vma = find_vma(mm, address);
 	if (vma) {
@@ -310,15 +309,14 @@
 			if (!vma)
 				break;
 			if (!count)
-				goto out_unlock;
+				goto out;
 			address = vma->vm_start;
 		}
 	}
 	/* Indicate that we reached the end of address space */
 	mm->swap_address = TASK_SIZE;
 
-out_unlock:
-	spin_unlock(&mm->page_table_lock);
+out:
 	return count;
 }
 
@@ -344,13 +342,18 @@
 				goto empty;
 			swap_mm = mm;
 		}
+		while (!spin_trylock(&mm->page_table_lock)) {
+			mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist);
+			if (mm == &init_mm)
+				mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist);
+		}
 
 		/* Make sure the mm doesn't disappear when we drop the lock.. */
 		atomic_inc(&mm->mm_users);
 		spin_unlock(&mmlist_lock);
 
 		nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
-
+		spin_unlock(&mm->page_table_lock);
 		mmput(mm);
 
 		if (!nr_pages)



* Re: [2.4] heavy-load under swap space shortage
  2004-02-03 17:19     ` Hugh Dickins
@ 2004-02-04 11:40       ` j-nomura
  2004-02-05 18:42         ` Hugh Dickins
  0 siblings, 1 reply; 42+ messages in thread
From: j-nomura @ 2004-02-04 11:40 UTC (permalink / raw)
  To: hugh; +Cc: j-nomura, linux-kernel

> May save CPU, but won't do the freeing expected of it.
> Or am I reading your patch wrongly?

No, you're right.

> Let me dig out my patch for 2.4.13, it appears to apply much the
> same to 2.4.24 or 2.4.25-pre8, though untested recently.  I think I
> got involved in something else, and never pushed it out back then.
> Do you find it helpful?

With slight modification (please see the patch below), it's really helpful.
I hope you push it again to the mainline.

> It's a little more complicated than yours because although it's good
> to let the tasks encountering page_table_lock contention go on to
> try another mm, it would not be good to update swap_mm too readily:
> the contentious mm is likely to be the one which most needs freeing.

I agree.

> Think carefully about swap_address here: I seem to recall that
> it is racy, but benign - won't go wrong often enough to matter.

I had to remove the raciness check in swap_out_mm, otherwise swap_out_mm
returns immediately and the tasks contend on mmlist_lock in mmput().
I think the removal is OK because we now avoid the 'rush to the same mm' via the trylock.

I added the check for 'mm == swap_mm'. It might be necessary to avoid
the corner case where mmlist_lock is held too long.

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>


--- linux-2.4.24/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -292,13 +292,7 @@ static inline int swap_out_mm(struct mm_
 	 * Find the proper vm-area after freezing the vma chain 
 	 * and ptes.
 	 */
-	spin_lock(&mm->page_table_lock);
 	address = mm->swap_address;
-	if (address == TASK_SIZE || swap_mm != mm) {
-		/* We raced: don't count this mm but try again */
-		++*mmcounter;
-		goto out_unlock;
-	}
 	vma = find_vma(mm, address);
 	if (vma) {
 		if (address < vma->vm_start)
@@ -310,15 +304,14 @@ static inline int swap_out_mm(struct mm_
 			if (!vma)
 				break;
 			if (!count)
-				goto out_unlock;
+				goto out;
 			address = vma->vm_start;
 		}
 	}
 	/* Indicate that we reached the end of address space */
 	mm->swap_address = TASK_SIZE;
 
-out_unlock:
-	spin_unlock(&mm->page_table_lock);
+out:
 	return count;
 }
 
@@ -345,12 +338,20 @@ static int swap_out(zone_t * classzone)
 			swap_mm = mm;
 		}
 
+		/* scan mmlist and lock the first available mm */
+		while (mm == &init_mm || !spin_trylock(&mm->page_table_lock)) {
+			mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist);
+			if (mm == swap_mm)
+				goto empty;
+		}
+
 		/* Make sure the mm doesn't disappear when we drop the lock.. */
 		atomic_inc(&mm->mm_users);
 		spin_unlock(&mmlist_lock);
 
 		nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
 
+		spin_unlock(&mm->page_table_lock);
 		mmput(mm);
 
 		if (!nr_pages)


* Re: [2.4] heavy-load under swap space shortage
  2004-02-04 11:40       ` j-nomura
@ 2004-02-05 18:42         ` Hugh Dickins
  2004-02-06  9:03           ` j-nomura
  2004-03-10 10:57           ` j-nomura
  0 siblings, 2 replies; 42+ messages in thread
From: Hugh Dickins @ 2004-02-05 18:42 UTC (permalink / raw)
  To: j-nomura; +Cc: linux-kernel

On Wed, 4 Feb 2004 j-nomura@ce.jp.nec.com wrote:
> 
> With slight modification (please see the patch below), it's really helpful.
> I hope you push it again to the mainline.

Okay, glad to hear it, I'll try pushing to Marcelo in 2.4.26-pre.
Can you describe the benefit you see?

> I had to remove the raciness check in swap_out_mm, otherwise swap_out_mm
> returns immediately and the tasks contend on mmlist_lock in mmput().

Ah yes, thanks a lot, leaving that swap_mm != mm check behind
made an utter nonsense of my patch.

> I think the removal is OK because we now avoid the 'rush to the same mm' via the trylock.

I agree, swap_out_mm's TASK_SIZE check was for the previously all too
common case where one by one they spin on and visit the same old mm
too late, no need for it now.  But if we remove that block, the
mmcounter arg becomes pointless, so removed in version below.

> I added the check for 'mm == swap_mm'. It might be necessary to avoid
> the corner case where mmlist_lock is held too long.

Oh, good point.  But I'm uneasy about treating a trip round the mmlist
failing to get a lock as the same thing as finding no pages to free,
your "goto empty": drop lock and come around again instead, as below?

Hugh

--- 2.4.25-rc1/mm/vmscan.c	2004-02-05 17:37:28.755210944 +0000
+++ linux/mm/vmscan.c	2004-02-05 17:48:03.764674832 +0000
@@ -283,7 +283,7 @@
 /*
  * Returns remaining count of pages to be swapped out by followup call.
  */
-static inline int swap_out_mm(struct mm_struct * mm, int count, int * mmcounter, zone_t * classzone)
+static inline int swap_out_mm(struct mm_struct * mm, int count, zone_t * classzone)
 {
 	unsigned long address;
 	struct vm_area_struct* vma;
@@ -292,13 +292,7 @@
 	 * Find the proper vm-area after freezing the vma chain 
 	 * and ptes.
 	 */
-	spin_lock(&mm->page_table_lock);
 	address = mm->swap_address;
-	if (address == TASK_SIZE || swap_mm != mm) {
-		/* We raced: don't count this mm but try again */
-		++*mmcounter;
-		goto out_unlock;
-	}
 	vma = find_vma(mm, address);
 	if (vma) {
 		if (address < vma->vm_start)
@@ -310,15 +304,13 @@
 			if (!vma)
 				break;
 			if (!count)
-				goto out_unlock;
+				goto out;
 			address = vma->vm_start;
 		}
 	}
 	/* Indicate that we reached the end of address space */
 	mm->swap_address = TASK_SIZE;
-
-out_unlock:
-	spin_unlock(&mm->page_table_lock);
+out:
 	return count;
 }
 
@@ -330,6 +322,7 @@
 
 	counter = mmlist_nr << 1;
 	do {
+top:
 		if (unlikely(current->need_resched)) {
 			__set_current_state(TASK_RUNNING);
 			schedule();
@@ -345,12 +338,21 @@
 			swap_mm = mm;
 		}
 
+		/* Scan mmlist and lock the first available mm */
+		while (mm == &init_mm || !spin_trylock(&mm->page_table_lock)) {
+			mm = list_entry(mm->mmlist.next, struct mm_struct, mmlist);
+			if (mm == swap_mm) {
+				spin_unlock(&mmlist_lock);
+				goto top;
+			}
+		}
+
 		/* Make sure the mm doesn't disappear when we drop the lock.. */
 		atomic_inc(&mm->mm_users);
 		spin_unlock(&mmlist_lock);
 
-		nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
-
+		nr_pages = swap_out_mm(mm, nr_pages, classzone);
+		spin_unlock(&mm->page_table_lock);
 		mmput(mm);
 
 		if (!nr_pages)



* Re: [2.4] heavy-load under swap space shortage
  2004-02-05 18:42         ` Hugh Dickins
@ 2004-02-06  9:03           ` j-nomura
  2004-03-10 10:57           ` j-nomura
  1 sibling, 0 replies; 42+ messages in thread
From: j-nomura @ 2004-02-06  9:03 UTC (permalink / raw)
  To: hugh; +Cc: j-nomura, linux-kernel

> > With slight modification (please see the patch below), it's really helpful.
> > I hope you push it again to the mainline.
> 
> Okay, glad to hear it, I'll try pushing to Marcelo in 2.4.26-pre.

Thank you.

> Can you describe the benefit you see?

OK.
The benefit is simple.

Before applying your patch, the system became barely responsive
under a certain situation (no free swap space, with page-cache-intensive
applications running).
The system time went up to 80-100% for a long time (30 minutes to hours).

After applying your patch, under the same situation, the responsiveness
of the system does not degrade.
The system time goes up for a few seconds, but comes back down soon.

> > I added the check for 'mm == swap_mm'. It might be necessary to avoid
> > the corner case where mmlist_lock is held too long.
> 
> Oh, good point.  But I'm uneasy about treating a trip round the mmlist
> failing to get a lock as the same thing as finding no pages to free,
> your "goto empty": drop lock and come around again instead, as below?

I feel your approach is better than mine for keeping the current semantics.

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>


* Re: [2.4] heavy-load under swap space shortage
  2004-02-05 18:42         ` Hugh Dickins
  2004-02-06  9:03           ` j-nomura
@ 2004-03-10 10:57           ` j-nomura
  2004-03-14 19:47             ` Marcelo Tosatti
  2004-05-26 12:41             ` Marcelo Tosatti
  1 sibling, 2 replies; 42+ messages in thread
From: j-nomura @ 2004-03-10 10:57 UTC (permalink / raw)
  To: linux-kernel, marcelo.tosatti; +Cc: j-nomura

[-- Attachment #1: Type: Text/Plain, Size: 1550 bytes --]

After discussion with Hugh and recommendation from Andrea,
it turns out that Andrea's 05_vm_22_vm-anon-lru-3 in 2.4.23aa2 solves
the problem.
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa2/05_vm_22_vm-anon-lru-3

The patch adds a sysctl which accelerates performance on huge-memory
machines. It doesn't affect anything if turned off.
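
(Editorial note: with the attached patch applied, the knob appears as
/proc/sys/vm/vm_anon_lru, per the vm_table entry in the diff below.  Big
SMP/NUMA boxes could switch to the lazy behaviour at runtime with, for
example:

	echo 0 > /proc/sys/vm/vm_anon_lru

while the default of 1 keeps the current behaviour of inserting anon
pages into the LRU immediately.)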

Marcelo, could you apply this to 2.4.26-pre?
(I attached a slightly modified patch in which the feature is turned
off by default and which applies cleanly to the bk tree.)


My test case was:
  - there is a process with a large anonymous mapping
  - there is a large amount of page cache and there are active I/O processes
  - there are not many file mappings

So the problem happens in this way:
  - shrink_cache tries scanning the inactive list, in which most pages
    are anonymously mapped
  - it soon falls into swap_out because of the many anonymous pages
  - with no free swap space, it hardly frees anything
  - it retries, but soon calls swap_out again and again

Without the patch, snapshot of readprofile looks like:
   3590781 total
   3289271 swap_out
    212029 smp_call_function
     22598 shrink_cache
     21833 lru_cache_add
      7787 get_user_pages

Most of the time was spent in swap_out (contention on the page_table_lock).

After applying the patch, the snapshot is like:
    17420 total
     3929 copy_page
     3677 statm_pgd_range
     1317 try_to_free_buffers
     1312 __copy_user
      593 scsi_make_request

Best regards.
--
NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

[-- Attachment #2: 05_vm_22_vm-anon-lru-3_2.4.25.diff --]
[-- Type: Text/Plain, Size: 3632 bytes --]

--- linux/include/linux/swap.h	2004/02/19 04:12:39	1.1.1.26
+++ linux/include/linux/swap.h	2004/03/10 10:09:11
@@ -116,7 +116,7 @@ extern void swap_setup(void);
 extern wait_queue_head_t kswapd_wait;
 extern int FASTCALL(try_to_free_pages_zone(zone_t *, unsigned int));
 extern int FASTCALL(try_to_free_pages(unsigned int));
-extern int vm_vfs_scan_ratio, vm_cache_scan_ratio, vm_lru_balance_ratio, vm_passes, vm_gfp_debug, vm_mapped_ratio;
+extern int vm_vfs_scan_ratio, vm_cache_scan_ratio, vm_lru_balance_ratio, vm_passes, vm_gfp_debug, vm_mapped_ratio, vm_anon_lru;
 
 /* linux/mm/page_io.c */
 extern void rw_swap_page(int, struct page *);
--- linux/include/linux/sysctl.h	2004/02/19 04:12:39	1.1.1.23
+++ linux/include/linux/sysctl.h	2004/03/10 10:09:11
@@ -156,6 +156,7 @@ enum
 	VM_MAPPED_RATIO=20,     /* amount of unfreeable pages that triggers swapout */
 	VM_LAPTOP_MODE=21,	/* kernel in laptop flush mode */
 	VM_BLOCK_DUMP=22,	/* dump fs activity to log */
+	VM_ANON_LRU=23,		/* immediately insert anon pages in the vm page lru */
 };
 
 
--- linux/kernel/sysctl.c	2003/12/02 04:48:47	1.1.1.22
+++ linux/kernel/sysctl.c	2004/03/10 10:09:12
@@ -287,6 +287,8 @@ static ctl_table vm_table[] = {
 	 &vm_cache_scan_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_MAPPED_RATIO, "vm_mapped_ratio", 
 	 &vm_mapped_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
+	{VM_ANON_LRU, "vm_anon_lru", 
+	 &vm_anon_lru, sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_LRU_BALANCE_RATIO, "vm_lru_balance_ratio", 
 	 &vm_lru_balance_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_PASSES, "vm_passes", 
--- linux/mm/memory.c	2003/12/02 04:48:47	1.1.1.31
+++ linux/mm/memory.c	2004/03/10 10:09:12
@@ -984,7 +984,8 @@ static int do_wp_page(struct mm_struct *
 		if (PageReserved(old_page))
 			++mm->rss;
 		break_cow(vma, new_page, address, page_table);
-		lru_cache_add(new_page);
+		if (vm_anon_lru)
+			lru_cache_add(new_page);
 
 		/* Free the old page.. */
 		new_page = old_page;
@@ -1215,7 +1216,8 @@ static int do_anonymous_page(struct mm_s
 		mm->rss++;
 		flush_page_to_ram(page);
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
-		lru_cache_add(page);
+		if (vm_anon_lru)
+			lru_cache_add(page);
 		mark_page_accessed(page);
 	}
 
@@ -1270,7 +1272,8 @@ static int do_no_page(struct mm_struct *
 		}
 		copy_user_highpage(page, new_page, address);
 		page_cache_release(new_page);
-		lru_cache_add(page);
+		if (vm_anon_lru)
+			lru_cache_add(page);
 		new_page = page;
 	}
 
--- linux/mm/vmscan.c	2004/02/19 04:12:33	1.1.1.32
+++ linux/mm/vmscan.c	2004/03/10 10:09:13
@@ -65,6 +65,27 @@ int vm_lru_balance_ratio = 2;
 int vm_vfs_scan_ratio = 6;
 
 /*
+ * "vm_anon_lru" select if to immdiatly insert anon pages in the
+ * lru. Immediatly means as soon as they're allocated during the
+ * page faults.
+ *
+ * If this is set to 0, they're inserted only after the first
+ * swapout.
+ *
+ * Having anon pages immediatly inserted in the lru allows the
+ * VM to know better when it's worthwhile to start swapping
+ * anonymous ram, it will start to swap earlier and it should
+ * swap smoother and faster, but it will decrease scalability
+ * on the >16-ways of an order of magnitude. Big SMP/NUMA
+ * definitely can't take an hit on a global spinlock at
+ * every anon page allocation. So this is off by default.
+ *
+ * Low ram machines that swaps all the time want to turn
+ * this on (i.e. set to 1).
+ */
+int vm_anon_lru = 1;
+
+/*
  * The swap-out function returns 1 if it successfully
  * scanned all the pages it was asked to (`count').
  * It returns zero if it couldn't do anything,


* Re: [2.4] heavy-load under swap space shortage
  2004-03-10 10:57           ` j-nomura
@ 2004-03-14 19:47             ` Marcelo Tosatti
  2004-03-14 19:54               ` Rik van Riel
  2004-03-14 20:15               ` Andrew Morton
  2004-05-26 12:41             ` Marcelo Tosatti
  1 sibling, 2 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2004-03-14 19:47 UTC (permalink / raw)
  To: j-nomura; +Cc: linux-kernel, akpm, andrea, riel, torvalds


Hi kernel colleagues, 

At first I was skeptical about the inclusion of this patch in v2.4 (due to the
freeze), but after thinking a bit more about it I have a few points in favour
of this modification (read Nomura's message below and the patch to see
what I'm talking about):

- It is off by default.
- It is very simple (non-intrusive); it just changes the point at which
anonymous pages are inserted into the LRU.
- When turned on, I don't see it being a reason for introducing new
bugs.

What do you think of this?

On Wed, 10 Mar 2004 j-nomura@ce.jp.nec.com wrote:

> After discussion with Hugh and recommendation from Andrea,
> it turns out that Andrea's 05_vm_22_vm-anon-lru-3 in 2.4.23aa2 solves
> the problem.
> ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa2/05_vm_22_vm-anon-lru-3
> 
> The patch adds a sysctl which accelerates performance on huge-memory
> machines. It doesn't affect anything if turned off.
> 
> Marcelo, could you apply this to 2.4.26-pre?
> (I attached a slightly modified patch in which the feature is turned
> off by default and which applies cleanly to the bk tree.)
> 
> 
> My test case was:
>   - there is a process with a large anonymous mapping
>   - there is a large amount of page cache and there are active I/O processes
>   - there are not many file mappings
> 
> So the problem happens in this way:
>   - shrink_cache tries scanning the inactive list, in which most pages
>     are anonymously mapped
>   - it soon falls into swap_out because of the many anonymous pages
>   - with no free swap space, it hardly frees anything
>   - it retries, but soon calls swap_out again and again
> 
> Without the patch, snapshot of readprofile looks like:
>    3590781 total
>    3289271 swap_out
>     212029 smp_call_function
>      22598 shrink_cache
>      21833 lru_cache_add
>       7787 get_user_pages
> 
> Most of the time was spent in swap_out (contention on the page_table_lock).
> 
> After applying the patch, the snapshot is like:
>     17420 total
>      3929 copy_page
>      3677 statm_pgd_range
>      1317 try_to_free_buffers
>      1312 __copy_user
>       593 scsi_make_request
> 
> Best regards.

--- linux/include/linux/swap.h	2004/02/19 04:12:39	1.1.1.26
+++ linux/include/linux/swap.h	2004/03/10 10:09:11
@@ -116,7 +116,7 @@ extern void swap_setup(void);
 extern wait_queue_head_t kswapd_wait;
 extern int FASTCALL(try_to_free_pages_zone(zone_t *, unsigned int));
 extern int FASTCALL(try_to_free_pages(unsigned int));
-extern int vm_vfs_scan_ratio, vm_cache_scan_ratio, vm_lru_balance_ratio, vm_passes, vm_gfp_debug, vm_mapped_ratio;
+extern int vm_vfs_scan_ratio, vm_cache_scan_ratio, vm_lru_balance_ratio, vm_passes, vm_gfp_debug, vm_mapped_ratio, vm_anon_lru;
 
 /* linux/mm/page_io.c */
 extern void rw_swap_page(int, struct page *);
--- linux/include/linux/sysctl.h	2004/02/19 04:12:39	1.1.1.23
+++ linux/include/linux/sysctl.h	2004/03/10 10:09:11
@@ -156,6 +156,7 @@ enum
 	VM_MAPPED_RATIO=20,     /* amount of unfreeable pages that triggers swapout */
 	VM_LAPTOP_MODE=21,	/* kernel in laptop flush mode */
 	VM_BLOCK_DUMP=22,	/* dump fs activity to log */
+	VM_ANON_LRU=23,		/* immediately insert anon pages in the vm page lru */
 };
 
 
--- linux/kernel/sysctl.c	2003/12/02 04:48:47	1.1.1.22
+++ linux/kernel/sysctl.c	2004/03/10 10:09:12
@@ -287,6 +287,8 @@ static ctl_table vm_table[] = {
 	 &vm_cache_scan_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_MAPPED_RATIO, "vm_mapped_ratio", 
 	 &vm_mapped_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
+	{VM_ANON_LRU, "vm_anon_lru", 
+	 &vm_anon_lru, sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_LRU_BALANCE_RATIO, "vm_lru_balance_ratio", 
 	 &vm_lru_balance_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_PASSES, "vm_passes", 
--- linux/mm/memory.c	2003/12/02 04:48:47	1.1.1.31
+++ linux/mm/memory.c	2004/03/10 10:09:12
@@ -984,7 +984,8 @@ static int do_wp_page(struct mm_struct *
 		if (PageReserved(old_page))
 			++mm->rss;
 		break_cow(vma, new_page, address, page_table);
-		lru_cache_add(new_page);
+		if (vm_anon_lru)
+			lru_cache_add(new_page);
 
 		/* Free the old page.. */
 		new_page = old_page;
@@ -1215,7 +1216,8 @@ static int do_anonymous_page(struct mm_s
 		mm->rss++;
 		flush_page_to_ram(page);
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
-		lru_cache_add(page);
+		if (vm_anon_lru)
+			lru_cache_add(page);
 		mark_page_accessed(page);
 	}
 
@@ -1270,7 +1272,8 @@ static int do_no_page(struct mm_struct *
 		}
 		copy_user_highpage(page, new_page, address);
 		page_cache_release(new_page);
-		lru_cache_add(page);
+		if (vm_anon_lru)
+			lru_cache_add(page);
 		new_page = page;
 	}
 
--- linux/mm/vmscan.c	2004/02/19 04:12:33	1.1.1.32
+++ linux/mm/vmscan.c	2004/03/10 10:09:13
@@ -65,6 +65,27 @@ int vm_lru_balance_ratio = 2;
 int vm_vfs_scan_ratio = 6;
 
 /*
+ * "vm_anon_lru" select if to immdiatly insert anon pages in the
+ * lru. Immediatly means as soon as they're allocated during the
+ * page faults.
+ *
+ * If this is set to 0, they're inserted only after the first
+ * swapout.
+ *
+ * Having anon pages immediatly inserted in the lru allows the
+ * VM to know better when it's worthwhile to start swapping
+ * anonymous ram, it will start to swap earlier and it should
+ * swap smoother and faster, but it will decrease scalability
+ * on the >16-ways of an order of magnitude. Big SMP/NUMA
+ * definitely can't take an hit on a global spinlock at
+ * every anon page allocation. So this is off by default.
+ *
+ * Low ram machines that swaps all the time want to turn
+ * this on (i.e. set to 1).
+ */
+int vm_anon_lru = 1;
+
+/*
  * The swap-out function returns 1 if it successfully
  * scanned all the pages it was asked to (`count').
  * It returns zero if it couldn't do anything,



* Re: [2.4] heavy-load under swap space shortage
  2004-03-14 19:47             ` Marcelo Tosatti
@ 2004-03-14 19:54               ` Rik van Riel
  2004-03-14 20:15               ` Andrew Morton
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2004-03-14 19:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: j-nomura, linux-kernel, akpm, andrea, torvalds

On Sun, 14 Mar 2004, Marcelo Tosatti wrote:

> - It is off by default.
> - It is very simple (non-intrusive); it just changes the point at which
> anonymous pages are inserted into the LRU.
> - When turned on, I don't see it being a reason for introducing new
> bugs.
> 
> What do you think of this?

1) Yes, the patch is harmless enough.
2) As long as the default behaviour doesn't change from
   how things are done now, the patch shouldn't introduce
   any regressions, so it should be safe to apply.


-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: [2.4] heavy-load under swap space shortage
  2004-03-14 19:47             ` Marcelo Tosatti
  2004-03-14 19:54               ` Rik van Riel
@ 2004-03-14 20:15               ` Andrew Morton
       [not found]                 ` <20040314230138.GV30940@dualathlon.random>
  1 sibling, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2004-03-14 20:15 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: j-nomura, linux-kernel, andrea, riel, torvalds

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
>  At first I was skeptical about the inclusion of this patch in v2.4 (due to the
>  freeze), but after thinking a bit more about it I have a few points in favour
>  of this modification (read Nomura's message below and the patch to see
>  what I'm talking about):
> 
>  - It is off by default.
>  - It is very simple (non-intrusive); it just changes the point at which
>  anonymous pages are inserted into the LRU.
>  - When turned on, I don't see it being a reason for introducing new
>  bugs.
> 
>  What do you think of this?

hm, I hadn't noticed that 2.4 was changed to _not_ add anon pages to the
LRU.  I'd always regarded that as a workaround for pagemap_lru_lock contention
on large SMP machines which would never get beyond the suse kernel.

Having a magic knob is a weak solution: the majority of people who are
affected by this problem won't know to turn it on.

I confess that I don't really understand the failure mode.  So we have
zillions of anon pages which are not on the LRU.  We call swap_out() and
scan all these pages, failing to find swapcache space for them.

Why does adding the pages to the LRU up-front solve the problem?

(And why cannot we lazily add these anon pages to the LRU in swap_out, and
avoid the need for the knob?)



* Re: [2.4] heavy-load under swap space shortage
       [not found]                 ` <20040314230138.GV30940@dualathlon.random>
@ 2004-03-14 23:22                   ` Andrew Morton
  2004-03-15  0:14                     ` Andrea Arcangeli
                                       ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Andrew Morton @ 2004-03-14 23:22 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: marcelo.tosatti, j-nomura, linux-kernel, riel, torvalds

Andrea Arcangeli <andrea@suse.de> wrote:
>
> > 
> > Having a magic knob is a weak solution: the majority of people who are
> > affected by this problem won't know to turn it on.
> 
> that's why I turned it _on_ by default in my tree ;)

So maybe Marcelo should apply this patch, and also turn it on by default.

> There are workloads where adding anonymous pages to the lru is
> suboptimal for both the vm (cache shrinking) and the fast path too
> (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> you're forced to add those pages to the lru somehow and that implies
> some form of locking.

Basically a bunch of tweaks:

- Per-zone lru locks (which implicitly made them per-node)

- Adding/removing sixteen pages for one taking of the lock (a rough sketch
  of this batching idea follows the list).

- Making the lock irq-safe (it had to be done for other reasons, but
  reduced contention by 30% on 4-way due to not having a CPU wander off to
  service an interrupt while holding a critical lock).

- In page reclaim, snip 32 pages off the lru completely and drop the
  lock while we go off and process them.
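
(Editorial aside: a purely illustrative sketch of the "sixteen pages per
lock acquisition" batching, not the actual 2.6 code.  lru_lock,
__lru_add_page() and struct lru_batch are hypothetical stand-ins.)

#include <linux/mm.h>
#include <linux/spinlock.h>

#define LRU_BATCH	16

static spinlock_t lru_lock = SPIN_LOCK_UNLOCKED;

struct lru_batch {
	int count;
	struct page *pages[LRU_BATCH];
};

/* Stand-in for the real list manipulation, done while holding lru_lock. */
static void __lru_add_page(struct page *page)
{
	/* e.g. list_add(&page->lru, &inactive_list) in real code */
}

/* Queue pages locally; take the shared lock once per LRU_BATCH pages. */
static void lru_batch_add(struct lru_batch *b, struct page *page)
{
	b->pages[b->count++] = page;
	if (b->count < LRU_BATCH)
		return;
	spin_lock(&lru_lock);
	while (b->count)
		__lru_add_page(b->pages[--b->count]);
	spin_unlock(&lru_lock);
}

(A real version would also flush any partially filled batch at the end of
a scan.)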



* Re: [2.4] heavy-load under swap space shortage
  2004-03-14 23:22                   ` Andrew Morton
@ 2004-03-15  0:14                     ` Andrea Arcangeli
  2004-03-15  4:38                       ` Nick Piggin
  2004-03-16  6:31                     ` Marcelo Tosatti
  2004-11-22 15:01                     ` Lazily add anonymous pages to LRU on v2.4? was " Marcelo Tosatti
  2 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15  0:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: marcelo.tosatti, j-nomura, linux-kernel, riel, torvalds

On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > > 
> > > Having a magic knob is a weak solution: the majority of people who are
> > > affected by this problem won't know to turn it on.
> > 
> > that's why I turned it _on_ by default in my tree ;)
> 
> So maybe Marcelo should apply this patch, and also turn it on by default.

yes, I would suggest so. If anybody can find any swap regression on
small UP machines then a report to us on l-k will be welcome. So far
nobody has noticed any swap difference at swap regime AFAIK, and the
improvement for the fast path is dramatic on the big smp boxes.

> > There are workloads where adding anonymous pages to the lru is
> > suboptimal for both the vm (cache shrinking) and the fast path too
> > (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> > you're forced to add those pages to the lru somehow and that implies
> > some form of locking.
> 
> Basically a bunch of tweaks:
> 
> - Per-zone lru locks (which implicitly made them per-node)

the 16-ways weren't NUMA, and these days 16-way HT (8-way phys) boxes are
not so uncommon anymore.

> 
> - Adding/removing sixteen pages for one taking of the lock.
> 
> - Making the lock irq-safe (it had to be done for other reasons, but
>   reduced contention by 30% on 4-way due to not having a CPU wander off to
>   service an interrupt while holding a critical lock).
> 
> - In page reclaim, snip 32 pages off the lru completely and drop the
>   lock while we go off and process them.

sounds good, thanks.

I don't see other ways to optimize it (and I never enjoyed too much the
per-zone lru since it has some downside too with a worst case on 2G
systems). Perhaps a further optimization could be a transient per-cpu
lru refiled only by the page reclaim (so absolutely lazy while lots of
ram is free), but maybe that's already what you're doing when you say
"Adding/removing sixteen pages for one taking of the lock". Though the
fact you say "sixteen pages" sounds like it's not as lazy as it could
be.


* Re: [2.4] heavy-load under swap space shortage
  2004-03-15  0:14                     ` Andrea Arcangeli
@ 2004-03-15  4:38                       ` Nick Piggin
  2004-03-15 11:49                         ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2004-03-15  4:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, marcelo.tosatti, j-nomura, linux-kernel, riel, torvalds



Andrea Arcangeli wrote:

>
>I don't see other ways to optimize it (and I never enjoyed too much the
>per-zone lru since it has some downside too with a worst case on 2G
>systems). Perhaps a further optimization could be a transient per-cpu
>lru refiled only by the page reclaim (so absolutely lazy while lots of
>ram is free), but maybe that's already what you're doing when you say
>"Adding/removing sixteen pages for one taking of the lock". Though the
>fact you say "sixteen pages" sounds like it's not as lazy as it could
>be.
>

Hi Andrea,
What are the downsides on a 2G system?



* Re: [2.4] heavy-load under swap space shortage
  2004-03-15  4:38                       ` Nick Piggin
@ 2004-03-15 11:49                         ` Andrea Arcangeli
  2004-03-15 13:23                           ` Rik van Riel
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15 11:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, marcelo.tosatti, j-nomura, linux-kernel, riel, torvalds

Hi Nick,

On Mon, Mar 15, 2004 at 03:38:51PM +1100, Nick Piggin wrote:
> 
> 
> Andrea Arcangeli wrote:
> 
> >
> >I don't see other ways to optimize it (and I never enjoyed too much the
> >per-zone lru since it has some downside too with a worst case on 2G
> >systems). Perhaps a further optimization could be a transient per-cpu
> >lru refiled only by the page reclaim (so absolutely lazy while lots of
> >ram is free), but maybe that's already what you're doing when you say
> >"Adding/removing sixteen pages for one taking of the lock". Though the
> >fact you say "sixteen pages" sounds like it's not as lazy as it could
> >be.
> >
> 
> Hi Andrea,
> What are the downsides on a 2G system?

it is absolutely the worst case since both lrus could be around the same
size (800M zone-normal-lru and 1.2G zone-highmem-lru), maximizing the
loss of "age" information needed for optimal reclaim decisions.


* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 11:49                         ` Andrea Arcangeli
@ 2004-03-15 13:23                           ` Rik van Riel
  2004-03-15 14:37                             ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2004-03-15 13:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds

On Mon, 15 Mar 2004, Andrea Arcangeli wrote:

> it is absolutely the worst case since both lrus could be around the same
> size (800M zone-normal-lru and 1.2G zone-highmem-lru), maximizing the
> loss of "age" information needed for optimal reclaim decisions.

You only lose age information if you don't put equal aging
pressure on both zones.  If you make sure the allocation and
pageout pressure are more or less in line with the zone sizes,
why would you lose any aging information ?

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan



* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 13:23                           ` Rik van Riel
@ 2004-03-15 14:37                             ` Nick Piggin
  2004-03-15 14:50                               ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2004-03-15 14:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds



Rik van Riel wrote:

>On Mon, 15 Mar 2004, Andrea Arcangeli wrote:
>
>
>>it is absolutely the worst case since both lrus could be around the same
>>size (800M zone-normal-lru and 1.2G zone-highmem-lru), maximizing the
>>loss of "age" information needed for optimal reclaim decisions.
>>
>
>You only lose age information if you don't put equal aging
>pressure on both zones.  If you make sure the allocation and
>pageout pressure are more or less in line with the zone sizes,
>why would you lose any aging information ?
>
>

I can't see that you would, no. But maybe I've missed something.
We apply pressure equally except when there is a shortage in a
low memory zone, in which case we can scan only the required
zone(s).

This case I think is well worth the unfairness it causes, because it
means your zone's pages can be freed quickly and without freeing pages
from other zones.



* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 14:37                             ` Nick Piggin
@ 2004-03-15 14:50                               ` Andrea Arcangeli
  2004-03-15 18:35                                 ` Andrew Morton
  2004-03-15 22:05                                 ` Nick Piggin
  0 siblings, 2 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15 14:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds

On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> This case I think is well worth the unfairness it causes, because it
> means your zone's pages can be freed quickly and without freeing pages
> from other zones.

freeing pages from other zones is perfectly fine, the classzone design
gets it right, you have to free memory from the other zones too or you
have no way to work on a 1G machine. you call the thing "unfair" when it
has nothing to do with fairness; your unfairness is the slowdown I
pointed out, it's all about being able to maintain a more reliable cache
information from the point of view of the pagecache users (the pagecache
users care about the _classzone_, they can't care about the zones
themselves), it has nothing to do with fairness.


* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 14:50                               ` Andrea Arcangeli
@ 2004-03-15 18:35                                 ` Andrew Morton
  2004-03-15 18:51                                   ` Andrea Arcangeli
  2004-03-15 22:05                                 ` Nick Piggin
  1 sibling, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2004-03-15 18:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: piggin, riel, marcelo.tosatti, j-nomura, linux-kernel, torvalds

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> > This case I think is well worth the unfairness it causes, because it
> > means your zone's pages can be freed quickly and without freeing pages
> > from other zones.
> 
> freeing pages from other zones is perfectly fine, the classzone design
> gets it right, you have to free memory from the other zones too or you
> have no way to work on a 1G machine. you call the thing "unfair" when it
> has nothing to do with fairness; your unfairness is the slowdown I
> pointed out,

This "slowdown" is purely theoretical and has never been demonstrated.

One could just as easily point at the fact that on a 32GB machine with a
single LRU we have to send 64 highmem pages to the wrong end of the LRU for
each scanned lowmem page, thus utterly destroying any concept of it being
an LRU in the first place.  But this is also theoretical, and has never
been demonstrated and is thus uninteresting.

What _is_ interesting is the way in which the single LRU collapses when
there are a huge number of highmem pages on the tail and then there
is a surge in lowmem demand.  This was demonstrated, and is what prompted
the per-zone LRU.




Begin forwarded message:

Date: Sun, 04 Aug 2002 01:35:22 -0700
From: Andrew Morton <akpm@zip.com.au>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: how not to write a search algorithm


Worked out why my box is going into a 3-5 minute coma with one test.
Think what the LRUs look like when the test first hits page reclaim
on this 2.5G ia32 box:

               head                           tail
active_list:   <800M of ZONE_NORMAL> <200M of ZONE_HIGHMEM>
inactive_list:          <1.5G of ZONE_HIGHMEM>

now, somebody does a GFP_KERNEL allocation.

uh-oh.

VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
the inactive list.  It then scans 5000 pages, achieving nothing.

VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
the inactive list.  It then scans about 10000 pages, achieving nothing.

VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
the inactive list.  It then scans about 20000 pages, achieving nothing.

VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
the inactive list.  It then scans about 40000 pages, achieving nothing.

VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
the inactive list.  It then scans about 80000 pages, achieving nothing.

VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
the inactive list.  It then scans about 160000 pages, achieving nothing.

VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
the inactive list.  It then scans about 320000 pages, achieving nothing.

The page allocation fails.  So __alloc_pages tries it all again.


This all gets rather boring.


Per-zone LRUs will fix it up.  We need that anyway, because a ZONE_NORMAL
request will bogusly refile, on average, memory_size/800M pages to the
head of the inactive list, thus wrecking page aging.

Alan's kernel has a nice-looking implementation.  I'll lift that out
next week unless someone beats me to it.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/


* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 18:35                                 ` Andrew Morton
@ 2004-03-15 18:51                                   ` Andrea Arcangeli
  2004-03-15 19:02                                     ` Andrew Morton
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15 18:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: piggin, riel, marcelo.tosatti, j-nomura, linux-kernel, torvalds

On Mon, Mar 15, 2004 at 10:35:10AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> > > This case I think is well worth the unfairness it causes, because it
> > > means your zone's pages can be freed quickly and without freeing pages
> > > from other zones.
> > 
> > freeing pages from other zones is perfectly fine, the classzone design
> > gets it right, you have to free memory from the other zones too or you
> > have no way to work on a 1G machine. you call the thing "unfair" when it
> > has nothing to do with fairness; your unfairness is the slowdown I
> > pointed out,
> 
> This "slowdown" is purely theoretical and has never been demonstrated.

on a 32G box the slowdown is zero, as it's zero on a 1G box too, you
definitely need a 2G box to measure it.

The effect is that you can do stuff like 'cvs up' and you will end up
caching just 1G instead of 2G. Or do I miss something? If I owned a
2G box I would hate to be able to cache just 1G (yeah, the cache is 2G
but half of that cache is pinned and it sits there with years-old data,
so effectively you lose 50% of the ram in the box in terms of cache
utilization).

> One could just as easily point at the fact that on a 32GB machine with a
> single LRU we have to send 64 highmem pages to the wrong end of the LRU for
> each scanned lowmem page, thus utterly destroying any concept of it being
> an LRU in the first place.  But this is also theoretical, and has never
> been demonstrated and is thus uninteresting.

the lowmem zone on a 32G box is completely reserved for zone-normal
allocations, and dcache shrinks aren't too frequent in some workloads, but
you're certainly right that on a 32G box the per-zone lru is optimal in
terms of cpu utilization (on 64bit it makes no difference either way;
the GFP_DMA allocations are so seldom that throwing a bit of
cpu at those rare allocations is fine).

> 
> Worked out why my box is going into a 3-5 minute coma with one test.
> Think what the LRUs look like when the test first hits page reclaim
> on this 2.5G ia32 box:
> 
>                head                           tail
> active_list:   <800M of ZONE_NORMAL> <200M of ZONE_HIGHMEM>
> inactive_list:          <1.5G of ZONE_HIGHMEM>
> 
> now, somebody does a GFP_KERNEL allocation.
> 
> uh-oh.
> 
> VM calls refill_inactive.  That moves 25 ZONE_HIGHMEM pages onto
> the inactive list.  It then scans 5000 pages, achieving nothing.

I fixed this in my tree a long time ago, you certainly don't need
per-zone lru to fix this (though for a 32G box the per-zone lru doesn't
only fix it, it also saves lots of cpu compared to the global lru).
See the refill_inactive code in my tree:

static void refill_inactive(int nr_pages, zone_t * classzone)
{
	struct list_head * entry;
	unsigned long ratio;

	ratio = (unsigned long) nr_pages * classzone->nr_active_pages /
		(((unsigned long) classzone->nr_inactive_pages *
		  vm_lru_balance_ratio) + 1);

	entry = active_list.prev;
	while (ratio && entry != &active_list) {
		struct page * page;
		int related_metadata = 0;

		page = list_entry(entry, struct page, lru);
		entry = entry->prev;

		if (!memclass(page_zone(page), classzone)) {
			/*
			 * Hack to address an issue found by Rik. The problem
			 * is that highmem pages can hold buffer headers
			 * allocated from the slab on lowmem, and so if we are
			 * working on the NORMAL classzone here, it is correct
			 * not to try to free the highmem pages themself (that
			 * would be useless) but we must make sure to drop any
			 * lowmem metadata related to those highmem pages.
			 */
			if (page->buffers && page->mapping) { /* fast path racy check */
				if (unlikely(TryLockPage(page)))
					continue;
				if (page->buffers && page->mapping && memclass_related_bhs(page, classzone)) /* non racy check */
					related_metadata = 1;
				UnlockPage(page);
			}
			if (!related_metadata)
				continue;
		}

		if (PageTestandClearReferenced(page)) {
			list_del(&page->lru);
			list_add(&page->lru, &active_list);
			continue;
		}

		if (!related_metadata)
			ratio--;

		del_page_from_active_list(page);
		add_page_to_inactive_list(page);
		SetPageReferenced(page);
	}
	if (entry != &active_list) {
		list_del(&active_list);
		list_add(&active_list, entry);
	}
}


the memclass checks guarantee that we make progress. The old vm code
(that you inherited in 2.5) missed those bits, I believe.

without those fixes the 2.4 vm wouldn't perform on 32G (as you also
found during 2.5).


* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 18:51                                   ` Andrea Arcangeli
@ 2004-03-15 19:02                                     ` Andrew Morton
  2004-03-15 21:55                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2004-03-15 19:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: piggin, riel, marcelo.tosatti, j-nomura, linux-kernel, torvalds

Andrea Arcangeli <andrea@suse.de> wrote:
>
> The effect is that you can do stuff like 'cvs up' and you will end up
>  caching just 1G instead of 2G. Or do I miss something? If I owned a
>  2G box I would hate to be able to cache just 1G (yeah, the cache is 2G
>  but half of that cache is pinned and it sits there with years-old data,
>  so effectively you lose 50% of the ram in the box in terms of cache
>  utilization).

Nope, we fill all zones with pagecache and once they've all reached
pages_low we scan all zones in proportion to their size.  So the
probability of a page being scanned is independent of its zone.
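
(Editorial aside: a minimal sketch of that proportional-scanning idea,
with hypothetical names -- not the actual mm/vmscan.c code:)

/*
 * Give each zone a scan target proportional to its size, so the chance
 * of any particular page being scanned is independent of its zone.
 * On a 2G box split 800M ZONE_NORMAL / 1.2G ZONE_HIGHMEM this puts
 * about 40% of the scanning on the normal zone and 60% on highmem.
 */
static unsigned long zone_scan_target(unsigned long zone_size,
				      unsigned long total_size,
				      unsigned long total_scan)
{
	return total_scan * zone_size / total_size;
}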

It took a bit of diddling, but it seems to work OK now.  Here are the
relevant bits of /proc/vmstat from a 1G machine, running 2.6.4-rc1-mm1 with
13 days uptime:

pgalloc_high 65658111
pgalloc_normal 384294820
pgalloc_dma 617780

pgrefill_high 5980273
pgrefill_normal 11873490
pgrefill_dma 69861

pgsteal_high 2377905
pgsteal_normal 10504356
pgsteal_dma 4756

pgscan_kswapd_high 3621882
pgscan_kswapd_normal 15652593
pgscan_kswapd_dma 99

pgscan_direct_high 54120
pgscan_direct_normal 162353
pgscan_direct_dma 69377

These are approximately balanced wrt the zone sizes, with a bias towards
ZONE_NORMAL because of non-highmem allocations.  It's not perfect, but we
did fix a few things up after 2.6.4-rc1-mm1.


* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 19:02                                     ` Andrew Morton
@ 2004-03-15 21:55                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15 21:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: piggin, riel, marcelo.tosatti, j-nomura, linux-kernel, torvalds

On Mon, Mar 15, 2004 at 11:02:40AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > The effect is that you can do stuff like 'cvs up' and you will end up
> >  caching just 1G instead of 2G. Or do I miss something? If I owned a
> >  2G box I would hate to be able to cache just 1G (yeah, the cache is 2G
> >  but half of that cache is pinned and it sits there with years-old data,
> >  so effectively you lose 50% of the ram in the box in terms of cache
> >  utilization).
> 
> Nope, we fill all zones with pagecache and once they've all reached
> pages_low we scan all zones in proportion to their size.  So the
> probability of a page being scanned is independent of its zone.
> 
> It took a bit of diddling, but it seems to work OK now.  Here are the
> relevant bits of /proc/vmstat from a 1G machine, running 2.6.4-rc1-mm1 with
> 13 days uptime:
> 
> pgalloc_high 65658111
> pgalloc_normal 384294820
> pgalloc_dma 617780
> 
> pgrefill_high 5980273
> pgrefill_normal 11873490
> pgrefill_dma 69861
> 
> pgsteal_high 2377905
> pgsteal_normal 10504356
> pgsteal_dma 4756
> 
> pgscan_kswapd_high 3621882
> pgscan_kswapd_normal 15652593
> pgscan_kswapd_dma 99
> 
> pgscan_direct_high 54120
> pgscan_direct_normal 162353
> pgscan_direct_dma 69377
> 
> These are approximately balanced wrt the zone sizes, with a bias towards
> ZONE_NORMAL because of non-highmem allocations.  It's not perfect, but we
> did fix a few things up after 2.6.4-rc1-mm1.

as long as you don't always start from the highmem zone (so you need a
per-classzone variable to keep track of the last zone scanned and to
start shrinking from zone-normal and zone-dma if needed), the above
should avoid the problem I mentioned for the 2G setup.


* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 14:50                               ` Andrea Arcangeli
  2004-03-15 18:35                                 ` Andrew Morton
@ 2004-03-15 22:05                                 ` Nick Piggin
  2004-03-15 22:24                                   ` Andrea Arcangeli
  2004-03-16  7:25                                   ` Marcelo Tosatti
  1 sibling, 2 replies; 42+ messages in thread
From: Nick Piggin @ 2004-03-15 22:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds



Andrea Arcangeli wrote:

>On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
>
>>This case I think is well worth the unfairness it causes, because it
>>means your zone's pages can be freed quickly and without freeing pages
>>from other zones.
>>
>
>freeing pages from other zones is perfectly fine, the classzone design
>gets it right, you have to free memory from the other zones too or you
>have no way to work on a 1G machine. you call the thing "unfair" when it
>has nothing to do with fairness; your unfairness is the slowdown I
>pointed out, it's all about being able to maintain a more reliable cache
>information from the point of view of the pagecache users (the pagecache
>users care about the _classzone_, they can't care about the zones
>themselves), it has nothing to do with fairness.
>
>

What I meant by unfairness is that low zone scanning in response
to low zone pressure will not put any pressure on higher zones.
Thus pages in higher zones have an advantage.

We do scan lowmem in response to highmem pressure.



* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 22:05                                 ` Nick Piggin
@ 2004-03-15 22:24                                   ` Andrea Arcangeli
  2004-03-15 22:41                                     ` Nick Piggin
  2004-03-15 22:41                                     ` Rik van Riel
  2004-03-16  7:25                                   ` Marcelo Tosatti
  1 sibling, 2 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15 22:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds

On Tue, Mar 16, 2004 at 09:05:32AM +1100, Nick Piggin wrote:
> 
> What I meant by unfairness is that low zone scanning in response
> to low zone pressure will not put any pressure on higher zones.
> Thus pages in higher zones have an advantage.

OK, I see what you mean now; in this sense the unfairness is the same
with the global LRU too.

> We do scan lowmem in response to highmem pressure.

As I told Andrew, you also have to make sure not to always start from the
highmem zone, and from the code this seems not to be the case, so my 2G
scenario still applies.

Obviously I expected that you would scan the lowmem zones too, otherwise
you couldn't cache more than 100M or so on a 1G box.

shrink_caches(struct zone **zones, int priority, int *total_scanned,
		int gfp_mask, int nr_pages, struct page_state *ps)
{
	int ret = 0;
	int i;

	for (i = 0; zones[i] != NULL; i++) {
		int to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX);
		struct zone *zone = zones[i];
		int nr_mapped = 0;
		int max_scan;

you seem to always start from zones[0] (that zone ** thing is the
zonelist, so it starts with highmem, then normal, then dma, depending on
the classzone you're shrinking). That will generate the waste of cache
in a 2G box that I described.

I'm reading 2.6.4 mainline here.

To really fix it, you need global information keeping track of the last
zone shrunk, so that the scan keeps going round-robin.

Either that or you can choose to do some overwork and to shrink from all
the zones removing this break:

		if (ret >= nr_pages)
			break;

but as far as I can tell, the 50% waste of cache in a 2G box can happen
in 2.6.4 and it won't happen in 2.4.x.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 22:24                                   ` Andrea Arcangeli
@ 2004-03-15 22:41                                     ` Nick Piggin
  2004-03-15 22:44                                       ` Andrea Arcangeli
  2004-03-15 22:41                                     ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Nick Piggin @ 2004-03-15 22:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds



Andrea Arcangeli wrote:

>
>Either that or you can choose to do some overwork and to shrink from all
>the zones removing this break:
>
>		if (ret >= nr_pages)
>			break;
>
>but as far as I can tell, the 50% waste of cache in a 2G box can happen
>in 2.6.4 and it won't happen in 2.4.x.
>
>

Yeah you are right. Some patches have since gone into 2.6-bk and
this is one of the things fixed up.

Nick


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 22:24                                   ` Andrea Arcangeli
  2004-03-15 22:41                                     ` Nick Piggin
@ 2004-03-15 22:41                                     ` Rik van Riel
  2004-03-15 23:32                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2004-03-15 22:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds

On Mon, 15 Mar 2004, Andrea Arcangeli wrote:

> As I told Andrew, you also have to make sure not to always start from the
> highmem zone, and from the code this seems not to be the case, so my 2G
> scenario still applies.

Agreed, the scenario applies.  However, I don't see how a
global LRU would fix it in eg. the case of an AMD64 NUMA
system...

And once we fix it right for those NUMA systems, we can
use the same code to take care of balancing between zones
on normal PCs, giving us the scalability benefits of the
per-zone lists and locks.

> 	for (i = 0; zones[i] != NULL; i++) {
> 		int to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX);

> Either that or you can choose to do some overwork and to shrink from all
> the zones removing this break:
> 
> 		if (ret >= nr_pages)
> 			break;

That's probably the nicest solution.  Though you will want
to cap it at a certain high water mark (2 * pages_high?) so
you don't end up freeing all of highmem on a burst of lowmem
pressure.

> but as far as I can tell, the 50% waste of cache in a 2G box can happen
> in 2.6.4 and it won't happen in 2.4.x.

How about AMD64 NUMA systems ?
What evens out the LRU pressure there in 2.4 ?

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
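
Rik's suggestion above can be sketched as follows (illustrative only, not
actual 2.6 code; shrink_one_zone() is an invented helper): scan every zone
in the fallback list, but cap what is reclaimed from any one zone at twice
its pages_high mark, so a burst of lowmem pressure cannot empty highmem.

static int shrink_all_zones(struct zone **zones, int nr_pages)
{
	int freed = 0;
	int i;

	for (i = 0; zones[i] != NULL; i++) {
		struct zone *zone = zones[i];
		int to_reclaim = nr_pages;

		/* per-zone ceiling, e.g. twice the zone's pages_high mark */
		if (to_reclaim > 2 * zone->pages_high)
			to_reclaim = 2 * zone->pages_high;

		freed += shrink_one_zone(zone, to_reclaim);
		/* note: no early "if (freed >= nr_pages) break;" here */
	}
	return freed;
}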


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 22:41                                     ` Nick Piggin
@ 2004-03-15 22:44                                       ` Andrea Arcangeli
  0 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15 22:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds

On Tue, Mar 16, 2004 at 09:41:24AM +1100, Nick Piggin wrote:
> 
> 
> Andrea Arcangeli wrote:
> 
> >
> >Either that or you can choose to do some overwork and to shrink from all
> >the zones removing this break:
> >
> >		if (ret >= nr_pages)
> >			break;
> >
> >but as far as I can tell, the 50% waste of cache in a 2G box can happen
> >in 2.6.4 and it won't happen in 2.4.x.
> >
> >
> 
> Yeah you are right. Some patches have since gone into 2.6-bk and
> this is one of the things fixed up.

sounds great, thanks!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 22:41                                     ` Rik van Riel
@ 2004-03-15 23:32                                       ` Andrea Arcangeli
  2004-03-16  6:27                                         ` Nick Piggin
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-15 23:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Nick Piggin, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds

On Mon, Mar 15, 2004 at 05:41:54PM -0500, Rik van Riel wrote:
> On Mon, 15 Mar 2004, Andrea Arcangeli wrote:
> 
> > As I told Andrew, you also have to make sure not to always start from the
> > highmem zone, and from the code this seems not to be the case, so my 2G
> > scenario still applies.
> 
> Agreed, the scenario applies.  However, I don't see how a
> global LRU would fix it in eg. the case of an AMD64 NUMA
> system...

I think I mentioned that a per-node LRU would be enough for NUMA; I'm only
talking here about the per-zone LRU, per-node NUMA needs are another matter.
For 64-bit, per-node or per-zone is basically the same in practice.

However, once that is fixed after 2.6.4, even the per-zone LRU should not
generate loss of caching info, so with that part fixed I'm not against
per-zone even if it's more difficult to be fair.

> And once we fix it right for those NUMA systems, we can
> use the same code to take care of balancing between zones
> on normal PCs, giving us the scalability benefits of the
> per-zone lists and locks.

I dispute those scalability benefits of the locks; on a 32G machine or on
a 1G machine the lock benefits are near zero. The only significant
benefit is in terms of computational complexity of the normal-zone
allocations, where we'll only walk the zone-normal and zone-dma
pages.

> How about AMD64 NUMA systems ?
> What evens out the LRU pressure there in 2.4 ?

By the time you say 64-bit you can forget the per-zone/per-node
differences.  Sure, there will still be a difference, but it's cosmetic,
so I don't care about those per-zone LRU issues for 64-bit hardware; in
fact on 64-bit hardware per-zone (even if totally unfair) is the most
optimal, just in case somebody asks for ZONE_DMA more than once per day.
But the difference is so small in practice that even global would be OK.

The per-node issue on NUMA (not necessarily on amd64; in fact on amd64
the penalty is so small that I doubt things like that will pay off big)
still remains, but that's not the thing I was discussing here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 23:32                                       ` Andrea Arcangeli
@ 2004-03-16  6:27                                         ` Nick Piggin
  0 siblings, 0 replies; 42+ messages in thread
From: Nick Piggin @ 2004-03-16  6:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Andrew Morton, marcelo.tosatti, j-nomura,
	linux-kernel, torvalds



Andrea Arcangeli wrote:

>
>I dispute those scalability benefits of the locks; on a 32G machine or on
>a 1G machine the lock benefits are near zero. The only significant
>benefit is in terms of computational complexity of the normal-zone
>allocations, where we'll only walk the zone-normal and zone-dma
>pages.
>
>

Out of interest, are there workloads on 8 and 16-way UMA systems
that have lru_lock scalability problems in 2.6? Anyone know?



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-14 23:22                   ` Andrew Morton
  2004-03-15  0:14                     ` Andrea Arcangeli
@ 2004-03-16  6:31                     ` Marcelo Tosatti
  2004-03-16 13:47                       ` Andrea Arcangeli
  2004-11-22 15:01                     ` Lazily add anonymous pages to LRU on v2.4? was " Marcelo Tosatti
  2 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2004-03-16  6:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, marcelo.tosatti, j-nomura, linux-kernel, riel,
	torvalds



On Sun, 14 Mar 2004, Andrew Morton wrote:

> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > > 
> > > Having a magic knob is a weak solution: the majority of people who are
> > > affected by this problem won't know to turn it on.
> > 
> > that's why I turned it _on_ by default in my tree ;)
> 
> So maybe Marcelo should apply this patch, and also turn it on by default.

Hhhmm, not so easy I guess. What about the added overhead of 
lru_cache_add() for every anonymous page created? 

I bet this will cause problems for users who are happy with the current
behaviour. Won't it?

Andrea, do you have any numbers (or at least estimates) for the added
overhead of immediately adding anon pages to the LRU? That would be
cool to know.

> > There are workloads where adding anonymous pages to the lru is
> > suboptimal for both the vm (cache shrinking) and the fast path too
> > (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> > you're forced to add those pages to the lru somehow and that implies
> > some form of locking.
> 
> Basically a bunch of tweeaks:
> 
> - Per-zone lru locks (which implicitly made them per-node)
> 
> - Adding/removing sixteen pages for one taking of the lock.
> 
> - Making the lock irq-safe (it had to be done for other reasons, but
>   reduced contention by 30% on 4-way due to not having a CPU wander off to
>   service an interrupt while holding a critical lock).
> 
> - In page reclaim, snip 32 pages off the lru completely and drop the
>   lock while we go off and process them.

Obviously we don't have, and don't want, such things in 2.4.

Anyway, it seems this discussion is being productive. Glad!
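
For readers unfamiliar with the batching Andrew lists above, a rough
sketch of the idea (names invented for the example; the real 2.6 code is
the pagevec machinery): take the irq-safe LRU lock once per batch of
pages instead of once per page.

#define LRU_BATCH 16

extern spinlock_t lru_lock;		/* stand-in for the per-zone LRU lock */

struct lru_batch {
	int count;
	struct page *pages[LRU_BATCH];
};

static void batched_lru_add(struct lru_batch *b, struct page *page)
{
	b->pages[b->count++] = page;
	if (b->count < LRU_BATCH)
		return;

	/* one irq-safe lock round-trip covers the whole batch */
	spin_lock_irq(&lru_lock);
	while (b->count)
		__add_page_to_lru(b->pages[--b->count]);	/* invented helper */
	spin_unlock_irq(&lru_lock);
}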


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-15 22:05                                 ` Nick Piggin
  2004-03-15 22:24                                   ` Andrea Arcangeli
@ 2004-03-16  7:25                                   ` Marcelo Tosatti
  1 sibling, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2004-03-16  7:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrea Arcangeli, Rik van Riel, Andrew Morton, marcelo.tosatti,
	j-nomura, linux-kernel, torvalds



On Tue, 16 Mar 2004, Nick Piggin wrote:

> 
> 
> Andrea Arcangeli wrote:
> 
> >On Tue, Mar 16, 2004 at 01:37:04AM +1100, Nick Piggin wrote:
> >
> >>This case I think is well worth the unfairness it causes, because it
> >>means your zone's pages can be freed quickly and without freeing pages
> >>from other zones.
> >>
> >
> >freeing pages from other zones is perfectly fine, the classzone design
> >gets it right, you have to free memory from the other zones too or you
> >have no way to work on a 1G machine. You call the thing "unfair" when it
> >has nothing to do with fairness; your unfairness is the slowdown I
> >pointed out. It's all about being able to maintain more reliable cache
> >information from the point of view of the pagecache users (the pagecache
> >users care about the _classzone_, they can't care about the zones
> >themselves), it has nothing to do with fairness.
> >
> >
> 
> What I meant by unfairness is that low zone scanning in response
> to low zone pressure will not put any pressure on higher zones.
> Thus pages in higher zones have an advantage.
> 
> We do scan lowmem in response to highmem pressure.

Hi Nick, 

I'm having a good time reading this discussion, so let me jump in.

Sure, the "unfairness" between lowmem and highmem exists. Quoting what 
you said, "pages in higher zones have an advantage". 

That is natural; after all, the need for lowmem pages is much higher
than the need for highmem pages. And this need for the precious lowmem
increases as the lowmem/highmem ratio grows.

As Andrew has demonstrated, the problems previously caused by such
"unfairness" are nonexistent with per-zone LRU lists.

So, yes, we have unfairness between lowmem and highmem, and yes, that is
the way it should be.

I felt you had a problem with such a thing; however, I don't see one.

Am I missing something?

Regards


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-16  6:31                     ` Marcelo Tosatti
@ 2004-03-16 13:47                       ` Andrea Arcangeli
  2004-03-16 16:59                         ` Marcelo Tosatti
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2004-03-16 13:47 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, j-nomura, linux-kernel, riel, torvalds

On Tue, Mar 16, 2004 at 03:31:33AM -0300, Marcelo Tosatti wrote:
> 
> 
> On Sun, 14 Mar 2004, Andrew Morton wrote:
> 
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > > 
> > > > Having a magic knob is a weak solution: the majority of people who are
> > > > affected by this problem won't know to turn it on.
> > > 
> > > that's why I turned it _on_ by default in my tree ;)
> > 
> > So maybe Marcelo should apply this patch, and also turn it on by default.
> 
> Hhhmm, not so easy I guess. What about the added overhead of 
> lru_cache_add() for every anonymous page created? 
> 
> I bet this will cause problems for users who are happy with the current
> behaviour. Won't it?

the lru_cache_add is happening in 2.4 mainline, the only point of the
patch is to _avoid_ calling lru_cache_add (tunable with a sysctl so you
can get to the old behaviour of calling lru_cache_add for every anon
page).

> Andrea, do you have any numbers (or at least estimates) for the added
> overhead of immediately adding anon pages to the LRU? That would be
> cool to know.

I have the numbers for the removed overhead; it's significant in some
workloads, but only on >=16-way machines.

> Obviously we don't have, and don't want, such things in 2.4.

agreed ;)

> Anyway, it seems this discussion is being productive. Glad!

yep!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-16 13:47                       ` Andrea Arcangeli
@ 2004-03-16 16:59                         ` Marcelo Tosatti
  0 siblings, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2004-03-16 16:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Marcelo Tosatti, Andrew Morton, j-nomura, linux-kernel, riel, torvalds



On Tue, 16 Mar 2004, Andrea Arcangeli wrote:

> On Tue, Mar 16, 2004 at 03:31:33AM -0300, Marcelo Tosatti wrote:
> > 
> > 
> > On Sun, 14 Mar 2004, Andrew Morton wrote:
> > 
> > > Andrea Arcangeli <andrea@suse.de> wrote:
> > > >
> > > > > 
> > > > > Having a magic knob is a weak solution: the majority of people who are
> > > > > affected by this problem won't know to turn it on.
> > > > 
> > > > that's why I turned it _on_ by default in my tree ;)
> > > 
> > > So maybe Marcelo should apply this patch, and also turn it on by default.
> > 
> > Hhhmm, not so easy I guess. What about the added overhead of 
> > lru_cache_add() for every anonymous page created? 
> > 
> > I bet this will cause problems for users who are happy with the current
> > behaviour. Won't it?
> 
> the lru_cache_add is happening in 2.4 mainline, the only point of the
> patch is to _avoid_ calling lru_cache_add (tunable with a sysctl so you
> can get to the old behaviour of calling lru_cache_add for every anon
> page).

Uh oh, just ignore me. 

I misread the message, and misunderstood the whole thing. Will go reread
the patch, and the code.

> > Andrea, do you have any numbers (or at least estimates) for the added
> > overhead of immediately adding anon pages to the LRU? That would be
> > cool to know.
> 
> I have the numbers for the removed overhead; it's significant in some
> workloads, but only on >=16-way machines.

And for those workloads one should be able to turn it off - right.

> > Obviously we don't have, and don't want, such things in 2.4.
> 
> agreed ;)
> 
> > Anyway, it seems this discussion is being productive. Glad!
> 
> yep!
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-03-10 10:57           ` j-nomura
  2004-03-14 19:47             ` Marcelo Tosatti
@ 2004-05-26 12:41             ` Marcelo Tosatti
  2004-05-26 18:24               ` Marc-Christian Petersen
                                 ` (3 more replies)
  1 sibling, 4 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2004-05-26 12:41 UTC (permalink / raw)
  To: j-nomura; +Cc: linux-kernel, andrea, Andrew Morton, hugh

Andrea, Hugh, Jun'ichi,

I think we can merge this patch.

It's very safe - default behaviour unchanged.

Jun, are you willing to do another test for us if this gets merged
in v2.4.27-pre4 ?

Maybe we should document the VM tunables somewhere outside source code
(Documentation/) ?

On Wed, Mar 10, 2004 at 07:57:07PM +0900, j-nomura@ce.jp.nec.com wrote:
> After discussion with Hugh and recommendation from Andrea,
> it turns out that Andrea's 05_vm_22_vm-anon-lru-3 in 2.4.23aa2 solves
> the problem.
> ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa2/05_vm_22_vm-anon-lru-3
> 
> The patch adds a sysctl which accelerates performance on
> huge-memory machines. It doesn't affect anything if turned off.
> 
> Marcelo, could you apply this to 2.4.26-pre?
> (I attached the slightly modified patch in which the feature is turned
> off by default and which applies cleanly to the bk tree.)
> 
> 
> My test case was:
>   - there is a process with a large anonymous mapping
>   - there is a large amount of page cache and there are active I/O processes
>   - there are not many file mappings
> 
> So the problem happens in this way:
>   - shrink_cache tries scanning the inactive list, in which most of the
>     pages are anonymously mapped
>   - it soon falls into swap_out because of too many anonymous pages
>   - with no free swap space, it hardly frees anything
>   - it retries, but soon calls swap_out again and again
> 
> Without the patch, snapshot of readprofile looks like:
>    3590781 total
>    3289271 swap_out
>     212029 smp_call_function
>      22598 shrink_cache
>      21833 lru_cache_add
>       7787 get_user_pages
> 
> Most of the time was spent in swap_out (contention on page_table_lock).
> 
> After applying the patch, the snapshot is like:
>     17420 total
>      3929 copy_page
>      3677 statm_pgd_range
>      1317 try_to_free_buffers
>      1312 __copy_user
>       593 scsi_make_request
> 
> Best regards.
> --
> NOMURA, Jun'ichi <j-nomura@ce.jp.nec.com>

> --- linux/include/linux/swap.h	2004/02/19 04:12:39	1.1.1.26
> +++ linux/include/linux/swap.h	2004/03/10 10:09:11
> @@ -116,7 +116,7 @@ extern void swap_setup(void);
>  extern wait_queue_head_t kswapd_wait;
>  extern int FASTCALL(try_to_free_pages_zone(zone_t *, unsigned int));
>  extern int FASTCALL(try_to_free_pages(unsigned int));
> -extern int vm_vfs_scan_ratio, vm_cache_scan_ratio, vm_lru_balance_ratio, vm_passes, vm_gfp_debug, vm_mapped_ratio;
> +extern int vm_vfs_scan_ratio, vm_cache_scan_ratio, vm_lru_balance_ratio, vm_passes, vm_gfp_debug, vm_mapped_ratio, vm_anon_lru;
>  
>  /* linux/mm/page_io.c */
>  extern void rw_swap_page(int, struct page *);
> --- linux/include/linux/sysctl.h	2004/02/19 04:12:39	1.1.1.23
> +++ linux/include/linux/sysctl.h	2004/03/10 10:09:11
> @@ -156,6 +156,7 @@ enum
>  	VM_MAPPED_RATIO=20,     /* amount of unfreeable pages that triggers swapout */
>  	VM_LAPTOP_MODE=21,	/* kernel in laptop flush mode */
>  	VM_BLOCK_DUMP=22,	/* dump fs activity to log */
> +	VM_ANON_LRU=23,		/* immediatly insert anon pages in the vm page lru */
>  };
>  
>  
> --- linux/kernel/sysctl.c	2003/12/02 04:48:47	1.1.1.22
> +++ linux/kernel/sysctl.c	2004/03/10 10:09:12
> @@ -287,6 +287,8 @@ static ctl_table vm_table[] = {
>  	 &vm_cache_scan_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
>  	{VM_MAPPED_RATIO, "vm_mapped_ratio", 
>  	 &vm_mapped_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
> +	{VM_ANON_LRU, "vm_anon_lru", 
> +	 &vm_anon_lru, sizeof(int), 0644, NULL, &proc_dointvec},
>  	{VM_LRU_BALANCE_RATIO, "vm_lru_balance_ratio", 
>  	 &vm_lru_balance_ratio, sizeof(int), 0644, NULL, &proc_dointvec},
>  	{VM_PASSES, "vm_passes", 
> --- linux/mm/memory.c	2003/12/02 04:48:47	1.1.1.31
> +++ linux/mm/memory.c	2004/03/10 10:09:12
> @@ -984,7 +984,8 @@ static int do_wp_page(struct mm_struct *
>  		if (PageReserved(old_page))
>  			++mm->rss;
>  		break_cow(vma, new_page, address, page_table);
> -		lru_cache_add(new_page);
> +		if (vm_anon_lru)
> +			lru_cache_add(new_page);
>  
>  		/* Free the old page.. */
>  		new_page = old_page;
> @@ -1215,7 +1216,8 @@ static int do_anonymous_page(struct mm_s
>  		mm->rss++;
>  		flush_page_to_ram(page);
>  		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
> -		lru_cache_add(page);
> +		if (vm_anon_lru)
> +			lru_cache_add(page);
>  		mark_page_accessed(page);
>  	}
>  
> @@ -1270,7 +1272,8 @@ static int do_no_page(struct mm_struct *
>  		}
>  		copy_user_highpage(page, new_page, address);
>  		page_cache_release(new_page);
> -		lru_cache_add(page);
> +		if (vm_anon_lru)
> +			lru_cache_add(page);
>  		new_page = page;
>  	}
>  
> --- linux/mm/vmscan.c	2004/02/19 04:12:33	1.1.1.32
> +++ linux/mm/vmscan.c	2004/03/10 10:09:13
> @@ -65,6 +65,27 @@ int vm_lru_balance_ratio = 2;
>  int vm_vfs_scan_ratio = 6;
>  
>  /*
> + * "vm_anon_lru" select if to immdiatly insert anon pages in the
> + * lru. Immediatly means as soon as they're allocated during the
> + * page faults.
> + *
> + * If this is set to 0, they're inserted only after the first
> + * swapout.
> + *
> + * Having anon pages immediatly inserted in the lru allows the
> + * VM to know better when it's worthwhile to start swapping
> + * anonymous ram, it will start to swap earlier and it should
> + * swap smoother and faster, but it will decrease scalability
> + * on the >16-ways of an order of magnitude. Big SMP/NUMA
> + * definitely can't take an hit on a global spinlock at
> + * every anon page allocation. So this is off by default.
> + *
> + * Low ram machines that swaps all the time want to turn
> + * this on (i.e. set to 1).
> + */
> +int vm_anon_lru = 1;
> +
> +/*
>   * The swap-out function returns 1 if it successfully
>   * scanned all the pages it was asked to (`count').
>   * It returns zero if it couldn't do anything,
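
The vm_anon_lru knob added by the patch above is an ordinary /proc/sys
integer, so no special tooling is needed to flip it. A minimal userspace
sketch (only meaningful once the patch is applied):

#include <stdio.h>

int main(void)
{
	/* 0 = defer adding anon pages to the LRU until the first swapout,
	 * 1 = add them immediately at page-fault time (the default). */
	FILE *f = fopen("/proc/sys/vm/vm_anon_lru", "w");

	if (!f) {
		perror("/proc/sys/vm/vm_anon_lru");
		return 1;
	}
	fprintf(f, "0\n");
	fclose(f);
	return 0;
}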


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-05-26 12:41             ` Marcelo Tosatti
@ 2004-05-26 18:24               ` Marc-Christian Petersen
  2004-05-27 11:16                 ` Marcelo Tosatti
  2004-05-26 19:06               ` Hugh Dickins
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 42+ messages in thread
From: Marc-Christian Petersen @ 2004-05-26 18:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Marcelo Tosatti, j-nomura, andrea, Andrew Morton, hugh

[-- Attachment #1: Type: text/plain, Size: 606 bytes --]

On Wednesday 26 May 2004 14:41, Marcelo Tosatti wrote:

Marcelo,

> I think we can merge this patch.

I think this too =)


> It's very safe - default behaviour unchanged.
> Jun, are you willing to do another test for us if this gets merged
> in v2.4.27-pre4 ?
> Maybe we should document the VM tunables somewhere outside source code
> (Documentation/) ?

I think we should merge the attached patches to finally remove utterly
bogus documentation of non-existent things, clean up stuff a bit, and
document the -aa VM bits.

Agreed?

Kinda same cleanups and more following soon for 2.6-mm.

ciao, Marc


[-- Attachment #2: 02_add-new-docu-VM.patch --]
[-- Type: text/x-diff, Size: 24952 bytes --]

--- a/Documentation/sysctl/vm.txt	2004-05-26 19:57:15.000000000 +0200
+++ b/Documentation/sysctl/vm.txt	2004-05-26 20:06:20.000000000 +0200
@@ -1,111 +1,143 @@
-Documentation for /proc/sys/vm/*	kernel version 2.4.19
-	(c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>
+Documentation for /proc/sys/vm/*	Kernel version 2.4.26
+=============================================================
 
-For general info and legal blurb, please look in README.
+ (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
+    - Initial version
 
-==============================================================
+ (c) 2004, Marc-Christian Petersen <m.c.p@linux-systeme.com>
+    - Removed non-existent knobs which were removed in early
+      2.4 stages
+    - Corrected values for bdflush
+    - Documented missing tunables
+    - Documented aa-vm tunables
+
+
+
+For general info and legal blurb, please look in README.
+=============================================================
 
 This file contains the documentation for the sysctl files in
-/proc/sys/vm and is valid for Linux kernel version 2.4.
+/proc/sys/vm and is valid for Linux kernel v2.4.26.
 
 The files in this directory can be used to tune the operation
 of the virtual memory (VM) subsystem of the Linux kernel, and
-one of the files (bdflush) also has a little influence on disk
-usage.
+three of the files (bdflush, max-readahead, min-readahead)
+also have some influence on disk usage.
 
 Default values and initialization routines for most of these
-files can be found in mm/swap.c.
+files can be found in mm/vmscan.c, mm/page_alloc.c and
+mm/filemap.c.
 
 Currently, these files are in /proc/sys/vm:
 - bdflush
+- block_dump
 - kswapd
+- laptop_mode
+- max-readahead
+- min-readahead
 - max_map_count
 - overcommit_memory
 - page-cluster
 - pagetable_cache
+- vm_anon_lru
+- vm_cache_scan_ratio
+- vm_gfp_debug
+- vm_lru_balance_ratio
+- vm_mapped_ratio
+- vm_passes
+- vm_vfs_scan_ratio
+=============================================================
 
-==============================================================
 
-bdflush:
 
+bdflush:
+--------
 This file controls the operation of the bdflush kernel
 daemon. The source code to this struct can be found in
-linux/fs/buffer.c. It currently contains 9 integer values,
+fs/buffer.c. It currently contains 9 integer values,
 of which 6 are actually used by the kernel.
 
-From linux/fs/buffer.c:
---------------------------------------------------------------
-union bdflush_param {
-	struct {
-		int nfract;	/* Percentage of buffer cache dirty to
-				   activate bdflush */
-		int ndirty;	/* Maximum number of dirty blocks to write out per
-				   wake-cycle */
-		int dummy2;	/* old "nrefill" */
-		int dummy3;	/* unused */
-		int interval;	/* jiffies delay between kupdate flushes */
-		int age_buffer;	/* Time for normal buffer to age before we flush it */
-		int nfract_sync;/* Percentage of buffer cache dirty to
-				   activate bdflush synchronously */
-		int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
-		int dummy5;	/* unused */
-	} b_un;
-	unsigned int data[N_PARAM];
-} bdf_prm = {{30, 500, 0, 0, 5*HZ, 30*HZ, 60, 20, 0}};
---------------------------------------------------------------
-
-int nfract:
-The first parameter governs the maximum number of dirty
-buffers in the buffer cache. Dirty means that the contents
-of the buffer still have to be written to disk (as opposed
-to a clean buffer, which can just be forgotten about).
-Setting this to a high value means that Linux can delay disk
-writes for a long time, but it also means that it will have
-to do a lot of I/O at once when memory becomes short. A low
-value will spread out disk I/O more evenly, at the cost of
-more frequent I/O operations.  The default value is 30%,
-the minimum is 0%, and the maximum is 100%.
-
-int ndirty:
-The second parameter (ndirty) gives the maximum number of
-dirty buffers that bdflush can write to the disk in one time.
-A high value will mean delayed, bursty I/O, while a small
-value can lead to memory shortage when bdflush isn't woken
-up often enough.
-
-int interval:
-The fifth parameter, interval, is the minimum rate at
-which kupdate will wake and flush.  The value is expressed in
-jiffies (clockticks), the number of jiffies per second is
-normally 100 (Alpha is 1024). Thus, x*HZ is x seconds.  The
-default value is 5 seconds, the minimum is 0 seconds, and the
-maximum is 600 seconds.
-
-int age_buffer:
-The sixth parameter, age_buffer, governs the maximum time
-Linux waits before writing out a dirty buffer to disk.  The
-value is in jiffies.  The default value is 30 seconds,
-the minimum is 1 second, and the maximum 6,000 seconds.
-
-int nfract_sync:
-The seventh parameter, nfract_sync, governs the percentage
-of buffer cache that is dirty before bdflush activates
-synchronously.  This can be viewed as the hard limit before
-bdflush forces buffers to disk.  The default is 60%, the
-minimum is 0%, and the maximum is 100%.
-
-int nfract_stop_bdflush:
-The eighth parameter, nfract_stop_bdflush, governs the percentage
-of buffer cache that is dirty which will stop bdflush.
-The default is 20%, the miniumum is 0%, and the maxiumum is 100%.
-==============================================================
+nfract:		The first parameter governs the maximum
+		number of dirty buffers in the buffer
+		cache. Dirty means that the contents of the
+		buffer still have to be written to disk (as
+		opposed to a clean buffer, which can just be
+		forgotten about). Setting this to a high
+		value means that Linux can delay disk writes
+		for a long time, but it also means that it
+		will have to do a lot of I/O at once when
+		memory becomes short. A low value will
+		spread out disk I/O more evenly, at the cost
+		of more frequent I/O operations. The default
+		value is 30%, the minimum is 0%, and the
+		maximum is 100%.
+
+ndirty:		The second parameter (ndirty) gives the
+		maximum number of dirty buffers that bdflush
+		can write to the disk in one time. A high
+		value will mean delayed, bursty I/O, while a
+		small value can lead to memory shortage when
+		bdflush isn't woken up often enough. The
+		default value is 500 dirty buffers, the
+		minimum is 1, and the maximum is 50000.
+
+dummy2:		The third parameter is not used.
+
+dummy3:		The fourth parameter is not used.
+
+interval:	The fifth parameter, interval, is the minimum
+		rate at which kupdate will wake and flush.
+		The value is in jiffies (clockticks), the
+		number of jiffies per second is normally 100
+		(Alpha is 1024). Thus, x*HZ is x seconds. The
+		default value is 5 seconds, the minimum	is 0
+		seconds, and the maximum is 10,000 seconds.
+
+age_buffer:	The sixth parameter, age_buffer, governs the
+		maximum time Linux waits before writing out a
+		dirty buffer to disk. The value is in jiffies.
+		The default value is 30 seconds, the minimum
+		is 1 second, and the maximum 10,000 seconds.
+
+sync:		The seventh parameter, nfract_sync, governs
+		the percentage of buffer cache that is dirty
+		before bdflush activates synchronously. This
+		can be viewed as the hard limit before
+		bdflush forces buffers to disk. The default
+		is 60%,	the minimum is 0%, and the maximum
+		is 100%.
+
+stop_bdflush:	The eighth parameter, nfract_stop_bdflush,
+		governs the percentage of buffer cache that
+		is dirty which will stop bdflush. The default
+		is 20%, the minimum is 0%, and the maximum
+		is 100%.
+
+dummy5:		The ninth parameter is not used.
+
+So the default is: 30 500 0 0 500 3000 60 20 0   for 100 HZ.
+=============================================================
+
+
+
+block_dump:
+-----------
+It can happen that the disk still keeps spinning up and you
+don't quite know why or what causes it. The laptop mode patch
+has a little helper for that as well. When set to 1, it will
+dump info to the kernel message buffer about what process
+caused the io. Be careful when playing with this setting.
+It is advisable to shut down syslog first! The default is 0.
+=============================================================
+
 
-kswapd:
 
+kswapd:
+-------
 Kswapd is the kernel swapout daemon. That is, kswapd is that
 piece of the kernel that frees memory when it gets fragmented
-or full. Since every system is different, you'll probably want
-some control over this piece of the system.
+or full. Since every system is different, you'll probably
+want some control over this piece of the system.
 
 The numbers in this page correspond to the numbers in the
 struct pager_daemon {tries_base, tries_min, swap_cluster
@@ -117,39 +149,83 @@ tries_base	The maximum number of pages k
 		number. Usually this number will be divided
 		by 4 or 8 (see mm/vmscan.c), so it isn't as
 		big as it looks.
-		When you need to increase the bandwidth to/from
-		swap, you'll want to increase this number.
+		When you need to increase the bandwidth to/
+		from swap, you'll want to increase this
+		number.
+
 tries_min	This is the minimum number of times kswapd
 		tries to free a page each time it is called.
 		Basically it's just there to make sure that
 		kswapd frees some pages even when it's being
 		called with minimum priority.
+
 swap_cluster	This is the number of pages kswapd writes in
 		one turn. You want this large so that kswapd
 		does it's I/O in large chunks and the disk
-		doesn't have to seek often, but you don't want
-		it to be too large since that would flood the
-		request queue.
+		doesn't have to seek often, but you don't
+		want it to be too large since that would
+		flood the request queue.
+
+The default value is: 512 32 8.
+=============================================================
 
-==============================================================
 
-overcommit_memory:
 
-This value contains a flag that enables memory overcommitment.
-When this flag is 0, the kernel checks before each malloc()
-to see if there's enough memory left. If the flag is nonzero,
-the system pretends there's always enough memory.
+laptop_mode:
+------------
+Setting this to 1 switches the vm (and block layer) to laptop
+mode. Leaving it at 0 makes the kernel work like before. When
+in laptop mode, you also want to extend the intervals
+described in Documentation/laptop-mode.txt.
+See the laptop-mode.sh script for how to do that.
+
+The default value is 0.
+=============================================================
 
-This feature can be very useful because there are a lot of
-programs that malloc() huge amounts of memory "just-in-case"
-and don't use much of it.
 
-Look at: mm/mmap.c::vm_enough_memory() for more information.
 
-==============================================================
+max-readahead:
+--------------
+This tunable affects how early the Linux VFS will fetch the
+next block of a file from memory. File readahead values are
+determined on a per file basis in the VFS and are adjusted
+based on the behavior of the application accessing the file.
+Anytime the current position being read in a file plus the
+current read ahead value results in the file pointer pointing
+to the next block in the file, that block will be fetched
+from disk. By raising this value, the Linux kernel will allow
+the readahead value to grow larger, resulting in more blocks
+being prefetched from disks which predictably access files in
+uniform linear fashion. This can result in performance
+improvements, but can also result in excess (and often
+unnecessary) memory usage. Lowering this value has the
+opposite effect. By forcing readaheads to be less aggressive,
+memory may be conserved at a potential performance impact.
+
+The default value is 31.
+=============================================================
 
-max_map_count:
 
+
+min-readahead:
+--------------
+Like max-readahead, min-readahead places a floor on the
+readahead value. Raising this number forces a file's readahead
+value to be unconditionally higher, which can bring about
+performance improvements, provided that all file access in
+the system is predictably linear from the start to the end of
+a file. This of course results in higher memory usage from
+the pagecache. Conversely, lowering this value allows the
+kernel to conserve pagecache memory, at a potential
+performance cost.
+
+The default value is 3.
+=============================================================
+
+
+
+max_map_count:
+--------------
 This file contains the maximum number of memory map areas a
 process may have. Memory map areas are used as a side-effect
 of calling malloc, directly by mmap and mprotect, and also
@@ -159,10 +235,29 @@ While most applications need less than a
 certain programs, particularly malloc debuggers, may consume 
 lots of them, e.g. up to one or two maps per allocation.
 
-==============================================================
+The default value is 65536.
+=============================================================
+
+
+
+overcommit_memory:
+------------------
+This value contains a flag to enable memory overcommitment.
+When this flag is 0, the kernel checks before each malloc()
+to see if there's enough memory left. If the flag is nonzero,
+the system pretends there's always enough memory.
+
+This feature can be very useful because there are a lot of
+programs that malloc() huge amounts of memory "just-in-case"
+and don't use much of it. The default value is 0.
+
+Look at: mm/mmap.c::vm_enough_memory() for more information.
+=============================================================
+
 
-page-cluster:
 
+page-cluster:
+-------------
 The Linux VM subsystem avoids excessive disk seeks by reading
 multiple pages on a page fault. The number of pages it reads
 is dependent on the amount of memory in your machine.
@@ -170,11 +265,12 @@ is dependent on the amount of memory in 
 The number of pages the kernel reads in at once is equal to
 2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
 for swap because we only cluster swap data in 32-page groups.
+=============================================================
 
-==============================================================
 
-pagetable_cache:
 
+pagetable_cache:
+----------------
 The kernel keeps a number of page tables in a per-processor
 cache (this helps a lot on SMP systems). The cache size for
 each processor will be between the low and the high value.
@@ -188,3 +284,98 @@ For large systems, the settings are prob
 systems they won't hurt a bit. For small systems (<16MB ram)
 it might be advantageous to set both values to 0.
 
+The default value is: 25 50.
+=============================================================
+
+
+
+vm_anon_lru:
+------------
+selects whether to immediately insert anon pages in the LRU.
+Immediately means as soon as they're allocated during page
+faults. If this is set to 0, they're inserted only after the
+first swapout.
+
+Having anon pages immediately inserted in the LRU allows the
+VM to know better when it's worthwhile to start swapping
+anonymous ram; it will start to swap earlier and it should
+swap more smoothly and faster, but it will decrease
+scalability on >16-way systems by an order of magnitude.
+Big SMP/NUMA definitely can't take a hit on a global
+spinlock at every anon page allocation.
+
+Low-RAM machines that swap all the time want to turn
+this on (i.e. set to 1).
+
+The default value is 1.
+=============================================================
+
+
+
+vm_cache_scan_ratio:
+--------------------
+is how much of the inactive LRU queue we will scan in one go.
+A value of 6 for vm_cache_scan_ratio implies that we'll scan
+1/6 of the inactive lists during a normal aging round.
+
+The default value is 6.
+=============================================================
+
+
+
+vm_gfp_debug:
+------------
+is when __alloc_pages fails, dump us a stack. This will
+mostly happen during OOM conditions (hopefully ;)
+
+The default value is 0.
+=============================================================
+
+
+
+vm_lru_balance_ratio:
+---------------------
+controls the balance between active and inactive cache. The
+bigger vm_lru_balance_ratio is, the more easily the active
+cache will grow, because we'll rotate the active list slowly.
+A value of 2 means we'll go towards a balance of 1/3 of the
+cache being inactive.
+
+The default value is 2.
+=============================================================
+
+
+
+vm_mapped_ratio:
+----------------
+controls the pageout rate; the smaller it is, the earlier
+we'll start to page out.
+
+The default value is 100.
+=============================================================
+
+
+
+vm_passes:
+----------
+is the number of vm passes before failing the memory
+balancing. Take into account 3 passes are needed for a
+flush/wait/free cycle and that we only scan
+1/vm_cache_scan_ratio of the inactive list at each pass.
+
+The default value is 60.
+=============================================================
+
+
+
+vm_vfs_scan_ratio:
+------------------
+is what proportion of the VFS queues we will scan in one go.
+A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
+unused-inode, dentry and dquot caches will be freed during a
+normal aging round.
+Big fileservers (NFS, SMB etc.) probably want to set this
+value to 3 or 2.
+
+The default value is 6.
+=============================================================
--- a/Documentation/filesystems/proc.txt	2004-05-23 00:08:31.000000000 +0200
+++ b/Documentation/filesystems/proc.txt	2004-05-23 02:33:41.000000000 +0200
@@ -936,172 +936,7 @@ program to load modules on demand.
 
 2.4 /proc/sys/vm - The virtual memory subsystem
 -----------------------------------------------
-
-The files  in  this directory can be used to tune the operation of the virtual
-memory (VM)  subsystem  of  the  Linux  kernel.  In addition, one of the files
-(bdflush) has some influence on disk usage.
-
-bdflush
--------
-
-This file  controls  the  operation of the bdflush kernel daemon. It currently
-contains nine  integer  values,  six of which are actually used by the kernel.
-They are listed in table 2-2.
-
-
-Table 2-2: Parameters in /proc/sys/vm/bdflush 
-..............................................................................
- Value      Meaning                                                            
- nfract     Percentage of buffer cache dirty to activate bdflush              
- ndirty     Maximum number of dirty blocks to  write out per wake-cycle        
- dummy      Unused                                                             
- dummy      Unused                                                             
- interval   jiffies delay between kupdate flushes
- age_buffer Time for normal buffer to age before we flush it                   
- nfract_sync Percentage of buffer cache dirty to activate bdflush synchronously
- nfract_stop_bdflush Percetange of buffer cache dirty to stop bdflush
- dummy      Unused                                                             
-..............................................................................
-
-nfract
-------
-
-This parameter  governs  the  maximum  number  of  dirty buffers in the buffer
-cache. Dirty means that the contents of the buffer still have to be written to
-disk (as  opposed  to  a  clean  buffer,  which  can just be forgotten about).
-Setting this  to  a  higher value means that Linux can delay disk writes for a
-long time, but it also means that it will have to do a lot of I/O at once when
-memory becomes short. A lower value will spread out disk I/O more evenly.
-
-interval
---------
-
-The interval between two kupdate runs. The value is expressed in
-jiffies (clockticks),  the  number of jiffies per second is 100.
-
-ndirty
-------
-
-Ndirty gives the maximum number of dirty buffers that bdflush can write to the
-disk at  one  time.  A high value will mean delayed, bursty I/O, while a small
-value can lead to memory shortage when bdflush isn't woken up often enough.
-
-age_buffer
-----------
-
-Finally, the age_buffer parameter govern the maximum time Linux
-waits before  writing  out  a  dirty buffer to disk. The value is expressed in
-jiffies (clockticks),  the  number of jiffies per second is 100.
-
-nfract_sync
------------
-
-nfract_stop_bdflush
--------------------
-
-kswapd
-------
-
-Kswapd is  the  kernel  swap  out daemon. That is, kswapd is that piece of the
-kernel that  frees  memory when it gets fragmented or full. Since every system
-is different, you'll probably want some control over this piece of the system.
-
-The file contains three numbers:
-
-tries_base
-----------
-
-The maximum  number  of  pages kswapd tries to free in one round is calculated
-from this  number.  Usually  this  number  will  be  divided  by  4  or 8 (see
-mm/vmscan.c), so it isn't as big as it looks.
-
-When you  need to increase the bandwidth to/from swap, you'll want to increase
-this number.
-
-tries_min
----------
-
-This is  the  minimum number of times kswapd tries to free a page each time it
-is called. Basically it's just there to make sure that kswapd frees some pages
-even when it's being called with minimum priority.
-
-overcommit_memory
------------------
-
-This file  contains  one  value.  The following algorithm is used to decide if
-there's enough  memory:  if  the  value of overcommit_memory is positive, then
-there's always  enough  memory. This is a useful feature, since programs often
-malloc() huge  amounts  of  memory 'just in case', while they only use a small
-part of  it.  Leaving  this value at 0 will lead to the failure of such a huge
-malloc(), when in fact the system has enough memory for the program to run.
-
-On the  other  hand,  enabling this feature can cause you to run out of memory
-and thrash the system to death, so large and/or important servers will want to
-set this value to 0.
-
-pagetable_cache
----------------
-
-The kernel  keeps a number of page tables in a per-processor cache (this helps
-a lot  on  SMP systems). The cache size for each processor will be between the
-low and the high value.
-
-On a  low-memory,  single  CPU system, you can safely set these values to 0 so
-you don't  waste  memory.  It  is  used  on SMP systems so that the system can
-perform fast  pagetable allocations without having to acquire the kernel memory
-lock.
-
-For large  systems,  the  settings  are probably fine. For normal systems they
-won't hurt  a  bit.  For  small  systems  (  less  than  16MB ram) it might be
-advantageous to set both values to 0.
-
-swapctl
--------
-
-This file  contains  no less than 8 variables. All of these values are used by
-kswapd.
-
-The first four variables
-* sc_max_page_age,
-* sc_page_advance,
-* sc_page_decline and
-* sc_page_initial_age
-are used  to  keep  track  of  Linux's page aging. Page aging is a bookkeeping
-method to  track  which pages of memory are often used, and which pages can be
-swapped out without consequences.
-
-When a  page  is  swapped in, it starts at sc_page_initial_age (default 3) and
-when the  page  is  scanned  by  kswapd,  its age is adjusted according to the
-following scheme:
-
-* If  the  page  was used since the last time we scanned, its age is increased
-  by sc_page_advance  (default  3).  Where  the  maximum  value  is  given  by
-  sc_max_page_age (default 20).
-* Otherwise  (meaning  it wasn't used) its age is decreased by sc_page_decline
-  (default 1).
-
-When a page reaches age 0, it's ready to be swapped out.
-
-The variables  sc_age_cluster_fract, sc_age_cluster_min, sc_pageout_weight and
-sc_bufferout_weight, can  be  used  to  control  kswapd's  aggressiveness  in
-swapping out pages.
-
-Sc_age_cluster_fract is used to calculate how many pages from a process are to
-be scanned by kswapd. The formula used is
-
-(sc_age_cluster_fract divided by 1024) times resident set size
-
-So if you want kswapd to scan the whole process, sc_age_cluster_fract needs to
-have a  value  of  1024.  The  minimum  number  of  pages  kswapd will scan is
-represented by sc_age_cluster_min, which is done so that kswapd will also scan
-small processes.
-
-The values  of  sc_pageout_weight  and sc_bufferout_weight are used to control
-how many  tries  kswapd  will make in order to swap out one page/buffer. These
-values can  be used to fine-tune the ratio between user pages and buffer/cache
-memory. When  you find that your Linux system is swapping out too many process
-pages in  order  to  satisfy  buffer  memory  demands,  you may want to either
-increase sc_bufferout_weight, or decrease the value of sc_pageout_weight.
+Please read Documentation/sysctl/vm.txt
 
 2.5 /proc/sys/dev - Device specific parameters
 ----------------------------------------------
@@ -1719,10 +1719,3 @@ need to  recompile  the kernel, or even 
 command to write value into these files, thereby changing the default settings
 of the kernel.
 ------------------------------------------------------------------------------
-
-
-
-
-
-
-

[-- Attachment #3: 01_remove-old-docu-VM.patch --]
[-- Type: text/x-diff, Size: 5132 bytes --]

--- a/Documentation/sysctl/vm.txt	2002-11-28 16:53:08.000000000 -0700
+++ b/Documentation/sysctl/vm.txt	2003-11-12 17:35:11.000000000 -0700
@@ -18,13 +18,10 @@
 
 Currently, these files are in /proc/sys/vm:
 - bdflush
-- buffermem
-- freepages
 - kswapd
 - max_map_count
 - overcommit_memory
 - page-cluster
-- pagecache
 - pagetable_cache
 
 ==============================================================
@@ -102,38 +99,6 @@
 of buffer cache that is dirty which will stop bdflush.
 The default is 20%, the miniumum is 0%, and the maxiumum is 100%.
 ==============================================================
-buffermem:
-
-The three values in this file correspond to the values in
-the struct buffer_mem. It controls how much memory should
-be used for buffer memory. The percentage is calculated
-as a percentage of total system memory.
-
-The values are:
-min_percent	-- this is the minimum percentage of memory
-		   that should be spent on buffer memory
-borrow_percent  -- UNUSED
-max_percent     -- UNUSED
-
-==============================================================
-freepages:
-
-This file contains the values in the struct freepages. That
-struct contains three members: min, low and high.
-
-The meaning of the numbers is:
-
-freepages.min	When the number of free pages in the system
-		reaches this number, only the kernel can
-		allocate more memory.
-freepages.low	If the number of free pages gets below this
-		point, the kernel starts swapping aggressively.
-freepages.high	The kernel tries to keep up to this amount of
-		memory free; if memory comes below this point,
-		the kernel gently starts swapping in the hopes
-		that it never has to do real aggressive swapping.
-
-==============================================================
 
 kswapd:
 
@@ -208,24 +173,6 @@
 
 ==============================================================
 
-pagecache:
-
-This file does exactly the same as buffermem, only this
-file controls the struct page_cache, and thus controls
-the amount of memory used for the page cache.
-
-In 2.2, the page cache is used for 3 main purposes:
-- caching read() data from files
-- caching mmap()ed data and executable files
-- swap cache
-
-When your system is both deep in swap and high on cache,
-it probably means that a lot of the swapped data is being
-cached, making for more efficient swapping than possible
-with the 2.0 kernel.
-
-==============================================================
-
 pagetable_cache:
 
 The kernel keeps a number of page tables in a per-processor
--- a/Documentation/filesystems/proc.txt	2004-05-21 22:54:13.000000000 +0200
+++ b/Documentation/filesystems/proc.txt	2004-05-23 00:08:09.000000000 +0200
@@ -999,54 +999,6 @@ nfract_sync
 nfract_stop_bdflush
 -------------------
 
-buffermem
----------
-
-The three  values  in  this  file  control  how much memory should be used for
-buffer memory.  The  percentage  is calculated as a percentage of total system
-memory.
-
-The values are:
-
-min_percent
------------
-
-This is  the  minimum  percentage  of  memory  that  should be spent on buffer
-memory.
-
-borrow_percent
---------------
-
-When Linux is short on memory, and the buffer cache uses more than it has been
-allotted, the  memory  management  (MM)  subsystem will prune the buffer cache
-more heavily than other memory to compensate.
-
-max_percent
------------
-
-This is the maximum amount of memory that can be used for buffer memory.
-
-freepages
----------
-
-This file contains three values: min, low and high:
-
-min
----
-When the  number  of  free  pages  in the system reaches this number, only the
-kernel can allocate more memory.
-
-low
----
-If the number of free pages falls below this point, the kernel starts swapping
-aggressively.
-
-high
-----
-The kernel  tries  to  keep  up to this amount of memory free; if memory falls
-below this point, the kernel starts gently swapping in the hopes that it never
-has to do really aggressive swapping.
-
 kswapd
 ------
 
@@ -1073,16 +1025,6 @@ This is  the  minimum number of times ks
 is called. Basically it's just there to make sure that kswapd frees some pages
 even when it's being called with minimum priority.
 
-swap_cluster
-------------
-
-This is probably the greatest influence on system performance.
-
-swap_cluster is  the  number  of  pages kswapd writes in one turn. You'll want
-this value  to  be  large  so that kswapd does its I/O in large chunks and the
-disk doesn't  have  to  seek  as  often, but you don't want it to be too large
-since that would flood the request queue.
-
 overcommit_memory
 -----------------
 
@@ -1097,15 +1039,6 @@ On the  other  hand,  enabling this feat
 and thrash the system to death, so large and/or important servers will want to
 set this value to 0.
 
-pagecache
----------
-
-This file  does exactly the same job as buffermem, only this file controls the
-amount of memory allowed for memory mapping and generic caching of files.
-
-You don't  want  the  minimum level to be too low, otherwise your system might
-thrash when memory is tight or fragmentation is high.
-
 pagetable_cache
 ---------------
 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-05-26 12:41             ` Marcelo Tosatti
  2004-05-26 18:24               ` Marc-Christian Petersen
@ 2004-05-26 19:06               ` Hugh Dickins
  2004-05-26 22:23               ` Andrea Arcangeli
  2004-05-28  2:55               ` j-nomura
  3 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2004-05-26 19:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: j-nomura, linux-kernel, andrea, Andrew Morton

On Wed, 26 May 2004, Marcelo Tosatti wrote:

> Andrea, Hugh, Jun'ichi,
> 
> I think we can merge this patch.

I guess so.  I'm unenthusiastic since I've never worked out whether
it's _right_, or just an ad hoc hack that happens to work around
more fundamental issues, quite successfully in some workloads.

Andrea seems to have devised it to reduce pagemap_lru_lock
contention on bigiron, yet here it's solving a different problem.
Which may be a sign that it's a great patch, or a sign that we
(I!) don't understand what goes on here well enough.

Please don't count me as against it: I just don't know.

(My involvement was earlier when Jun'ichi reported page_table_lock
contention there.  We were working together on an entirely different
kind of patch addressing that issue, when Andrea suggested he try this
vm_anon_lru patch.  As I understand it, that solved Jun'ichi's particular
problem much more satisfactorily than our own dabblings; but I rather
dropped out at that point.)

> Its very safe - default behaviour unchanged. 

Yes, but please update the comments to reflect that: they imply
vm_anon_lru is 0 by default, presumably how it was in Andrea's tree.

The tunability, of course, does unfairly make it look more like a
hack than it is; but if we're uncertain, yes, a tunable hack is
much better than a wrong decision now.

Hugh


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-05-26 12:41             ` Marcelo Tosatti
  2004-05-26 18:24               ` Marc-Christian Petersen
  2004-05-26 19:06               ` Hugh Dickins
@ 2004-05-26 22:23               ` Andrea Arcangeli
  2004-05-28  2:55               ` j-nomura
  3 siblings, 0 replies; 42+ messages in thread
From: Andrea Arcangeli @ 2004-05-26 22:23 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: j-nomura, linux-kernel, Andrew Morton, hugh

On Wed, May 26, 2004 at 09:41:04AM -0300, Marcelo Tosatti wrote:
> Andrea, Hugh, Jun'ichi,
> 
> I think we can merge this patch.
> 
> Its very safe - default behaviour unchanged. 

agreed. And from a stability standpoint it's very safe even when the
behaviour is changed with the non-default setting ;).

thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-05-26 18:24               ` Marc-Christian Petersen
@ 2004-05-27 11:16                 ` Marcelo Tosatti
  0 siblings, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2004-05-27 11:16 UTC (permalink / raw)
  To: Marc-Christian Petersen
  Cc: linux-kernel, j-nomura, andrea, Andrew Morton, hugh, riel

On Wed, May 26, 2004 at 08:24:34PM +0200, Marc-Christian Petersen wrote:
> On Wednesday 26 May 2004 14:41, Marcelo Tosatti wrote:
> 
> Marcelo,
> 
> > I think we can merge this patch.
> 
> I think this too =)
> 
> 
> > Its very safe - default behaviour unchanged.
> > Jun, are you willing to do another test for us if this gets merged
> > in v2.4.27-pre4 ?
> > Maybe we should document the VM tunables somewhere outside source code
> > (Documentation/) ?
> 
> I think we should merge the attached patches to finally remove utterly bogus 
> and non-existent documentation things and clean up stuff a bit and document 
> the -aa VM bits.
> 
> Agreed?
>
> Kinda same cleanups and more following soon for 2.6-mm.

Hi Marc, 

Looks ok for v2.4 -- would be good if Rik and Andrea
could go over it as well.

> --- a/Documentation/sysctl/vm.txt	2004-05-26 19:57:15.000000000 +0200
> +++ b/Documentation/sysctl/vm.txt	2004-05-26 20:06:20.000000000 +0200
> @@ -1,111 +1,143 @@
> -Documentation for /proc/sys/vm/*	kernel version 2.4.19
> -	(c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>
> +Documentation for /proc/sys/vm/*	Kernel version 2.4.26
> +=============================================================
>  
> -For general info and legal blurb, please look in README.
> + (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
> +    - Initial version
>  
> -==============================================================
> + (c) 2004, Marc-Christian Petersen <m.c.p@linux-systeme.com>
> +    - Removed non-existent knobs which were removed in early
> +      2.4 stages
> +    - Corrected values for bdflush
> +    - Documented missing tunables
> +    - Documented aa-vm tunables
> +
> +
> +
> +For general info and legal blurb, please look in README.
> +=============================================================
>  
>  This file contains the documentation for the sysctl files in
> -/proc/sys/vm and is valid for Linux kernel version 2.4.
> +/proc/sys/vm and is valid for Linux kernel v2.4.26.
>  
>  The files in this directory can be used to tune the operation
>  of the virtual memory (VM) subsystem of the Linux kernel, and
> -one of the files (bdflush) also has a little influence on disk
> -usage.
> +three of the files (bdflush, max-readahead, min-readahead)
> +also have some influence on disk usage.
>  
>  Default values and initialization routines for most of these
> -files can be found in mm/swap.c.
> +files can be found in mm/vmscan.c, mm/page_alloc.c and
> +mm/filemap.c.
>  
>  Currently, these files are in /proc/sys/vm:
>  - bdflush
> +- block_dump
>  - kswapd
> +- laptop_mode
> +- max-readahead
> +- min-readahead
>  - max_map_count
>  - overcommit_memory
>  - page-cluster
>  - pagetable_cache
> +- vm_anon_lru
> +- vm_cache_scan_ratio
> +- vm_gfp_debug
> +- vm_lru_balance_ratio
> +- vm_mapped_ratio
> +- vm_passes
> +- vm_vfs_scan_ratio
> +=============================================================
>  
> -==============================================================
>  
> -bdflush:
>  
> +bdflush:
> +--------
>  This file controls the operation of the bdflush kernel
>  daemon. The source code to this struct can be found in
> -linux/fs/buffer.c. It currently contains 9 integer values,
> +fs/buffer.c. It currently contains 9 integer values,
>  of which 6 are actually used by the kernel.
>  
> -From linux/fs/buffer.c:
> ---------------------------------------------------------------
> -union bdflush_param {
> -	struct {
> -		int nfract;	/* Percentage of buffer cache dirty to
> -				   activate bdflush */
> -		int ndirty;	/* Maximum number of dirty blocks to write out per
> -				   wake-cycle */
> -		int dummy2;	/* old "nrefill" */
> -		int dummy3;	/* unused */
> -		int interval;	/* jiffies delay between kupdate flushes */
> -		int age_buffer;	/* Time for normal buffer to age before we flush it */
> -		int nfract_sync;/* Percentage of buffer cache dirty to
> -				   activate bdflush synchronously */
> -		int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
> -		int dummy5;	/* unused */
> -	} b_un;
> -	unsigned int data[N_PARAM];
> -} bdf_prm = {{30, 500, 0, 0, 5*HZ, 30*HZ, 60, 20, 0}};
> ---------------------------------------------------------------
> -
> -int nfract:
> -The first parameter governs the maximum number of dirty
> -buffers in the buffer cache. Dirty means that the contents
> -of the buffer still have to be written to disk (as opposed
> -to a clean buffer, which can just be forgotten about).
> -Setting this to a high value means that Linux can delay disk
> -writes for a long time, but it also means that it will have
> -to do a lot of I/O at once when memory becomes short. A low
> -value will spread out disk I/O more evenly, at the cost of
> -more frequent I/O operations.  The default value is 30%,
> -the minimum is 0%, and the maximum is 100%.
> -
> -int ndirty:
> -The second parameter (ndirty) gives the maximum number of
> -dirty buffers that bdflush can write to the disk in one time.
> -A high value will mean delayed, bursty I/O, while a small
> -value can lead to memory shortage when bdflush isn't woken
> -up often enough.
> -
> -int interval:
> -The fifth parameter, interval, is the minimum rate at
> -which kupdate will wake and flush.  The value is expressed in
> -jiffies (clockticks), the number of jiffies per second is
> -normally 100 (Alpha is 1024). Thus, x*HZ is x seconds.  The
> -default value is 5 seconds, the minimum is 0 seconds, and the
> -maximum is 600 seconds.
> -
> -int age_buffer:
> -The sixth parameter, age_buffer, governs the maximum time
> -Linux waits before writing out a dirty buffer to disk.  The
> -value is in jiffies.  The default value is 30 seconds,
> -the minimum is 1 second, and the maximum 6,000 seconds.
> -
> -int nfract_sync:
> -The seventh parameter, nfract_sync, governs the percentage
> -of buffer cache that is dirty before bdflush activates
> -synchronously.  This can be viewed as the hard limit before
> -bdflush forces buffers to disk.  The default is 60%, the
> -minimum is 0%, and the maximum is 100%.
> -
> -int nfract_stop_bdflush:
> -The eighth parameter, nfract_stop_bdflush, governs the percentage
> -of buffer cache that is dirty which will stop bdflush.
> -The default is 20%, the miniumum is 0%, and the maxiumum is 100%.
> -==============================================================
> +nfract:		The first parameter governs the maximum
> +		number of dirty buffers in the buffer
> +		cache. Dirty means that the contents of the
> +		buffer still have to be written to disk (as
> +		opposed to a clean buffer, which can just be
> +		forgotten about). Setting this to a high
> +		value means that Linux can delay disk writes
> +		for a long time, but it also means that it
> +		will have to do a lot of I/O at once when
> +		memory becomes short. A low value will
> +		spread out disk I/O more evenly, at the cost
> +		of more frequent I/O operations. The default
> +		value is 30%, the minimum is 0%, and the
> +		maximum is 100%.
> +
> +ndirty:		The second parameter (ndirty) gives the
> +		maximum number of dirty buffers that bdflush
> +		can write to the disk in one time. A high
> +		value will mean delayed, bursty I/O, while a
> +		small value can lead to memory shortage when
> +		bdflush isn't woken up often enough. The
> +		default value is 500 dirty buffers, the
> +		minimum is 1, and the maximum is 50000.
> +
> +dummy2:		The third parameter is not used.
> +
> +dummy3:		The fourth parameter is not used.
> +
> +interval:	The fifth parameter, interval, is the minimum
> +		rate at which kupdate will wake and flush.
> +		The value is in jiffies (clockticks), the
> +		number of jiffies per second is normally 100
> +		(Alpha is 1024). Thus, x*HZ is x seconds. The
> +		default value is 5 seconds, the minimum	is 0
> +		seconds, and the maximum is 10,000 seconds.
> +
> +age_buffer:	The sixth parameter, age_buffer, governs the
> +		maximum time Linux waits before writing out a
> +		dirty buffer to disk. The value is in jiffies.
> +		The default value is 30 seconds, the minimum
> +		is 1 second, and the maximum 10,000 seconds.
> +
> +sync:		The seventh parameter, nfract_sync, governs
> +		the percentage of buffer cache that is dirty
> +		before bdflush activates synchronously. This
> +		can be viewed as the hard limit before
> +		bdflush forces buffers to disk. The default
> +		is 60%,	the minimum is 0%, and the maximum
> +		is 100%.
> +
> +stop_bdflush:	The eighth parameter, nfract_stop_bdflush,
> +		governs the percentage of buffer cache that
> +		is dirty which will stop bdflush. The default
> +		is 20%, the minimum is 0%, and the maximum
> +		is 100%.
> +
> +dummy5:		The ninth parameter is not used.
> +
> +So the default is: 30 500 0 0 500 3000 60 20 0   for 100 HZ.
> +=============================================================
> +
> +
> +
> +block_dump:
> +-----------
> +It can happen that the disk still keeps spinning up and you
> +don't quite know why or what causes it. The laptop mode patch
> +has a little helper for that as well. When set to 1, it will
> +dump info to the kernel message buffer about what process
> +caused the io. Be careful when playing with this setting.
> +It is advisable to shut down syslog first! The default is 0.
> +=============================================================
> +
>  
> -kswapd:
>  
> +kswapd:
> +-------
>  Kswapd is the kernel swapout daemon. That is, kswapd is that
>  piece of the kernel that frees memory when it gets fragmented
> -or full. Since every system is different, you'll probably want
> -some control over this piece of the system.
> +or full. Since every system is different, you'll probably
> +want some control over this piece of the system.
>  
>  The numbers in this page correspond to the numbers in the
>  struct pager_daemon {tries_base, tries_min, swap_cluster
> @@ -117,39 +149,83 @@ tries_base	The maximum number of pages k
>  		number. Usually this number will be divided
>  		by 4 or 8 (see mm/vmscan.c), so it isn't as
>  		big as it looks.
> -		When you need to increase the bandwidth to/from
> -		swap, you'll want to increase this number.
> +		When you need to increase the bandwidth to/
> +		from swap, you'll want to increase this
> +		number.
> +
>  tries_min	This is the minimum number of times kswapd
>  		tries to free a page each time it is called.
>  		Basically it's just there to make sure that
>  		kswapd frees some pages even when it's being
>  		called with minimum priority.
> +
>  swap_cluster	This is the number of pages kswapd writes in
>  		one turn. You want this large so that kswapd
>  		does it's I/O in large chunks and the disk
> -		doesn't have to seek often, but you don't want
> -		it to be too large since that would flood the
> -		request queue.
> +		doesn't have to seek often, but you don't
> +		want it to be too large since that would
> +		flood the request queue.
> +
> +The default value is: 512 32 8.
> +=============================================================
>  
> -==============================================================
>  
> -overcommit_memory:
>  
> -This value contains a flag that enables memory overcommitment.
> -When this flag is 0, the kernel checks before each malloc()
> -to see if there's enough memory left. If the flag is nonzero,
> -the system pretends there's always enough memory.
> +laptop_mode:
> +------------
> +Setting this to 1 switches the vm (and block layer) to laptop
> +mode. Leaving it at 0 makes the kernel work as before. When
> +in laptop mode, you also want to extend the intervals
> +described in Documentation/laptop-mode.txt.
> +See the laptop-mode.sh script for how to do that.
> +
> +The default value is 0.
> +=============================================================
>  
> -This feature can be very useful because there are a lot of
> -programs that malloc() huge amounts of memory "just-in-case"
> -and don't use much of it.
>  
> -Look at: mm/mmap.c::vm_enough_memory() for more information.
>  
> -==============================================================
> +max-readahead:
> +--------------
> +This tunable affects how early the Linux VFS will fetch the
> +next block of a file from memory. File readahead values are
> +determined on a per file basis in the VFS and are adjusted
> +based on the behavior of the application accessing the file.
> +Anytime the current position being read in a file plus the
> +current read ahead value results in the file pointer pointing
> +to the next block in the file, that block will be fetched
> +from disk. By raising this value, the Linux kernel will allow
> +the readahead value to grow larger, resulting in more blocks
> +being prefetched from disk for files which are accessed in a
> +predictable, uniform linear fashion. This can result in performance
> +improvements, but can also result in excess (and often
> +unnecessary) memory usage. Lowering this value has the
> +opposite effect. By forcing readaheads to be less aggressive,
> +memory may be conserved at a potential performance impact.
> +
> +The default value is 31.
> +=============================================================
>  
> -max_map_count:
>  
> +
> +min-readahead:
> +--------------
> +Like max-readahead, min-readahead places a floor on the
> +readahead value. Raising this number forces a file's readahead
> +value to be unconditionally higher, which can bring about
> +performance improvements, provided that all file access in
> +the system is predictably linear from the start to the end of
> +a file. This of course results in higher memory usage from
> +the pagecache. Conversely, lowering this value allows the
> +kernel to conserve pagecache memory, at a potential
> +performance cost.
> +
> +The default value is 3.
> +=============================================================
> +
> +
> +
> +max_map_count:
> +--------------
>  This file contains the maximum number of memory map areas a
>  process may have. Memory map areas are used as a side-effect
>  of calling malloc, directly by mmap and mprotect, and also
> @@ -159,10 +235,29 @@ While most applications need less than a
>  certain programs, particularly malloc debuggers, may consume 
>  lots of them, e.g. up to one or two maps per allocation.
>  
> -==============================================================
> +The default value is 65536.
> +=============================================================
> +
> +
> +
> +overcommit_memory:
> +------------------
> +This value contains a flag to enable memory overcommitment.
> +When this flag is 0, the kernel checks before each malloc()
> +to see if there's enough memory left. If the flag is nonzero,
> +the system pretends there's always enough memory.
> +
> +This feature can be very useful because there are a lot of
> +programs that malloc() huge amounts of memory "just-in-case"
> +and don't use much of it. The default value is 0.
> +
> +Look at: mm/mmap.c::vm_enough_memory() for more information.
> +=============================================================
> +
>  
> -page-cluster:
>  
> +page-cluster:
> +-------------
>  The Linux VM subsystem avoids excessive disk seeks by reading
>  multiple pages on a page fault. The number of pages it reads
>  is dependent on the amount of memory in your machine.
> @@ -170,11 +265,12 @@ is dependent on the amount of memory in 
>  The number of pages the kernel reads in at once is equal to
>  2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
>  for swap because we only cluster swap data in 32-page groups.
> +=============================================================
>  
> -==============================================================
>  
> -pagetable_cache:
>  
> +pagetable_cache:
> +----------------
>  The kernel keeps a number of page tables in a per-processor
>  cache (this helps a lot on SMP systems). The cache size for
>  each processor will be between the low and the high value.
> @@ -188,3 +284,98 @@ For large systems, the settings are prob
>  systems they won't hurt a bit. For small systems (<16MB ram)
>  it might be advantageous to set both values to 0.
>  
> +The default value is: 25 50.
> +=============================================================
> +
> +
> +
> +vm_anon_lru:
> +------------
> +selects whether to immediately insert anon pages in the lru.
> +Immediately means as soon as they're allocated during the page
> +faults. If this is set to 0, they're inserted only after the
> +first swapout.
> +
> +Having anon pages immediately inserted in the lru allows the
> +VM to know better when it's worthwhile to start swapping
> +anonymous ram; it will start to swap earlier and it should
> +swap more smoothly and faster, but it will decrease scalability
> +on the >16-ways by an order of magnitude. Big SMP/NUMA machines
> +definitely can't take a hit on a global spinlock at
> +every anon page allocation.
> +
> +Low-ram machines that swap all the time want to turn
> +this on (i.e. set it to 1).
> +
> +The default value is 1.
> +=============================================================
> +
> +
> +
> +vm_cache_scan_ratio:
> +--------------------
> +is how much of the inactive LRU queue we will scan in one go.
> +A value of 6 for vm_cache_scan_ratio implies that we'll scan
> +1/6 of the inactive lists during a normal aging round.
> +
> +The default value is 6.
> +=============================================================
> +
> +
> +
> +vm_gfp_debug:
> +------------
> +is when __alloc_pages fails, dump us a stack. This will
> +mostly happen during OOM conditions (hopefully ;)
> +
> +The default value is 0.
> +=============================================================
> +
> +
> +
> +vm_lru_balance_ratio:
> +---------------------
> +controls the balance between active and inactive cache. The
> +bigger vm_balance is, the easier the active cache will grow,
> +because we'll rotate the active list slowly. A value of 2
> +means we'll go towards a balance of 1/3 of the cache being
> +inactive.
> +
> +The default value is 2.
> +=============================================================
> +
> +
> +
> +vm_mapped_ratio:
> +----------------
> +controls the pageout rate; the smaller it is, the earlier we'll
> +start to page out.
> +
> +The default value is 100.
> +=============================================================
> +
> +
> +
> +vm_passes:
> +----------
> +is the number of vm passes before failing the memory
> +balancing. Take into account that 3 passes are needed for a
> +flush/wait/free cycle and that we only scan
> +1/vm_cache_scan_ratio of the inactive list at each pass.
> +
> +The default value is 60.
> +=============================================================
> +
> +
> +
> +vm_vfs_scan_ratio:
> +------------------
> +is what proportion of the VFS queues we will scan in one go.
> +A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
> +unused-inode, dentry and dquot caches will be freed during a
> +normal aging round.
> +Big fileservers (NFS, SMB etc.) probably want to set this
> +value to 3 or 2.
> +
> +The default value is 6.
> +=============================================================
> --- a/Documentation/filesystems/proc.txt	2004-05-23 00:08:31.000000000 +0200
> +++ b/Documentation/filesystems/proc.txt	2004-05-23 02:33:41.000000000 +0200
> @@ -936,172 +936,7 @@ program to load modules on demand.
>  
>  2.4 /proc/sys/vm - The virtual memory subsystem
>  -----------------------------------------------
> -
> -The files  in  this directory can be used to tune the operation of the virtual
> -memory (VM)  subsystem  of  the  Linux  kernel.  In addition, one of the files
> -(bdflush) has some influence on disk usage.
> -
> -bdflush
> --------
> -
> -This file  controls  the  operation of the bdflush kernel daemon. It currently
> -contains nine  integer  values,  six of which are actually used by the kernel.
> -They are listed in table 2-2.
> -
> -
> -Table 2-2: Parameters in /proc/sys/vm/bdflush 
> -..............................................................................
> - Value      Meaning                                                            
> - nfract     Percentage of buffer cache dirty to activate bdflush              
> - ndirty     Maximum number of dirty blocks to  write out per wake-cycle        
> - dummy      Unused                                                             
> - dummy      Unused                                                             
> - interval   jiffies delay between kupdate flushes
> - age_buffer Time for normal buffer to age before we flush it                   
> - nfract_sync Percentage of buffer cache dirty to activate bdflush synchronously
> - nfract_stop_bdflush Percetange of buffer cache dirty to stop bdflush
> - dummy      Unused                                                             
> -..............................................................................
> -
> -nfract
> -------
> -
> -This parameter  governs  the  maximum  number  of  dirty buffers in the buffer
> -cache. Dirty means that the contents of the buffer still have to be written to
> -disk (as  opposed  to  a  clean  buffer,  which  can just be forgotten about).
> -Setting this  to  a  higher value means that Linux can delay disk writes for a
> -long time, but it also means that it will have to do a lot of I/O at once when
> -memory becomes short. A lower value will spread out disk I/O more evenly.
> -
> -interval
> ---------
> -
> -The interval between two kupdate runs. The value is expressed in
> -jiffies (clockticks),  the  number of jiffies per second is 100.
> -
> -ndirty
> -------
> -
> -Ndirty gives the maximum number of dirty buffers that bdflush can write to the
> -disk at  one  time.  A high value will mean delayed, bursty I/O, while a small
> -value can lead to memory shortage when bdflush isn't woken up often enough.
> -
> -age_buffer
> -----------
> -
> -Finally, the age_buffer parameter govern the maximum time Linux
> -waits before  writing  out  a  dirty buffer to disk. The value is expressed in
> -jiffies (clockticks),  the  number of jiffies per second is 100.
> -
> -nfract_sync
> ------------
> -
> -nfract_stop_bdflush
> --------------------
> -
> -kswapd
> -------
> -
> -Kswapd is  the  kernel  swap  out daemon. That is, kswapd is that piece of the
> -kernel that  frees  memory when it gets fragmented or full. Since every system
> -is different, you'll probably want some control over this piece of the system.
> -
> -The file contains three numbers:
> -
> -tries_base
> -----------
> -
> -The maximum  number  of  pages kswapd tries to free in one round is calculated
> -from this  number.  Usually  this  number  will  be  divided  by  4  or 8 (see
> -mm/vmscan.c), so it isn't as big as it looks.
> -
> -When you  need to increase the bandwidth to/from swap, you'll want to increase
> -this number.
> -
> -tries_min
> ----------
> -
> -This is  the  minimum number of times kswapd tries to free a page each time it
> -is called. Basically it's just there to make sure that kswapd frees some pages
> -even when it's being called with minimum priority.
> -
> -overcommit_memory
> ------------------
> -
> -This file  contains  one  value.  The following algorithm is used to decide if
> -there's enough  memory:  if  the  value of overcommit_memory is positive, then
> -there's always  enough  memory. This is a useful feature, since programs often
> -malloc() huge  amounts  of  memory 'just in case', while they only use a small
> -part of  it.  Leaving  this value at 0 will lead to the failure of such a huge
> -malloc(), when in fact the system has enough memory for the program to run.
> -
> -On the  other  hand,  enabling this feature can cause you to run out of memory
> -and thrash the system to death, so large and/or important servers will want to
> -set this value to 0.
> -
> -pagetable_cache
> ----------------
> -
> -The kernel  keeps a number of page tables in a per-processor cache (this helps
> -a lot  on  SMP systems). The cache size for each processor will be between the
> -low and the high value.
> -
> -On a  low-memory,  single  CPU system, you can safely set these values to 0 so
> -you don't  waste  memory.  It  is  used  on SMP systems so that the system can
> -perform fast  pagetable allocations without having to acquire the kernel memory
> -lock.
> -
> -For large  systems,  the  settings  are probably fine. For normal systems they
> -won't hurt  a  bit.  For  small  systems  (  less  than  16MB ram) it might be
> -advantageous to set both values to 0.
> -
> -swapctl
> --------
> -
> -This file  contains  no less than 8 variables. All of these values are used by
> -kswapd.
> -
> -The first four variables
> -* sc_max_page_age,
> -* sc_page_advance,
> -* sc_page_decline and
> -* sc_page_initial_age
> -are used  to  keep  track  of  Linux's page aging. Page aging is a bookkeeping
> -method to  track  which pages of memory are often used, and which pages can be
> -swapped out without consequences.
> -
> -When a  page  is  swapped in, it starts at sc_page_initial_age (default 3) and
> -when the  page  is  scanned  by  kswapd,  its age is adjusted according to the
> -following scheme:
> -
> -* If  the  page  was used since the last time we scanned, its age is increased
> -  by sc_page_advance  (default  3).  Where  the  maximum  value  is  given  by
> -  sc_max_page_age (default 20).
> -* Otherwise  (meaning  it wasn't used) its age is decreased by sc_page_decline
> -  (default 1).
> -
> -When a page reaches age 0, it's ready to be swapped out.
> -
> -The variables  sc_age_cluster_fract, sc_age_cluster_min, sc_pageout_weight and
> -sc_bufferout_weight, can  be  used  to  control  kswapd's  aggressiveness  in
> -swapping out pages.
> -
> -Sc_age_cluster_fract is used to calculate how many pages from a process are to
> -be scanned by kswapd. The formula used is
> -
> -(sc_age_cluster_fract divided by 1024) times resident set size
> -
> -So if you want kswapd to scan the whole process, sc_age_cluster_fract needs to
> -have a  value  of  1024.  The  minimum  number  of  pages  kswapd will scan is
> -represented by sc_age_cluster_min, which is done so that kswapd will also scan
> -small processes.
> -
> -The values  of  sc_pageout_weight  and sc_bufferout_weight are used to control
> -how many  tries  kswapd  will make in order to swap out one page/buffer. These
> -values can  be used to fine-tune the ratio between user pages and buffer/cache
> -memory. When  you find that your Linux system is swapping out too many process
> -pages in  order  to  satisfy  buffer  memory  demands,  you may want to either
> -increase sc_bufferout_weight, or decrease the value of sc_pageout_weight.
> +Please read Documentation/sysctl/vm.txt
>  
>  2.5 /proc/sys/dev - Device specific parameters
>  ----------------------------------------------
> @@ -1719,10 +1719,3 @@ need to  recompile  the kernel, or even 
>  command to write value into these files, thereby changing the default settings
>  of the kernel.
>  ------------------------------------------------------------------------------
> -
> -
> -
> -
> -
> -
> -

> --- a/Documentation/sysctl/vm.txt	2002-11-28 16:53:08.000000000 -0700
> +++ b/Documentation/sysctl/vm.txt	2003-11-12 17:35:11.000000000 -0700
> @@ -18,13 +18,10 @@
>  
>  Currently, these files are in /proc/sys/vm:
>  - bdflush
> -- buffermem
> -- freepages
>  - kswapd
>  - max_map_count
>  - overcommit_memory
>  - page-cluster
> -- pagecache
>  - pagetable_cache
>  
>  ==============================================================
> @@ -102,38 +99,6 @@
>  of buffer cache that is dirty which will stop bdflush.
>  The default is 20%, the miniumum is 0%, and the maxiumum is 100%.
>  ==============================================================
> -buffermem:
> -
> -The three values in this file correspond to the values in
> -the struct buffer_mem. It controls how much memory should
> -be used for buffer memory. The percentage is calculated
> -as a percentage of total system memory.
> -
> -The values are:
> -min_percent	-- this is the minimum percentage of memory
> -		   that should be spent on buffer memory
> -borrow_percent  -- UNUSED
> -max_percent     -- UNUSED
> -
> -==============================================================
> -freepages:
> -
> -This file contains the values in the struct freepages. That
> -struct contains three members: min, low and high.
> -
> -The meaning of the numbers is:
> -
> -freepages.min	When the number of free pages in the system
> -		reaches this number, only the kernel can
> -		allocate more memory.
> -freepages.low	If the number of free pages gets below this
> -		point, the kernel starts swapping aggressively.
> -freepages.high	The kernel tries to keep up to this amount of
> -		memory free; if memory comes below this point,
> -		the kernel gently starts swapping in the hopes
> -		that it never has to do real aggressive swapping.
> -
> -==============================================================
>  
>  kswapd:
>  
> @@ -208,24 +173,6 @@
>  
>  ==============================================================
>  
> -pagecache:
> -
> -This file does exactly the same as buffermem, only this
> -file controls the struct page_cache, and thus controls
> -the amount of memory used for the page cache.
> -
> -In 2.2, the page cache is used for 3 main purposes:
> -- caching read() data from files
> -- caching mmap()ed data and executable files
> -- swap cache
> -
> -When your system is both deep in swap and high on cache,
> -it probably means that a lot of the swapped data is being
> -cached, making for more efficient swapping than possible
> -with the 2.0 kernel.
> -
> -==============================================================
> -
>  pagetable_cache:
>  
>  The kernel keeps a number of page tables in a per-processor
> --- a/Documentation/filesystems/proc.txt	2004-05-21 22:54:13.000000000 +0200
> +++ b/Documentation/filesystems/proc.txt	2004-05-23 00:08:09.000000000 +0200
> @@ -999,54 +999,6 @@ nfract_sync
>  nfract_stop_bdflush
>  -------------------
>  
> -buffermem
> ----------
> -
> -The three  values  in  this  file  control  how much memory should be used for
> -buffer memory.  The  percentage  is calculated as a percentage of total system
> -memory.
> -
> -The values are:
> -
> -min_percent
> ------------
> -
> -This is  the  minimum  percentage  of  memory  that  should be spent on buffer
> -memory.
> -
> -borrow_percent
> ---------------
> -
> -When Linux is short on memory, and the buffer cache uses more than it has been
> -allotted, the  memory  management  (MM)  subsystem will prune the buffer cache
> -more heavily than other memory to compensate.
> -
> -max_percent
> ------------
> -
> -This is the maximum amount of memory that can be used for buffer memory.
> -
> -freepages
> ----------
> -
> -This file contains three values: min, low and high:
> -
> -min
> ----
> -When the  number  of  free  pages  in the system reaches this number, only the
> -kernel can allocate more memory.
> -
> -low
> ----
> -If the number of free pages falls below this point, the kernel starts swapping
> -aggressively.
> -
> -high
> -----
> -The kernel  tries  to  keep  up to this amount of memory free; if memory falls
> -below this point, the kernel starts gently swapping in the hopes that it never
> -has to do really aggressive swapping.
> -
>  kswapd
>  ------
>  
> @@ -1073,16 +1025,6 @@ This is  the  minimum number of times ks
>  is called. Basically it's just there to make sure that kswapd frees some pages
>  even when it's being called with minimum priority.
>  
> -swap_cluster
> -------------
> -
> -This is probably the greatest influence on system performance.
> -
> -swap_cluster is  the  number  of  pages kswapd writes in one turn. You'll want
> -this value  to  be  large  so that kswapd does its I/O in large chunks and the
> -disk doesn't  have  to  seek  as  often, but you don't want it to be too large
> -since that would flood the request queue.
> -
>  overcommit_memory
>  -----------------
>  
> @@ -1097,15 +1039,6 @@ On the  other  hand,  enabling this feat
>  and thrash the system to death, so large and/or important servers will want to
>  set this value to 0.
>  
> -pagecache
> ----------
> -
> -This file  does exactly the same job as buffermem, only this file controls the
> -amount of memory allowed for memory mapping and generic caching of files.
> -
> -You don't  want  the  minimum level to be too low, otherwise your system might
> -thrash when memory is tight or fragmentation is high.
> -
>  pagetable_cache
>  ---------------
>  


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [2.4] heavy-load under swap space shortage
  2004-05-26 12:41             ` Marcelo Tosatti
                                 ` (2 preceding siblings ...)
  2004-05-26 22:23               ` Andrea Arcangeli
@ 2004-05-28  2:55               ` j-nomura
  3 siblings, 0 replies; 42+ messages in thread
From: j-nomura @ 2004-05-28  2:55 UTC (permalink / raw)
  To: marcelo.tosatti; +Cc: j-nomura, linux-kernel, andrea, akpm, hugh

Hi Marcelo,

> I think we can merge this patch.
> 
> Its very safe - default behaviour unchanged. 

Yes.

> Jun, are you willing to do another test for us if this gets merged
> in v2.4.27-pre4 ?

Yes. I'll try.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Lazily add anonymous pages to LRU on v2.4? was Re: [2.4] heavy-load under swap space shortage
  2004-03-14 23:22                   ` Andrew Morton
  2004-03-15  0:14                     ` Andrea Arcangeli
  2004-03-16  6:31                     ` Marcelo Tosatti
@ 2004-11-22 15:01                     ` Marcelo Tosatti
  2004-11-22 19:49                       ` Andrea Arcangeli
  2 siblings, 1 reply; 42+ messages in thread
From: Marcelo Tosatti @ 2004-11-22 15:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, j-nomura, linux-kernel, riel, Hugh Dickins

On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > > 
> > > Having a magic knob is a weak solution: the majority of people who are
> > > affected by this problem won't know to turn it on.
> > 
> > that's why I turned it _on_ by default in my tree ;)
> 
> So maybe Marcelo should apply this patch, and also turn it on by default.

I've been pondering this again for 2.4.29pre - the thing I'm not sure about is
what negative effect will be caused by not adding anonymous pages to the LRU
immediately on creation.

The scanning algorithm will apply more pressure to pagecache pages initially
(which are on the LRU) - but that is _hopefully_ no problem because swap_out()
will kick in soon, moving anon pages to the LRU as soon as they are
swap-allocated.

I'm afraid that might be a significant problem for some workloads. No?

Marc-Christian Petersen claims it improves behaviour for him - how so, Marc,
and what is your workload/hardware description?

This is known to decrease contention on pagemap_lru_lock.
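
To make the idea concrete, here is a rough sketch of the eager vs. lazy
behaviour in the anonymous fault path - an illustration only, not the
actual patch text:

	/* do_anonymous_page(), write-fault side -- sketch, not verbatim */
	page = alloc_page(GFP_HIGHUSER);
	clear_user_highpage(page, addr);
	if (vm_anon_lru)
		/* eager: the page is visible to the LRU scanner right
		 * away, at the cost of pagemap_lru_lock on every fault */
		lru_cache_add(page);
	/* lazy (vm_anon_lru == 0): the page reaches the LRU only later,
	 * via the swap cache, once swap_out() assigns it swap space */
	set_pte(page_table, pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))));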

Guys, do you have any further thoughts on this?
I think I'll give it a shot on 2.4.29-pre?

> > There are workloads where adding anonymous pages to the lru is
> > suboptimal for both the vm (cache shrinking) and the fast path too
> > (lru_cache_add), not sure how 2.6 optimizes those bits, since with 2.6
> > you're forced to add those pages to the lru somehow and that implies
> > some form of locking.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Lazily add anonymous pages to LRU on v2.4? was Re: [2.4] heavy-load under swap space shortage
  2004-11-22 19:49                       ` Andrea Arcangeli
@ 2004-11-22 15:58                         ` Marcelo Tosatti
  0 siblings, 0 replies; 42+ messages in thread
From: Marcelo Tosatti @ 2004-11-22 15:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, j-nomura, linux-kernel, riel, Hugh Dickins

On Mon, Nov 22, 2004 at 08:49:53PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 22, 2004 at 01:01:38PM -0200, Marcelo Tosatti wrote:
> > On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> > > Andrea Arcangeli <andrea@suse.de> wrote:
> > > >
> > > > > 
> > > > > Having a magic knob is a weak solution: the majority of people who are
> > > > > affected by this problem won't know to turn it on.
> > > > 
> > > > that's why I turned it _on_ by default in my tree ;)
> > > 
> > > So maybe Marcelo should apply this patch, and also turn it on by default.
> > 
> > I've been pondering this again for 2.4.29pre - the thing I'm not sure about is
> > what negative effect will be caused by not adding anonymous pages to the LRU
> > immediately on creation.
> > 
> > The scanning algorithm will apply more pressure to pagecache pages initially
> > (which are on the LRU) - but that is _hopefully_ no problem because swap_out()
> > will kick in soon, moving anon pages to the LRU as soon as they are swap-allocated.
> > 
> > I'm afraid that might be a significant problem for some workloads. No?
> > 
> > Marc-Christian Petersen claims it improves behaviour for him - how so, Marc,
> > and what is your workload/hardware description? 
> > 
> > This is known to decrease contention on pagemap_lru_lock.
> > 
> > Guys, do you have any further thoughts on this?
> > I think I'll give it a shot on 2.4.29-pre?
> 
> I think you mean the one-liner patch that avoids the lru_cache_add
> during anonymous page allocation (you didn't quote it, and I can't see
> the start of the thread). I developed that patch for 2.4-aa and have been
> using it for years; it runs in all the latest SLES8 kernels too, and
> 2.4-aa is the only kernel I'm sure can sustain certain extreme VM loads
> with heavy swapping of shmfs during heavy I/O. So you can apply it
> safely, I think.

Yes, it is your patch I am talking about, Andrea. OK, good to hear that.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Lazily add anonymous pages to LRU on v2.4? was Re: [2.4] heavy-load under swap space shortage
  2004-11-22 15:01                     ` Lazily add anonymous pages to LRU on v2.4? was " Marcelo Tosatti
@ 2004-11-22 19:49                       ` Andrea Arcangeli
  2004-11-22 15:58                         ` Marcelo Tosatti
  0 siblings, 1 reply; 42+ messages in thread
From: Andrea Arcangeli @ 2004-11-22 19:49 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, j-nomura, linux-kernel, riel, Hugh Dickins

On Mon, Nov 22, 2004 at 01:01:38PM -0200, Marcelo Tosatti wrote:
> On Sun, Mar 14, 2004 at 03:22:53PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > > 
> > > > Having a magic knob is a weak solution: the majority of people who are
> > > > affected by this problem won't know to turn it on.
> > > 
> > > that's why I turned it _on_ by default in my tree ;)
> > 
> > So maybe Marcelo should apply this patch, and also turn it on by default.
> 
> I've been pondering this again for 2.4.29pre - the thing I'm not sure about is
> what negative effect will be caused by not adding anonymous pages to the LRU
> immediately on creation.
> 
> The scanning algorithm will apply more pressure to pagecache pages initially
> (which are on the LRU) - but that is _hopefully_ no problem because swap_out()
> will kick in soon, moving anon pages to the LRU as soon as they are swap-allocated.
> 
> I'm afraid that might be a significant problem for some workloads. No?
> 
> Marc-Christian Petersen claims it improves behaviour for him - how so, Marc,
> and what is your workload/hardware description? 
> 
> This is known to decrease contention on pagemap_lru_lock.
> 
> Guys, do you have any further thoughts on this?
> I think I'll give it a shot on 2.4.29-pre?

I think you mean the one-liner patch that avoids the lru_cache_add
during anonymous page allocation (you didn't quote it, and I can't see
the start of the thread). I developed that patch for 2.4-aa and have been
using it for years; it runs in all the latest SLES8 kernels too, and
2.4-aa is the only kernel I'm sure can sustain certain extreme VM loads
with heavy swapping of shmfs during heavy I/O. So you can apply it
safely, I think.
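
For completeness, the place where a lazily handled anon page finally
joins the LRU is the swap-allocation step in try_to_swap_out(); a
from-memory sketch, not verbatim 2.4 code:

	entry = get_swap_page();
	if (!entry.val)
		break;		/* no swap left, give up on this page */
	if (add_to_swap_cache(page, entry) == 0) {
		/* add_to_swap_cache() ends up calling lru_cache_add(),
		 * so with vm_anon_lru == 0 this is where the page
		 * first hits the LRU */
		SetPageUptodate(page);
		set_page_dirty(page);
		set_pte(page_table, swp_entry_to_pte(entry));
	}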

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2004-11-22 20:30 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-02-02 10:12 [2.4] heavy-load under swap space shortage j-nomura
2004-02-02 13:29 ` Hugh Dickins
2004-02-03  7:53   ` j-nomura
2004-02-03 17:19     ` Hugh Dickins
2004-02-04 11:40       ` j-nomura
2004-02-05 18:42         ` Hugh Dickins
2004-02-06  9:03           ` j-nomura
2004-03-10 10:57           ` j-nomura
2004-03-14 19:47             ` Marcelo Tosatti
2004-03-14 19:54               ` Rik van Riel
2004-03-14 20:15               ` Andrew Morton
     [not found]                 ` <20040314230138.GV30940@dualathlon.random>
2004-03-14 23:22                   ` Andrew Morton
2004-03-15  0:14                     ` Andrea Arcangeli
2004-03-15  4:38                       ` Nick Piggin
2004-03-15 11:49                         ` Andrea Arcangeli
2004-03-15 13:23                           ` Rik van Riel
2004-03-15 14:37                             ` Nick Piggin
2004-03-15 14:50                               ` Andrea Arcangeli
2004-03-15 18:35                                 ` Andrew Morton
2004-03-15 18:51                                   ` Andrea Arcangeli
2004-03-15 19:02                                     ` Andrew Morton
2004-03-15 21:55                                       ` Andrea Arcangeli
2004-03-15 22:05                                 ` Nick Piggin
2004-03-15 22:24                                   ` Andrea Arcangeli
2004-03-15 22:41                                     ` Nick Piggin
2004-03-15 22:44                                       ` Andrea Arcangeli
2004-03-15 22:41                                     ` Rik van Riel
2004-03-15 23:32                                       ` Andrea Arcangeli
2004-03-16  6:27                                         ` Nick Piggin
2004-03-16  7:25                                   ` Marcelo Tosatti
2004-03-16  6:31                     ` Marcelo Tosatti
2004-03-16 13:47                       ` Andrea Arcangeli
2004-03-16 16:59                         ` Marcelo Tosatti
2004-11-22 15:01                     ` Lazily add anonymous pages to LRU on v2.4? was " Marcelo Tosatti
2004-11-22 19:49                       ` Andrea Arcangeli
2004-11-22 15:58                         ` Marcelo Tosatti
2004-05-26 12:41             ` Marcelo Tosatti
2004-05-26 18:24               ` Marc-Christian Petersen
2004-05-27 11:16                 ` Marcelo Tosatti
2004-05-26 19:06               ` Hugh Dickins
2004-05-26 22:23               ` Andrea Arcangeli
2004-05-28  2:55               ` j-nomura

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.