linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 00/20] VM pageout scalability improvements
@ 2007-12-18 21:15 Rik van Riel
  2007-12-18 21:15 ` [patch 01/20] convert anon_vma list lock to a read/write lock Rik van Riel
                   ` (20 more replies)
  0 siblings, 21 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn

On large memory systems, the VM can spend far too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does this use up CPU time, it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem-backed, swap-backed and non-reclaimable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory (sketched below)

3) switching to SEQ replacement for the anonymous LRUs, so the
   number of pages that need to be scanned when the system
   starts swapping is bounded by a reasonable number
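
As a rough sketch of where item 2 is headed (the index names here are
illustrative only; the real definitions come from patches 07/08 and the
noreclaim patches), each zone ends up with an indexed array of LRU
lists, one pair per page type plus a non-reclaimable list:

enum lru_list {
	LRU_INACTIVE_ANON,	/* swap backed, inactive */
	LRU_ACTIVE_ANON,	/* swap backed, active */
	LRU_INACTIVE_FILE,	/* filesystem backed, inactive */
	LRU_ACTIVE_FILE,	/* filesystem backed, active */
	LRU_NORECLAIM,		/* cannot or should not be evicted */
	NR_LRU_LISTS
};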

The noreclaim patches come verbatim from Lee Schermerhorn and
Nick Piggin.  I have not taken a detailed look at them yet; all
I have done is fix the rejects against the latest -mm kernel.

I am posting this series now because I would like to get more
feedback, while I am studying and improving the noreclaim patches
myself.

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 01/20] convert anon_vma list lock to a read/write lock
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-20  7:07   ` Christoph Lameter
  2007-12-18 21:15 ` [patch 02/20] make the inode i_mmap_lock a reader/writer lock Rik van Riel
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: make-anon_vma-lock-rw.patch --]
[-- Type: text/plain, Size: 5429 bytes --]

Make the anon_vma list lock a read/write lock.  The heaviest use of
this lock is in the page_referenced()/try_to_unmap() calls from vmscan
[shrink_page_list()].  These functions can use a read lock to allow
some parallelism for different CPUs trying to reclaim pages mapped
via the same set of vmas.

This change should not alter the footprint of the anon_vma in the
non-debug case.
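
For illustration, a minimal reader-side sketch (a hypothetical helper,
not part of the patch): with the rwlock, several CPUs running reclaim
can walk the same anon_vma's vma list concurrently, where the old
spinlock serialized them:

static int count_anon_mappings(struct anon_vma *anon_vma)
{
	struct vm_area_struct *vma;
	int nr = 0;

	read_lock(&anon_vma->rwlock);	/* shared; writers still excluded */
	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
		nr++;			/* read-only walk of the vma list */
	read_unlock(&anon_vma->rwlock);

	return nr;
}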

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: Linux/include/linux/rmap.h
===================================================================
--- Linux.orig/include/linux/rmap.h	2007-11-28 10:54:36.000000000 -0500
+++ Linux/include/linux/rmap.h	2007-11-28 11:12:37.000000000 -0500
@@ -25,7 +25,7 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	rwlock_t rwlock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };
 
@@ -43,18 +43,21 @@ static inline void anon_vma_free(struct 
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+/*
+ * This needs to be a write lock for __vma_link()
+ */
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		write_lock(&anon_vma->rwlock);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		write_unlock(&anon_vma->rwlock);
 }
 
 /*
Index: Linux/mm/rmap.c
===================================================================
--- Linux.orig/mm/rmap.c	2007-11-28 10:54:37.000000000 -0500
+++ Linux/mm/rmap.c	2007-11-28 11:12:37.000000000 -0500
@@ -25,7 +25,7 @@
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_lock
- *         anon_vma->lock
+ *         anon_vma->rwlock
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
  *             swap_lock (in swap_duplicate, swap_info_get)
@@ -68,7 +68,7 @@ int anon_vma_prepare(struct vm_area_stru
 		if (anon_vma) {
 			allocated = NULL;
 			locked = anon_vma;
-			spin_lock(&locked->lock);
+			write_lock(&locked->rwlock);
 		} else {
 			anon_vma = anon_vma_alloc();
 			if (unlikely(!anon_vma))
@@ -87,7 +87,7 @@ int anon_vma_prepare(struct vm_area_stru
 		spin_unlock(&mm->page_table_lock);
 
 		if (locked)
-			spin_unlock(&locked->lock);
+			write_unlock(&locked->rwlock);
 		if (unlikely(allocated))
 			anon_vma_free(allocated);
 	}
@@ -113,9 +113,9 @@ void anon_vma_link(struct vm_area_struct
 	struct anon_vma *anon_vma = vma->anon_vma;
 
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		write_lock(&anon_vma->rwlock);
 		list_add_tail(&vma->anon_vma_node, &anon_vma->head);
-		spin_unlock(&anon_vma->lock);
+		write_unlock(&anon_vma->rwlock);
 	}
 }
 
@@ -127,12 +127,12 @@ void anon_vma_unlink(struct vm_area_stru
 	if (!anon_vma)
 		return;
 
-	spin_lock(&anon_vma->lock);
+	write_lock(&anon_vma->rwlock);
 	list_del(&vma->anon_vma_node);
 
 	/* We must garbage collect the anon_vma if it's empty */
 	empty = list_empty(&anon_vma->head);
-	spin_unlock(&anon_vma->lock);
+	write_unlock(&anon_vma->rwlock);
 
 	if (empty)
 		anon_vma_free(anon_vma);
@@ -142,7 +142,7 @@ static void anon_vma_ctor(struct kmem_ca
 {
 	struct anon_vma *anon_vma = data;
 
-	spin_lock_init(&anon_vma->lock);
+ 	rwlock_init(&anon_vma->rwlock);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -169,7 +169,7 @@ static struct anon_vma *page_lock_anon_v
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	read_lock(&anon_vma->rwlock);
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -178,7 +178,7 @@ out:
 
 static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
-	spin_unlock(&anon_vma->lock);
+	read_unlock(&anon_vma->rwlock);
 	rcu_read_unlock();
 }
 
Index: Linux/mm/mmap.c
===================================================================
--- Linux.orig/mm/mmap.c	2007-11-28 10:54:36.000000000 -0500
+++ Linux/mm/mmap.c	2007-11-28 11:12:37.000000000 -0500
@@ -564,7 +564,7 @@ again:			remove_next = 1 + (end > next->
 	if (vma->anon_vma)
 		anon_vma = vma->anon_vma;
 	if (anon_vma) {
-		spin_lock(&anon_vma->lock);
+		write_lock(&anon_vma->rwlock);
 		/*
 		 * Easily overlooked: when mprotect shifts the boundary,
 		 * make sure the expanding vma has anon_vma set if the
@@ -618,7 +618,7 @@ again:			remove_next = 1 + (end > next->
 	}
 
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		write_unlock(&anon_vma->rwlock);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
 
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c	2007-11-28 10:54:36.000000000 -0500
+++ Linux/mm/migrate.c	2007-11-28 11:12:37.000000000 -0500
@@ -229,12 +229,12 @@ static void remove_anon_migration_ptes(s
 	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	read_lock(&anon_vma->rwlock);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	read_unlock(&anon_vma->rwlock);
 }
 
 /*

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
  2007-12-18 21:15 ` [patch 01/20] convert anon_vma list lock to a read/write lock Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-19  0:48   ` Nick Piggin
  2007-12-18 21:15 ` [patch 03/20] move isolate_lru_page() to vmscan.c Rik van Riel
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: make-i_mmap_lock-rw.patch --]
[-- Type: text/plain, Size: 19390 bytes --]

I have seen soft CPU lockups in page_referenced_file() due to
contention on i_mmap_lock() for different pages.  Making the
i_mmap_lock a reader/writer lock should increase parallelism
in vmscan for file-backed pages mapped into many address spaces.

Read lock the i_mmap_lock for all usage except:

1) mmap/munmap:  linking vma into i_mmap prio_tree or removing
2) unmap_mapping_range:   protecting vm_truncate_count

rmap:  try_to_unmap_file() required a new cond_resched_rwlock().
To reduce code duplication, I recast cond_resched_lock() as a
[static inline] wrapper around the reworked
__cond_resched_lock(void *lock, int type).
The new cond_resched_rwlock() is implemented as another wrapper.
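
As a usage sketch (illustrative only; walk_nonlinear_vmas() is a
hypothetical helper, and the real caller is the try_to_unmap_file()
hunk below), a long walk under the read side of i_mmap_lock can now
offer to reschedule without open-coding the unlock/relock dance:

static void walk_nonlinear_vmas(struct address_space *mapping)
{
	struct vm_area_struct *vma;

	read_lock(&mapping->i_mmap_lock);
	list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
			    shared.vm_set.list) {
		/* ... per-vma work ... */
		cond_resched_rwlock(&mapping->i_mmap_lock, 0); /* 0 == read */
	}
	read_unlock(&mapping->i_mmap_lock);
}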


Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc5-mm1/include/linux/fs.h
===================================================================
--- linux-2.6.24-rc5-mm1.orig/include/linux/fs.h
+++ linux-2.6.24-rc5-mm1/include/linux/fs.h
@@ -501,7 +501,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	rwlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
Index: linux-2.6.24-rc5-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.24-rc5-mm1.orig/include/linux/mm.h
+++ linux-2.6.24-rc5-mm1/include/linux/mm.h
@@ -707,7 +707,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	rwlock_t *i_mmap_lock;			/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
Index: linux-2.6.24-rc5-mm1/fs/inode.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/fs/inode.c
+++ linux-2.6.24-rc5-mm1/fs/inode.c
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+ 	rwlock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6.24-rc5-mm1/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/fs/hugetlbfs/inode.c
+++ linux-2.6.24-rc5-mm1/fs/hugetlbfs/inode.c
@@ -420,6 +420,9 @@ static void hugetlbfs_drop_inode(struct 
 		hugetlbfs_forget_inode(inode);
 }
 
+/*
+ * LOCKING:  __unmap_hugepage_range() requires write lock on i_mmap_lock
+ */
 static inline void
 hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 {
@@ -454,10 +457,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	write_lock(&mapping->i_mmap_lock);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	write_unlock(&mapping->i_mmap_lock);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
Index: linux-2.6.24-rc5-mm1/kernel/fork.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/kernel/fork.c
+++ linux-2.6.24-rc5-mm1/kernel/fork.c
@@ -272,12 +272,12 @@ static int dup_mmap(struct mm_struct *mm
 				atomic_dec(&inode->i_writecount);
 
 			/* insert tmp into the share list, just after mpnt */
-			spin_lock(&file->f_mapping->i_mmap_lock);
+			write_lock(&file->f_mapping->i_mmap_lock);
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(file->f_mapping);
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(file->f_mapping);
-			spin_unlock(&file->f_mapping->i_mmap_lock);
+			write_unlock(&file->f_mapping->i_mmap_lock);
 		}
 
 		/*
Index: linux-2.6.24-rc5-mm1/mm/filemap_xip.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/filemap_xip.c
+++ linux-2.6.24-rc5-mm1/mm/filemap_xip.c
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapp
 	if (!page)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	read_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		mm = vma->vm_mm;
 		address = vma->vm_start +
@@ -200,7 +200,7 @@ __xip_unmap (struct address_space * mapp
 			page_cache_release(page);
 		}
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&mapping->i_mmap_lock);
 }
 
 /*
Index: linux-2.6.24-rc5-mm1/mm/fremap.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/fremap.c
+++ linux-2.6.24-rc5-mm1/mm/fremap.c
@@ -202,13 +202,13 @@ asmlinkage long sys_remap_file_pages(uns
 			}
 			goto out;
 		}
-		spin_lock(&mapping->i_mmap_lock);
+		write_lock(&mapping->i_mmap_lock);
 		flush_dcache_mmap_lock(mapping);
 		vma->vm_flags |= VM_NONLINEAR;
 		vma_prio_tree_remove(vma, &mapping->i_mmap);
 		vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
 		flush_dcache_mmap_unlock(mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		write_unlock(&mapping->i_mmap_lock);
 	}
 
 	err = populate_range(mm, vma, start, size, pgoff);
Index: linux-2.6.24-rc5-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/hugetlb.c
+++ linux-2.6.24-rc5-mm1/mm/hugetlb.c
@@ -721,9 +721,9 @@ void unmap_hugepage_range(struct vm_area
 	 * do nothing in this case.
 	 */
 	if (vma->vm_file) {
-		spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+		write_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 		__unmap_hugepage_range(vma, start, end);
-		spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+		write_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 	}
 }
 
@@ -964,7 +964,7 @@ void hugetlb_change_protection(struct vm
 	BUG_ON(address >= end);
 	flush_cache_range(vma, address, end);
 
-	spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
+	read_lock(&vma->vm_file->f_mapping->i_mmap_lock);
 	spin_lock(&mm->page_table_lock);
 	for (; address < end; address += HPAGE_SIZE) {
 		ptep = huge_pte_offset(mm, address);
@@ -979,7 +979,7 @@ void hugetlb_change_protection(struct vm
 		}
 	}
 	spin_unlock(&mm->page_table_lock);
-	spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+	read_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
 
 	flush_tlb_range(vma, start, end);
 }
Index: linux-2.6.24-rc5-mm1/mm/memory.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/memory.c
+++ linux-2.6.24-rc5-mm1/mm/memory.c
@@ -811,7 +811,7 @@ unsigned long unmap_vmas(struct mmu_gath
 	unsigned long tlb_start = 0;	/* For tlb_finish_mmu */
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
-	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
+	rwlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
 	int fullmm = (*tlbp)->fullmm;
 
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
@@ -1727,7 +1727,7 @@ unwritable_page:
  * can't efficiently keep all vmas in step with mapping->truncate_count:
  * so instead reset them all whenever it wraps back to 0 (then go to 1).
  * mapping->truncate_count and vma->vm_truncate_count are protected by
- * i_mmap_lock.
+ * write locked i_mmap_lock.
  *
  * In order to make forward progress despite repeatedly restarting some
  * large vma, note the restart_addr from unmap_vmas when it breaks out:
@@ -1792,9 +1792,10 @@ again:
 			goto again;
 	}
 
-	spin_unlock(details->i_mmap_lock);
+//TODO:  why not cond_resched_lock() here [rwlock version]?
+	write_unlock(details->i_mmap_lock);
 	cond_resched();
-	spin_lock(details->i_mmap_lock);
+	write_lock(details->i_mmap_lock);
 	return -EINTR;
 }
 
@@ -1890,7 +1891,7 @@ void unmap_mapping_range(struct address_
 		details.last_index = ULONG_MAX;
 	details.i_mmap_lock = &mapping->i_mmap_lock;
 
-	spin_lock(&mapping->i_mmap_lock);
+	write_lock(&mapping->i_mmap_lock);
 
 	/* Protect against endless unmapping loops */
 	mapping->truncate_count++;
@@ -1905,7 +1906,7 @@ void unmap_mapping_range(struct address_
 		unmap_mapping_range_tree(&mapping->i_mmap, &details);
 	if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
 		unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-	spin_unlock(&mapping->i_mmap_lock);
+	write_unlock(&mapping->i_mmap_lock);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
Index: linux-2.6.24-rc5-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/migrate.c
+++ linux-2.6.24-rc5-mm1/mm/migrate.c
@@ -202,12 +202,12 @@ static void remove_file_migration_ptes(s
 	if (!mapping)
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	read_lock(&mapping->i_mmap_lock);
 
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&mapping->i_mmap_lock);
 }
 
 /*
Index: linux-2.6.24-rc5-mm1/mm/mmap.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/mmap.c
+++ linux-2.6.24-rc5-mm1/mm/mmap.c
@@ -186,7 +186,7 @@ error:
 }
 
 /*
- * Requires inode->i_mapping->i_mmap_lock
+ * Requires write locked inode->i_mapping->i_mmap_lock
  */
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		struct file *file, struct address_space *mapping)
@@ -214,9 +214,9 @@ void unlink_file_vma(struct vm_area_stru
 
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		write_lock(&mapping->i_mmap_lock);
 		__remove_shared_vm_struct(vma, file, mapping);
-		spin_unlock(&mapping->i_mmap_lock);
+		write_unlock(&mapping->i_mmap_lock);
 	}
 }
 
@@ -439,7 +439,7 @@ static void vma_link(struct mm_struct *m
 		mapping = vma->vm_file->f_mapping;
 
 	if (mapping) {
-		spin_lock(&mapping->i_mmap_lock);
+		write_lock(&mapping->i_mmap_lock);
 		vma->vm_truncate_count = mapping->truncate_count;
 	}
 	anon_vma_lock(vma);
@@ -449,7 +449,7 @@ static void vma_link(struct mm_struct *m
 
 	anon_vma_unlock(vma);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		write_unlock(&mapping->i_mmap_lock);
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -536,7 +536,7 @@ again:			remove_next = 1 + (end > next->
 		mapping = file->f_mapping;
 		if (!(vma->vm_flags & VM_NONLINEAR))
 			root = &mapping->i_mmap;
-		spin_lock(&mapping->i_mmap_lock);
+		write_lock(&mapping->i_mmap_lock);
 		if (importer &&
 		    vma->vm_truncate_count != next->vm_truncate_count) {
 			/*
@@ -620,7 +620,7 @@ again:			remove_next = 1 + (end > next->
 	if (anon_vma)
 		write_unlock(&anon_vma->rwlock);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		write_unlock(&mapping->i_mmap_lock);
 
 	if (remove_next) {
 		if (file)
@@ -2061,7 +2061,7 @@ void exit_mmap(struct mm_struct *mm)
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
- * then i_mmap_lock is taken here.
+ * then i_mmap_lock is write locked here.
  */
 int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
 {
Index: linux-2.6.24-rc5-mm1/mm/mremap.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/mremap.c
+++ linux-2.6.24-rc5-mm1/mm/mremap.c
@@ -83,7 +83,7 @@ static void move_ptes(struct vm_area_str
 		 * and we propagate stale pages into the dst afterward.
 		 */
 		mapping = vma->vm_file->f_mapping;
-		spin_lock(&mapping->i_mmap_lock);
+		read_lock(&mapping->i_mmap_lock);
 		if (new_vma->vm_truncate_count &&
 		    new_vma->vm_truncate_count != vma->vm_truncate_count)
 			new_vma->vm_truncate_count = 0;
@@ -115,7 +115,7 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_nested(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
-		spin_unlock(&mapping->i_mmap_lock);
+		read_unlock(&mapping->i_mmap_lock);
 }
 
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
Index: linux-2.6.24-rc5-mm1/mm/rmap.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/mm/rmap.c
+++ linux-2.6.24-rc5-mm1/mm/rmap.c
@@ -368,7 +368,7 @@ static int page_referenced_file(struct p
 	 */
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	read_lock(&mapping->i_mmap_lock);
 
 	/*
 	 * i_mmap_lock does not stabilize mapcount at all, but mapcount
@@ -394,7 +394,7 @@ static int page_referenced_file(struct p
 			break;
 	}
 
-	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&mapping->i_mmap_lock);
 	return referenced;
 }
 
@@ -474,12 +474,12 @@ static int page_mkclean_file(struct addr
 
 	BUG_ON(PageAnon(page));
 
-	spin_lock(&mapping->i_mmap_lock);
+	read_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		if (vma->vm_flags & VM_SHARED)
 			ret += page_mkclean_one(page, vma);
 	}
-	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
@@ -909,7 +909,7 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 
-	spin_lock(&mapping->i_mmap_lock);
+	read_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
@@ -946,7 +946,7 @@ static int try_to_unmap_file(struct page
 	mapcount = page_mapcount(page);
 	if (!mapcount)
 		goto out;
-	cond_resched_lock(&mapping->i_mmap_lock);
+	cond_resched_rwlock(&mapping->i_mmap_lock, 0);
 
 	max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK;
 	if (max_nl_cursor == 0)
@@ -968,7 +968,7 @@ static int try_to_unmap_file(struct page
 			}
 			vma->vm_private_data = (void *) max_nl_cursor;
 		}
-		cond_resched_lock(&mapping->i_mmap_lock);
+		cond_resched_rwlock(&mapping->i_mmap_lock, 0);
 		max_nl_cursor += CLUSTER_SIZE;
 	} while (max_nl_cursor <= max_nl_size);
 
@@ -980,7 +980,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&mapping->i_mmap_lock);
 	return ret;
 }
 
Index: linux-2.6.24-rc5-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.24-rc5-mm1.orig/include/linux/sched.h
+++ linux-2.6.24-rc5-mm1/include/linux/sched.h
@@ -1889,12 +1889,23 @@ static inline int need_resched(void)
  * cond_resched() and cond_resched_lock(): latency reduction via
  * explicit rescheduling in places that are safe. The return
  * value indicates whether a reschedule was done in fact.
- * cond_resched_lock() will drop the spinlock before scheduling,
- * cond_resched_softirq() will enable bhs before scheduling.
+ * cond_resched_softirq() will enable bhs before scheduling,
+ * cond_resched_*lock() will drop the *lock before scheduling.
  */
 extern int cond_resched(void);
-extern int cond_resched_lock(spinlock_t * lock);
 extern int cond_resched_softirq(void);
+extern int __cond_resched_lock(void * lock, int lock_type);
+
+#define COND_RESCHED_SPIN  2
+static inline int cond_resched_lock(spinlock_t * lock)
+{
+	return __cond_resched_lock(lock, COND_RESCHED_SPIN);
+}
+
+static inline int cond_resched_rwlock(rwlock_t * lock, int write_lock)
+{
+	return __cond_resched_lock(lock, !!write_lock);
+}
 
 /*
  * Does a critical section need to be broken due to another
Index: linux-2.6.24-rc5-mm1/kernel/sched.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/kernel/sched.c
+++ linux-2.6.24-rc5-mm1/kernel/sched.c
@@ -4686,34 +4686,78 @@ int __sched cond_resched(void)
 EXPORT_SYMBOL(cond_resched);
 
 /*
- * cond_resched_lock() - if a reschedule is pending, drop the given lock,
+ * helper functions for __cond_resched_lock()
+ */
+static int __need_lockbreak(void *lock, int type)
+{
+	if (likely(type == COND_RESCHED_SPIN))
+		return need_lockbreak((spinlock_t *)lock);
+	else
+		return need_lockbreak((rwlock_t *)lock);
+}
+
+static void __reacquire_lock(void *lock, int type)
+{
+	if (likely(type == COND_RESCHED_SPIN))
+		spin_lock((spinlock_t *)lock);
+	else if (type)
+		write_lock((rwlock_t *)lock);
+	else
+		read_lock((rwlock_t *)lock);
+}
+
+static void __drop_lock(void *lock, int type)
+{
+	if (likely(type == COND_RESCHED_SPIN))
+		spin_unlock((spinlock_t *)lock);
+	else if (type)
+		write_unlock((rwlock_t *)lock);
+	else
+		read_unlock((rwlock_t *)lock);
+}
+
+static void __release_lock(void *lock, int type)
+{
+	if (likely(type == COND_RESCHED_SPIN))
+		spin_release(&((spinlock_t *)lock)->dep_map, 1, _RET_IP_);
+	else
+		rwlock_release(&((rwlock_t *)lock)->dep_map, 1, _RET_IP_);
+}
+
+/*
+ * __cond_resched_lock() - if a reschedule is pending, drop the given lock,
  * call schedule, and on return reacquire the lock.
  *
+ * Lock type:
+ *  0 = rwlock held for read
+ *  1 = rwlock held for write
+ *  2 = COND_RESCHED_SPIN = spinlock
+ *
  * This works OK both with and without CONFIG_PREEMPT. We do strange low-level
  * operations here to prevent schedule() from being called twice (once via
- * spin_unlock(), once by hand).
+ * *_unlock(), once by hand).
  */
-int cond_resched_lock(spinlock_t *lock)
+int __cond_resched_lock(void *lock, int type)
 {
 	int ret = 0;
 
-	if (need_lockbreak(lock)) {
-		spin_unlock(lock);
+	if (__need_lockbreak(lock, type)) {
+		__drop_lock(lock, type);
 		cpu_relax();
 		ret = 1;
-		spin_lock(lock);
+		__reacquire_lock(lock, type);
 	}
 	if (need_resched() && system_state == SYSTEM_RUNNING) {
-		spin_release(&lock->dep_map, 1, _THIS_IP_);
-		_raw_spin_unlock(lock);
+		__release_lock(lock, type);
+		__drop_lock(lock, type);
 		preempt_enable_no_resched();
 		__cond_resched();
 		ret = 1;
-		spin_lock(lock);
+		__reacquire_lock(lock, type);
 	}
 	return ret;
 }
-EXPORT_SYMBOL(cond_resched_lock);
+EXPORT_SYMBOL(__cond_resched_lock);
 
 int __sched cond_resched_softirq(void)
 {
Index: linux-2.6.24-rc5-mm1/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/arch/x86/mm/hugetlbpage.c
+++ linux-2.6.24-rc5-mm1/arch/x86/mm/hugetlbpage.c
@@ -68,7 +68,7 @@ static void huge_pmd_share(struct mm_str
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	read_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -93,7 +93,7 @@ static void huge_pmd_share(struct mm_str
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&mapping->i_mmap_lock);
 }
 
 /*

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 03/20] move isolate_lru_page() to vmscan.c
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
  2007-12-18 21:15 ` [patch 01/20] convert anon_vma list lock to a read/write lock Rik van Riel
  2007-12-18 21:15 ` [patch 02/20] make the inode i_mmap_lock a reader/writer lock Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-20  7:08   ` Christoph Lameter
  2007-12-18 21:15 ` [patch 04/20] free swap space on swap-in/activation Rik van Riel
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Nick Piggin, Lee Schermerhorn

[-- Attachment #1: np-01-move-and-rework-isolate_lru_page-v2.patch --]
[-- Type: text/plain, Size: 6783 bytes --]

V1 -> V2 [lts]:
+  fix botched merge -- add back "get_page_unless_zero()"

  From: Nick Piggin <npiggin@suse.de>
  To: Linux Memory Management <linux-mm@kvack.org>
  Subject: [patch 1/4] mm: move and rework isolate_lru_page
  Date:	Mon, 12 Mar 2007 07:38:44 +0100 (CET)

isolate_lru_page() logically belongs in vmscan.c rather than in migrate.c.

It is a tough call, because we don't need that function without memory
migration, so there is a valid argument for keeping it in migrate.c.
However, a subsequent patch needs to make use of it in the core mm, so
we can happily move it to vmscan.c.

Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
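
For example, a caller that already holds a reference on the page (as
the reworked function requires) ends up with something like this
illustrative fragment; the real conversions are in the migrate.c and
mempolicy.c hunks below:

	err = isolate_lru_page(page);	/* page ref already held */
	if (!err)
		list_add_tail(&page->lru, &pagelist);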

	Note that we now have '__isolate_lru_page()', which does
	something quite different and is visible outside of vmscan.c
	for use with the memory controller.  Methinks we need to
	rationalize these names/purposes.	--lts

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.24-rc3-mm2/include/linux/migrate.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/migrate.h
+++ linux-2.6.24-rc3-mm2/include/linux/migrate.h
@@ -25,7 +25,6 @@ static inline int vma_migratable(struct 
 	return 1;
 }
 
-extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -42,8 +41,6 @@ extern int migrate_vmas(struct mm_struct
 static inline int vma_migratable(struct vm_area_struct *vma)
 					{ return 0; }
 
-static inline int isolate_lru_page(struct page *p, struct list_head *list)
-					{ return -ENOSYS; }
 static inline int putback_lru_pages(struct list_head *l) { return 0; }
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private) { return -ENOSYS; }
Index: linux-2.6.24-rc3-mm2/mm/internal.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/internal.h
+++ linux-2.6.24-rc3-mm2/mm/internal.h
@@ -34,6 +34,8 @@ static inline void __put_page(struct pag
 	atomic_dec(&page->_count);
 }
 
+extern int isolate_lru_page(struct page *page);
+
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
Index: linux-2.6.24-rc3-mm2/mm/migrate.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/migrate.c
+++ linux-2.6.24-rc3-mm2/mm/migrate.c
@@ -36,36 +36,6 @@
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 /*
- * Isolate one page from the LRU lists. If successful put it onto
- * the indicated list with elevated page count.
- *
- * Result:
- *  -EBUSY: page not on LRU list
- *  0: page removed from LRU list and added to the specified list.
- */
-int isolate_lru_page(struct page *page, struct list_head *pagelist)
-{
-	int ret = -EBUSY;
-
-	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
-
-		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page) && get_page_unless_zero(page)) {
-			ret = 0;
-			ClearPageLRU(page);
-			if (PageActive(page))
-				del_page_from_active_list(zone, page);
-			else
-				del_page_from_inactive_list(zone, page);
-			list_add_tail(&page->lru, pagelist);
-		}
-		spin_unlock_irq(&zone->lru_lock);
-	}
-	return ret;
-}
-
-/*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page().
  */
@@ -853,14 +823,17 @@ static int do_move_pages(struct mm_struc
 				!migrate_all)
 			goto put_and_set;
 
-		err = isolate_lru_page(page, &pagelist);
+		err = isolate_lru_page(page);
+		if (err) {
 put_and_set:
-		/*
-		 * Either remove the duplicate refcount from
-		 * isolate_lru_page() or drop the page ref if it was
-		 * not isolated.
-		 */
-		put_page(page);
+			/*
+			 * Either remove the duplicate refcount from
+			 * isolate_lru_page() or drop the page ref if it was
+			 * not isolated.
+			 */
+			put_page(page);
+		} else
+			list_add_tail(&page->lru, &pagelist);
 set_status:
 		pp->status = err;
 	}
Index: linux-2.6.24-rc3-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmscan.c
+++ linux-2.6.24-rc3-mm2/mm/vmscan.c
@@ -829,6 +829,47 @@ static unsigned long clear_active_flags(
 	return nr_active;
 }
 
+/**
+ * isolate_lru_page(@page)
+ *
+ * Isolate one @page from the LRU lists. Must be called with an elevated
+ * refcount on the page, which is a fundamental difference from
+ * isolate_lru_pages (which is called without a stable reference).
+ *
+ * The returned page will have PageLRU() cleared.  If it was found on
+ * the active list, it will have PageActive set; that flag generally
+ * needs to be cleared by the caller before letting the page go.
+ *
+ * The vmstat page counts corresponding to the list on which the page was
+ * found will be decremented.
+ *
+ * lru_lock must not be held, interrupts must be enabled.
+ *
+ * Returns:
+ *  -EBUSY: page not on LRU list
+ *  0: page removed from LRU list.
+ */
+int isolate_lru_page(struct page *page)
+{
+	int ret = -EBUSY;
+
+	if (PageLRU(page)) {
+		struct zone *zone = page_zone(page);
+
+		spin_lock_irq(&zone->lru_lock);
+		if (PageLRU(page) && get_page_unless_zero(page)) {
+			ret = 0;
+			ClearPageLRU(page);
+			if (PageActive(page))
+				del_page_from_active_list(zone, page);
+			else
+				del_page_from_inactive_list(zone, page);
+		}
+		spin_unlock_irq(&zone->lru_lock);
+	}
+	return ret;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
Index: linux-2.6.24-rc3-mm2/mm/mempolicy.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/mempolicy.c
+++ linux-2.6.24-rc3-mm2/mm/mempolicy.c
@@ -93,6 +93,8 @@
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
 
+#include "internal.h"
+
 /* Internal flags */
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
@@ -603,8 +605,12 @@ static void migrate_page_add(struct page
 	/*
 	 * Avoid migrating a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
-		isolate_lru_page(page, pagelist);
+	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
+		if (!isolate_lru_page(page)) {
+			get_page(page);
+			list_add_tail(&page->lru, pagelist);
+		}
+	}
 }
 
 static struct page *new_node_page(struct page *page, unsigned long node, int **x)

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 04/20] free swap space on swap-in/activation
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (2 preceding siblings ...)
  2007-12-18 21:15 ` [patch 03/20] move isolate_lru_page() to vmscan.c Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 05/20] define page_file_cache() function Rik van Riel
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: rvr-00-linux-2.6-swapfree.patch --]
[-- Type: text/plain, Size: 2781 bytes --]

+ lts' convert anon_vma list lock to read/write lock patch
+ Nick Piggin's move and rework isolate_lru_page() patch

Free swap cache entries when swapping in pages if vm_swap_full()
[i.e. more than half of swap space is in use].  Uses a new pagevec
function to reduce pressure on locks.
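
The pattern in the shrink_active_list() hunks below, shown here as an
illustrative fragment, is to trim the swap cache in a batch only after
the hot zone->lru_lock has been dropped, and only when swap is getting
tight:

	spin_unlock_irq(&zone->lru_lock);
	if (vm_swap_full())			/* > 1/2 of swap in use */
		pagevec_swap_free(&pvec);	/* batched, outside lru_lock */
	__pagevec_release(&pvec);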

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Index: linux-2.6.24-rc3-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmscan.c
+++ linux-2.6.24-rc3-mm2/mm/vmscan.c
@@ -632,6 +632,9 @@ free_it:
 		continue;
 
 activate_locked:
+		/* Not a candidate for swapping, so reclaim swap space. */
+		if (PageSwapCache(page) && vm_swap_full())
+			remove_exclusive_swap_page(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -1213,6 +1216,8 @@ static void shrink_active_list(unsigned 
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 			pgmoved = 0;
 			spin_unlock_irq(&zone->lru_lock);
+			if (vm_swap_full())
+				pagevec_swap_free(&pvec);
 			__pagevec_release(&pvec);
 			spin_lock_irq(&zone->lru_lock);
 		}
@@ -1222,6 +1227,8 @@ static void shrink_active_list(unsigned 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
+	if (vm_swap_full())
+		pagevec_swap_free(&pvec);
 
 	pagevec_release(&pvec);
 }
Index: linux-2.6.24-rc3-mm2/mm/swap.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/swap.c
+++ linux-2.6.24-rc3-mm2/mm/swap.c
@@ -465,6 +465,24 @@ void pagevec_strip(struct pagevec *pvec)
 	}
 }
 
+/*
+ * Try to free swap space from the pages in a pagevec
+ */
+void pagevec_swap_free(struct pagevec *pvec)
+{
+	int i;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+
+		if (PageSwapCache(page) && !TestSetPageLocked(page)) {
+			if (PageSwapCache(page))
+				remove_exclusive_swap_page(page);
+			unlock_page(page);
+		}
+	}
+}
+
 /**
  * pagevec_lookup - gang pagecache lookup
  * @pvec:	Where the resulting pages are placed
Index: linux-2.6.24-rc3-mm2/include/linux/pagevec.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/pagevec.h
+++ linux-2.6.24-rc3-mm2/include/linux/pagevec.h
@@ -26,6 +26,7 @@ void __pagevec_free(struct pagevec *pvec
 void __pagevec_lru_add(struct pagevec *pvec);
 void __pagevec_lru_add_active(struct pagevec *pvec);
 void pagevec_strip(struct pagevec *pvec);
+void pagevec_swap_free(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
 unsigned pagevec_lookup_tag(struct pagevec *pvec,

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 05/20] define page_file_cache() function
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (3 preceding siblings ...)
  2007-12-18 21:15 ` [patch 04/20] free swap space on swap-in/activation Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 06/20] debugging checks for page_file_cache() Rik van Riel
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: rvr-01-linux-2.6-page_file_cache.patch --]
[-- Type: text/plain, Size: 6439 bytes --]

Define the page_file_cache() function to answer the question:
	is this page backed by a file?

Originally part of Rik van Riel's split-lru patch.  Extracted
to make available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will
be needed by subsequent "split LRU" and "noreclaim" patches.  

Unfortunately this needs to use a page flag, since the
PG_swapbacked state needs to be preserved all the way
to the point where the page is last removed from the
LRU.  Trying to derive the status from other info in
the page resulted in wrong VM statistics in earlier
split VM patchsets.
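
The unusual return value of 2 (rather than 1) is meant to let later
patches use page_file_cache() directly as an offset into an indexed
array of LRU lists, roughly along these lines (index names are
illustrative only, not part of this patch):

	lru = LRU_BASE + page_file_cache(page);	/* 0 = anon, 2 = file */
	if (PageActive(page))
		lru += LRU_ACTIVE;		/* active list of that type */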


Signed-off-by:  Rik van Riel <riel@redhat.com>
Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>


Index: linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mm_inline.h
+++ linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
@@ -1,3 +1,24 @@
+#ifndef LINUX_MM_INLINE_H
+#define LINUX_MM_INLINE_H
+
+/**
+ * page_file_cache(@page)
+ * Returns !0 if @page is a page cache page backed by a regular filesystem,
+ * or 0 if @page is anonymous, tmpfs, or otherwise RAM or swap backed.
+ *
+ * We would like to get this info without a page flag, but the state
+ * needs to survive until the page is last deleted from the LRU, which
+ * could be as far down as __page_cache_release.
+ */
+static inline int page_file_cache(struct page *page)
+{
+	if (PageSwapBacked(page))
+		return 0;
+
+	/* The page is page cache backed by a normal filesystem. */
+	return 2;
+}
+
 static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
 {
@@ -38,3 +59,4 @@ del_page_from_lru(struct zone *zone, str
 	}
 }
 
+#endif
Index: linux-2.6.24-rc3-mm2/mm/shmem.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/shmem.c
+++ linux-2.6.24-rc3-mm2/mm/shmem.c
@@ -1279,6 +1279,7 @@ repeat:
 				goto failed;
 			}
 
+			SetPageSwapBacked(filepage);
 			spin_lock(&info->lock);
 			entry = shmem_swp_alloc(info, idx, sgp);
 			if (IS_ERR(entry))
Index: linux-2.6.24-rc3-mm2/include/linux/page-flags.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/page-flags.h
+++ linux-2.6.24-rc3-mm2/include/linux/page-flags.h
@@ -89,6 +89,7 @@
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_swapbacked		20	/* Page is backed by RAM/swap */
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
@@ -216,6 +217,10 @@ static inline void SetPageUptodate(struc
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
+#define PageSwapBacked(page)	test_bit(PG_swapbacked, &(page)->flags)
+#define SetPageSwapBacked(page)	set_bit(PG_swapbacked, &(page)->flags)
+#define __ClearPageSwapBacked(page)	__clear_bit(PG_swapbacked, &(page)->flags)
+
 #define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
 #define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
 #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
Index: linux-2.6.24-rc3-mm2/mm/memory.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/memory.c
+++ linux-2.6.24-rc3-mm2/mm/memory.c
@@ -1664,6 +1664,7 @@ gotten:
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
+		SetPageSwapBacked(new_page);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
@@ -2132,6 +2133,7 @@ static int do_anonymous_page(struct mm_s
 	if (!pte_none(*page_table))
 		goto release;
 	inc_mm_counter(mm, anon_rss);
+	SetPageSwapBacked(page);
 	lru_cache_add_active(page);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
@@ -2285,6 +2287,7 @@ static int __do_fault(struct mm_struct *
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
                         inc_mm_counter(mm, anon_rss);
+			SetPageSwapBacked(page);
                         lru_cache_add_active(page);
                         page_add_new_anon_rmap(page, vma, address);
 		} else {
Index: linux-2.6.24-rc3-mm2/mm/swap_state.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/swap_state.c
+++ linux-2.6.24-rc3-mm2/mm/swap_state.c
@@ -86,6 +86,7 @@ static int __add_to_swap_cache(struct pa
 		if (!error) {
 			page_cache_get(page);
 			SetPageSwapCache(page);
+			SetPageSwapBacked(page);
 			set_page_private(page, entry.val);
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
Index: linux-2.6.24-rc3-mm2/mm/migrate.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/migrate.c
+++ linux-2.6.24-rc3-mm2/mm/migrate.c
@@ -546,6 +546,8 @@ static int move_to_new_page(struct page 
 	/* Prepare mapping for the new page.*/
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
+	if (PageSwapBacked(page))
+		SetPageSwapBacked(newpage);
 
 	mapping = page_mapping(page);
 	if (!mapping)
Index: linux-2.6.24-rc3-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/page_alloc.c
+++ linux-2.6.24-rc3-mm2/mm/page_alloc.c
@@ -253,6 +253,7 @@ static void bad_page(struct page *page)
 			1 << PG_slab    |
 			1 << PG_swapcache |
 			1 << PG_writeback |
+			1 << PG_swapbacked |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -486,6 +487,8 @@ static inline int free_pages_check(struc
 		bad_page(page);
 	if (PageDirty(page))
 		__ClearPageDirty(page);
+	if (PageSwapBacked(page))
+		__ClearPageSwapBacked(page);
 	/*
 	 * For now, we report if PG_reserved was found set, but do not
 	 * clear it, and do not free the page.  But we shall soon need
@@ -632,6 +635,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_swapbacked |
 			1 << PG_buddy ))))
 		bad_page(page);
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 06/20] debugging checks for page_file_cache()
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (4 preceding siblings ...)
  2007-12-18 21:15 ` [patch 05/20] define page_file_cache() function Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 07/20] Use an indexed array for LRU variables Rik van Riel
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn

[-- Attachment #1: rvr-page_file_cache-debug.patch --]
[-- Type: text/plain, Size: 2231 bytes --]

Add debugging checks for whether we end up classifying the wrong
pages as filesystem backed.  These have not triggered in stress
tests on my system, but who knows...

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mm_inline.h
+++ linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
@@ -1,6 +1,8 @@
 #ifndef LINUX_MM_INLINE_H
 #define LINUX_MM_INLINE_H
 
+#include <linux/fs.h>  /* for struct address_space */
+
 /**
  * page_file_cache(@page)
  * Returns !0 if @page is page cache page backed by a regular filesystem,
@@ -10,11 +12,19 @@
  * needs to survive until the page is last deleted from the LRU, which
  * could be as far down as __page_cache_release.
  */
+extern const struct address_space_operations shmem_aops;
 static inline int page_file_cache(struct page *page)
 {
+	struct address_space * mapping = page_mapping(page);
+
 	if (PageSwapBacked(page))
 		return 0;
 
+	/* These pages should all be marked PG_swapbacked */
+	WARN_ON(PageAnon(page));
+	WARN_ON(PageSwapCache(page));
+	WARN_ON(mapping && mapping->a_ops && mapping->a_ops == &shmem_aops);
+
 	/* The page is page cache backed by a normal filesystem. */
 	return 2;
 }
Index: linux-2.6.24-rc3-mm2/mm/shmem.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/shmem.c
+++ linux-2.6.24-rc3-mm2/mm/shmem.c
@@ -178,7 +178,7 @@ static inline void shmem_unacct_blocks(u
 }
 
 static const struct super_operations shmem_ops;
-static const struct address_space_operations shmem_aops;
+const struct address_space_operations shmem_aops;
 static const struct file_operations shmem_file_operations;
 static const struct inode_operations shmem_inode_operations;
 static const struct inode_operations shmem_dir_inode_operations;
@@ -2232,7 +2232,7 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
-static const struct address_space_operations shmem_aops = {
+const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 07/20] Use an indexed array for LRU variables
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (5 preceding siblings ...)
  2007-12-18 21:15 ` [patch 06/20] debugging checks for page_file_cache() Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 08/20] split LRU lists into anon & file sets Rik van Riel
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn, Christoph Lameter

[-- Attachment #1: cl-use-indexed-array-of-lru-lists.patch --]
[-- Type: text/plain, Size: 14229 bytes --]

V1 -> V2 [lts]:
+ Remove extraneous  __dec_zone_state(zone, NR_ACTIVE) pointed
  out by Mel G.

From clameter@sgi.com Wed Aug 29 11:39:51 2007

Currently we are defining explicit variables for the inactive
and active lists. An indexed array can be more generic and avoids
repeating similar code in several places in the reclaim code.

We are saving a few bytes in terms of code size:

Before:

   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:

   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new LRU lists may ease future work on
the reclaim code.
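
As a rough illustration of that last point (LRU_NORECLAIM is a
hypothetical name, not part of this patch), adding a list becomes
little more than extending the enum; every init and scan site that
uses for_each_lru() picks the new list up automatically:

enum lru_list {
	LRU_INACTIVE,	/* must match order of NR_[IN]ACTIVE */
	LRU_ACTIVE,	/*  "     "     "   "       "        */
	/* LRU_NORECLAIM,  -- a later patch could slot one in here */
	NR_LRU_LISTS };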

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

 include/linux/mm_inline.h |   34 ++++++++---
 include/linux/mmzone.h    |   17 +++--
 mm/page_alloc.c           |    9 +--
 mm/swap.c                 |    2 
 mm/vmscan.c               |  132 ++++++++++++++++++++++------------------------
 mm/vmstat.c               |    3 -
 6 files changed, 107 insertions(+), 90 deletions(-)

Index: linux-2.6.24-rc3-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.24-rc3-mm2/include/linux/mmzone.h
@@ -80,8 +80,8 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_INACTIVE,
-	NR_ACTIVE,
+	NR_INACTIVE,	/* must match order of LRU_[IN]ACTIVE */
+	NR_ACTIVE,	/*  "     "     "   "       "         */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -105,6 +105,13 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+enum lru_list {
+	LRU_INACTIVE,	/* must match order of NR_[IN]ACTIVE */
+	LRU_ACTIVE,	/*  "     "     "   "       "        */
+	NR_LRU_LISTS };
+
+#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -258,10 +265,8 @@ struct zone {
 
 	/* Fields commonly accessed by the page reclaim scanner */
 	spinlock_t		lru_lock;	
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long		nr_scan_active;
-	unsigned long		nr_scan_inactive;
+	struct list_head	list[NR_LRU_LISTS];
+	unsigned long		nr_scan[NR_LRU_LISTS];
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
Index: linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mm_inline.h
+++ linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
@@ -30,43 +30,55 @@ static inline int page_file_cache(struct
 }
 
 static inline void
+add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+	list_add(&page->lru, &zone->list[l]);
+	__inc_zone_state(zone, NR_INACTIVE + l);
+}
+
+static inline void
+del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
+{
+	list_del(&page->lru);
+	__dec_zone_state(zone, NR_INACTIVE + l);
+}
+
+
+static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->active_list);
-	__inc_zone_state(zone, NR_ACTIVE);
+	add_page_to_lru_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 add_page_to_inactive_list(struct zone *zone, struct page *page)
 {
-	list_add(&page->lru, &zone->inactive_list);
-	__inc_zone_state(zone, NR_INACTIVE);
+	add_page_to_lru_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_ACTIVE);
+	del_page_from_lru_list(zone, page, LRU_ACTIVE);
 }
 
 static inline void
 del_page_from_inactive_list(struct zone *zone, struct page *page)
 {
-	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE);
+	del_page_from_lru_list(zone, page, LRU_INACTIVE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
+	enum lru_list l = LRU_INACTIVE;
+
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		__dec_zone_state(zone, NR_ACTIVE);
-	} else {
-		__dec_zone_state(zone, NR_INACTIVE);
+		l = LRU_ACTIVE;
 	}
+	__dec_zone_state(zone, NR_INACTIVE + l);
 }
 
 #endif
Index: linux-2.6.24-rc3-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/page_alloc.c
+++ linux-2.6.24-rc3-mm2/mm/page_alloc.c
@@ -3404,6 +3404,7 @@ static void __meminit free_area_init_cor
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, memmap_pages;
+		enum lru_list l;
 
 		size = zone_spanned_pages_in_node(nid, j, zones_size);
 		realsize = size - zone_absent_pages_in_node(nid, j,
@@ -3453,10 +3454,10 @@ static void __meminit free_area_init_cor
 		zone->prev_priority = DEF_PRIORITY;
 
 		zone_pcp_init(zone);
-		INIT_LIST_HEAD(&zone->active_list);
-		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
+		for_each_lru(l) {
+			INIT_LIST_HEAD(&zone->list[l]);
+			zone->nr_scan[l] = 0;
+		}
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
Index: linux-2.6.24-rc3-mm2/mm/swap.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/swap.c
+++ linux-2.6.24-rc3-mm2/mm/swap.c
@@ -118,7 +118,7 @@ static void pagevec_move_tail(struct pag
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page)) {
-			list_move_tail(&page->lru, &zone->inactive_list);
+			list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
 			pgmoved++;
 		}
 	}
Index: linux-2.6.24-rc3-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmscan.c
+++ linux-2.6.24-rc3-mm2/mm/vmscan.c
@@ -807,10 +807,10 @@ static unsigned long isolate_pages_globa
 					int active)
 {
 	if (active)
-		return isolate_lru_pages(nr, &z->active_list, dst,
+		return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
 						scanned, order, mode);
 	else
-		return isolate_lru_pages(nr, &z->inactive_list, dst,
+		return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
 						scanned, order, mode);
 }
 
@@ -956,10 +956,7 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (PageActive(page))
-				add_page_to_active_list(zone, page);
-			else
-				add_page_to_inactive_list(zone, page);
+			add_page_to_lru_list(zone, page, PageActive(page));
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -1127,11 +1124,14 @@ static void shrink_active_list(unsigned 
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
-	LIST_HEAD(l_inactive);	/* Pages to go onto the inactive_list */
-	LIST_HEAD(l_active);	/* Pages to go onto the active_list */
+	struct list_head list[NR_LRU_LISTS];
 	struct page *page;
 	struct pagevec pvec;
 	int reclaim_mapped = 0;
+	enum lru_list l;
+
+	for_each_lru(l)
+		INIT_LIST_HEAD(&list[l]);
 
 	if (sc->may_swap)
 		reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);
@@ -1159,28 +1159,28 @@ static void shrink_active_list(unsigned 
 			if (!reclaim_mapped ||
 			    (total_swap_pages == 0 && PageAnon(page)) ||
 			    page_referenced(page, 0, sc->mem_cgroup)) {
-				list_add(&page->lru, &l_active);
+				list_add(&page->lru, &list[LRU_ACTIVE]);
 				continue;
 			}
 		} else if (TestClearPageReferenced(page)) {
-			list_add(&page->lru, &l_active);
+			list_add(&page->lru, &list[LRU_ACTIVE]);
 			continue;
 		}
-		list_add(&page->lru, &l_inactive);
+		list_add(&page->lru, &list[LRU_INACTIVE]);
 	}
 
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
 	spin_lock_irq(&zone->lru_lock);
-	while (!list_empty(&l_inactive)) {
-		page = lru_to_page(&l_inactive);
-		prefetchw_prev_lru_page(page, &l_inactive, flags);
+	while (!list_empty(&list[LRU_INACTIVE])) {
+		page = lru_to_page(&list[LRU_INACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
 
-		list_move(&page->lru, &zone->inactive_list);
+		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
@@ -1203,13 +1203,13 @@ static void shrink_active_list(unsigned 
 	}
 
 	pgmoved = 0;
-	while (!list_empty(&l_active)) {
-		page = lru_to_page(&l_active);
-		prefetchw_prev_lru_page(page, &l_active, flags);
+	while (!list_empty(&list[LRU_ACTIVE])) {
+		page = lru_to_page(&list[LRU_ACTIVE]);
+		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
-		list_move(&page->lru, &zone->active_list);
+		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
@@ -1233,65 +1233,64 @@ static void shrink_active_list(unsigned 
 	pagevec_release(&pvec);
 }
 
+static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+	struct zone *zone, struct scan_control *sc, int priority)
+{
+	if (l == LRU_ACTIVE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority);
+		return 0;
+	}
+	return shrink_inactive_list(nr_to_scan, zone, sc);
+}
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static unsigned long shrink_zone(int priority, struct zone *zone,
 				struct scan_control *sc)
 {
-	unsigned long nr_active;
-	unsigned long nr_inactive;
+	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	enum lru_list l;
 
 	if (scan_global_lru(sc)) {
 		/*
 		 * Add one to nr_to_scan just to make sure that the kernel
 		 * will slowly sift through the active list.
 		 */
-		zone->nr_scan_active +=
-			(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-		nr_active = zone->nr_scan_active;
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-		nr_inactive = zone->nr_scan_inactive;
-		if (nr_inactive >= sc->swap_cluster_max)
-			zone->nr_scan_inactive = 0;
-		else
-			nr_inactive = 0;
-
-		if (nr_active >= sc->swap_cluster_max)
-			zone->nr_scan_active = 0;
-		else
-			nr_active = 0;
+		for_each_lru(l) {
+			zone->nr_scan[l] += (zone_page_state(zone,
+					NR_INACTIVE + l)  >> priority) + 1;
+			nr[l] = zone->nr_scan[l];
+			if (nr[l] >= sc->swap_cluster_max)
+				zone->nr_scan[l] = 0;
+			else
+				nr[l] = 0;
+		}
 	} else {
 		/*
 		 * This reclaim occurs not because zone memory shortage but
 		 * because memory controller hits its limit.
 		 * Then, don't modify zone reclaim related data.
 		 */
-		nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
+		nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
 					zone, priority);
 
-		nr_inactive = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
+		nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
 					zone, priority);
 	}
 
-
-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			nr_to_scan = min(nr_active,
+	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+		for_each_lru(l) {
+			if (nr[l]) {
+				nr_to_scan = min(nr[l],
 					(unsigned long)sc->swap_cluster_max);
-			nr_active -= nr_to_scan;
-			shrink_active_list(nr_to_scan, zone, sc, priority);
-		}
+				nr[l] -= nr_to_scan;
 
-		if (nr_inactive) {
-			nr_to_scan = min(nr_inactive,
-					(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= nr_to_scan;
-			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
-								sc);
+				nr_reclaimed += shrink_list(l, nr_to_scan,
+							zone, sc, priority);
+			}
 		}
 	}
 
@@ -1807,6 +1806,7 @@ static unsigned long shrink_all_zones(un
 {
 	struct zone *zone;
 	unsigned long nr_to_scan, ret = 0;
+	enum lru_list l;
 
 	for_each_zone(zone) {
 
@@ -1816,28 +1816,25 @@ static unsigned long shrink_all_zones(un
 		if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
 			continue;
 
-		/* For pass = 0 we don't shrink the active list */
-		if (pass > 0) {
-			zone->nr_scan_active +=
-				(zone_page_state(zone, NR_ACTIVE) >> prio) + 1;
-			if (zone->nr_scan_active >= nr_pages || pass > 3) {
-				zone->nr_scan_active = 0;
+		for_each_lru(l) {
+			/* For pass = 0 we don't shrink the active list */
+			if (pass == 0 && l == LRU_ACTIVE)
+				continue;
+
+			zone->nr_scan[l] +=
+				(zone_page_state(zone, NR_INACTIVE + l)
+								>> prio) + 1;
+			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
+				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
-					zone_page_state(zone, NR_ACTIVE));
-				shrink_active_list(nr_to_scan, zone, sc, prio);
+					zone_page_state(zone,
+							NR_INACTIVE + l));
+				ret += shrink_list(l, nr_to_scan, zone,
+								sc, prio);
+				if (ret >= nr_pages)
+					return ret;
 			}
 		}
-
-		zone->nr_scan_inactive +=
-			(zone_page_state(zone, NR_INACTIVE) >> prio) + 1;
-		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
-			zone->nr_scan_inactive = 0;
-			nr_to_scan = min(nr_pages,
-				zone_page_state(zone, NR_INACTIVE));
-			ret += shrink_inactive_list(nr_to_scan, zone, sc);
-			if (ret >= nr_pages)
-				return ret;
-		}
 	}
 
 	return ret;
Index: linux-2.6.24-rc3-mm2/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmstat.c
+++ linux-2.6.24-rc3-mm2/mm/vmstat.c
@@ -765,7 +765,8 @@ static void zoneinfo_show_print(struct s
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan_active, zone->nr_scan_inactive,
+		   zone->nr_scan[LRU_ACTIVE],
+		   zone->nr_scan[LRU_INACTIVE],
 		   zone->spanned_pages,
 		   zone->present_pages);
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 08/20] split LRU lists into anon & file sets
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (6 preceding siblings ...)
  2007-12-18 21:15 ` [patch 07/20] Use an indexed array for LRU variables Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 09/20] split anon & file LRUs for memcontrol code Rik van Riel
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: rvr-02-linux-2.6-vm-split-lrus.patch --]
[-- Type: text/plain, Size: 58084 bytes --]

Split the LRU lists in two, one set for pages that are backed by
real file systems ("file") and one for pages that are backed by
memory and swap ("anon").  The latter includes tmpfs.

Eventually mlocked pages will be taken off the LRUs altogether.
A patch for that already exists and just needs to be integrated
into this series.

This patch mostly has the infrastructure and a basic policy to
balance how much we scan the anon lists and how much we scan
the file lists. Fancy policy changes will be in separate patches.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
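
As an illustration of the split described above (this is not part of the
patch), here is a minimal stand-alone sketch of the classification rule:
anonymous, swap backed and tmpfs pages go onto the "anon" LRUs, while page
cache backed by a real filesystem goes onto the "file" LRUs.  In the patch
itself the decision is made by page_file_cache() in mm_inline.h; the
classify() predicate below is a simplified stand-in, not kernel code.

#include <stdio.h>

enum lru_set { LRU_SET_ANON, LRU_SET_FILE };

/* simplified classification: tmpfs counts as swap backed */
static enum lru_set classify(int anonymous, int swap_backed)
{
	if (anonymous || swap_backed)
		return LRU_SET_ANON;
	return LRU_SET_FILE;
}

int main(void)
{
	printf("anonymous page       -> %s\n",
	       classify(1, 0) == LRU_SET_FILE ? "file LRUs" : "anon LRUs");
	printf("tmpfs page           -> %s\n",
	       classify(0, 1) == LRU_SET_FILE ? "file LRUs" : "anon LRUs");
	printf("ext3 page cache page -> %s\n",
	       classify(0, 0) == LRU_SET_FILE ? "file LRUs" : "anon LRUs");
	return 0;
}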

Index: linux-2.6.24-rc3-mm2/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/fs/proc/proc_misc.c
+++ linux-2.6.24-rc3-mm2/fs/proc/proc_misc.c
@@ -153,43 +153,47 @@ static int meminfo_read_proc(char *page,
 	 * Tagged format, for easy grepping and expansion.
 	 */
 	len = sprintf(page,
-		"MemTotal:     %8lu kB\n"
-		"MemFree:      %8lu kB\n"
-		"Buffers:      %8lu kB\n"
-		"Cached:       %8lu kB\n"
-		"SwapCached:   %8lu kB\n"
-		"Active:       %8lu kB\n"
-		"Inactive:     %8lu kB\n"
+		"MemTotal:       %8lu kB\n"
+		"MemFree:        %8lu kB\n"
+		"Buffers:        %8lu kB\n"
+		"Cached:         %8lu kB\n"
+		"SwapCached:     %8lu kB\n"
+		"Active(anon):   %8lu kB\n"
+		"Inactive(anon): %8lu kB\n"
+		"Active(file):   %8lu kB\n"
+		"Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		"HighTotal:    %8lu kB\n"
-		"HighFree:     %8lu kB\n"
-		"LowTotal:     %8lu kB\n"
-		"LowFree:      %8lu kB\n"
-#endif
-		"SwapTotal:    %8lu kB\n"
-		"SwapFree:     %8lu kB\n"
-		"Dirty:        %8lu kB\n"
-		"Writeback:    %8lu kB\n"
-		"AnonPages:    %8lu kB\n"
-		"Mapped:       %8lu kB\n"
-		"Slab:         %8lu kB\n"
-		"SReclaimable: %8lu kB\n"
-		"SUnreclaim:   %8lu kB\n"
-		"PageTables:   %8lu kB\n"
-		"NFS_Unstable: %8lu kB\n"
-		"Bounce:       %8lu kB\n"
-		"CommitLimit:  %8lu kB\n"
-		"Committed_AS: %8lu kB\n"
-		"VmallocTotal: %8lu kB\n"
-		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"HighTotal:      %8lu kB\n"
+		"HighFree:       %8lu kB\n"
+		"LowTotal:       %8lu kB\n"
+		"LowFree:        %8lu kB\n"
+#endif
+		"SwapTotal:      %8lu kB\n"
+		"SwapFree:       %8lu kB\n"
+		"Dirty:          %8lu kB\n"
+		"Writeback:      %8lu kB\n"
+		"AnonPages:      %8lu kB\n"
+		"Mapped:         %8lu kB\n"
+		"Slab:           %8lu kB\n"
+		"SReclaimable:   %8lu kB\n"
+		"SUnreclaim:     %8lu kB\n"
+		"PageTables:     %8lu kB\n"
+		"NFS_Unstable:   %8lu kB\n"
+		"Bounce:         %8lu kB\n"
+		"CommitLimit:    %8lu kB\n"
+		"Committed_AS:   %8lu kB\n"
+		"VmallocTotal:   %8lu kB\n"
+		"VmallocUsed:    %8lu kB\n"
+		"VmallocChunk:   %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
 		K(cached),
 		K(total_swapcache_pages),
-		K(global_page_state(NR_ACTIVE)),
-		K(global_page_state(NR_INACTIVE)),
+		K(global_page_state(NR_ACTIVE_ANON)),
+		K(global_page_state(NR_INACTIVE_ANON)),
+		K(global_page_state(NR_ACTIVE_FILE)),
+		K(global_page_state(NR_INACTIVE_FILE)),
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),
Index: linux-2.6.24-rc3-mm2/fs/cifs/file.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/fs/cifs/file.c
+++ linux-2.6.24-rc3-mm2/fs/cifs/file.c
@@ -1783,7 +1783,7 @@ static void cifs_copy_cache_pages(struct
 		SetPageUptodate(page);
 		unlock_page(page);
 		if (!pagevec_add(plru_pvec, page))
-			__pagevec_lru_add(plru_pvec);
+			__pagevec_lru_add_file(plru_pvec);
 		data += PAGE_CACHE_SIZE;
 	}
 	return;
@@ -1921,7 +1921,7 @@ static int cifs_readpages(struct file *f
 		bytes_read = 0;
 	}
 
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 
 /* need to free smb_read_data buf before exit */
 	if (smb_read_data) {
Index: linux-2.6.24-rc3-mm2/fs/ntfs/file.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/fs/ntfs/file.c
+++ linux-2.6.24-rc3-mm2/fs/ntfs/file.c
@@ -439,7 +439,7 @@ static inline int __ntfs_grab_cache_page
 			pages[nr] = *cached_page;
 			page_cache_get(*cached_page);
 			if (unlikely(!pagevec_add(lru_pvec, *cached_page)))
-				__pagevec_lru_add(lru_pvec);
+				__pagevec_lru_add_file(lru_pvec);
 			*cached_page = NULL;
 		}
 		index++;
@@ -2084,7 +2084,7 @@ err_out:
 						OSYNC_METADATA|OSYNC_DATA);
 		}
   	}
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 	ntfs_debug("Done.  Returning %s (written 0x%lx, status %li).",
 			written ? "written" : "status", (unsigned long)written,
 			(long)status);
Index: linux-2.6.24-rc3-mm2/fs/nfs/dir.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/fs/nfs/dir.c
+++ linux-2.6.24-rc3-mm2/fs/nfs/dir.c
@@ -1497,7 +1497,7 @@ static int nfs_symlink(struct inode *dir
 	if (!add_to_page_cache(page, dentry->d_inode->i_mapping, 0,
 							GFP_KERNEL)) {
 		pagevec_add(&lru_pvec, page);
-		pagevec_lru_add(&lru_pvec);
+		pagevec_lru_add_file(&lru_pvec);
 		SetPageUptodate(page);
 		unlock_page(page);
 	} else
Index: linux-2.6.24-rc3-mm2/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/fs/ramfs/file-nommu.c
+++ linux-2.6.24-rc3-mm2/fs/ramfs/file-nommu.c
@@ -111,12 +111,12 @@ static int ramfs_nommu_expand_for_mappin
 			goto add_error;
 
 		if (!pagevec_add(&lru_pvec, page))
-			__pagevec_lru_add(&lru_pvec);
+			__pagevec_lru_add_file(&lru_pvec);
 
 		unlock_page(page);
 	}
 
-	pagevec_lru_add(&lru_pvec);
+	pagevec_lru_add_file(&lru_pvec);
 	return 0;
 
  fsize_exceeded:
Index: linux-2.6.24-rc3-mm2/drivers/base/node.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/drivers/base/node.c
+++ linux-2.6.24-rc3-mm2/drivers/base/node.c
@@ -45,33 +45,37 @@ static ssize_t node_read_meminfo(struct 
 	si_meminfo_node(&i, nid);
 
 	n = sprintf(buf, "\n"
-		       "Node %d MemTotal:     %8lu kB\n"
-		       "Node %d MemFree:      %8lu kB\n"
-		       "Node %d MemUsed:      %8lu kB\n"
-		       "Node %d Active:       %8lu kB\n"
-		       "Node %d Inactive:     %8lu kB\n"
+		       "Node %d MemTotal:       %8lu kB\n"
+		       "Node %d MemFree:        %8lu kB\n"
+		       "Node %d MemUsed:        %8lu kB\n"
+		       "Node %d Active(anon):   %8lu kB\n"
+		       "Node %d Inactive(anon): %8lu kB\n"
+		       "Node %d Active(file):   %8lu kB\n"
+		       "Node %d Inactive(file): %8lu kB\n"
 #ifdef CONFIG_HIGHMEM
-		       "Node %d HighTotal:    %8lu kB\n"
-		       "Node %d HighFree:     %8lu kB\n"
-		       "Node %d LowTotal:     %8lu kB\n"
-		       "Node %d LowFree:      %8lu kB\n"
+		       "Node %d HighTotal:      %8lu kB\n"
+		       "Node %d HighFree:       %8lu kB\n"
+		       "Node %d LowTotal:       %8lu kB\n"
+		       "Node %d LowFree:        %8lu kB\n"
 #endif
-		       "Node %d Dirty:        %8lu kB\n"
-		       "Node %d Writeback:    %8lu kB\n"
-		       "Node %d FilePages:    %8lu kB\n"
-		       "Node %d Mapped:       %8lu kB\n"
-		       "Node %d AnonPages:    %8lu kB\n"
-		       "Node %d PageTables:   %8lu kB\n"
-		       "Node %d NFS_Unstable: %8lu kB\n"
-		       "Node %d Bounce:       %8lu kB\n"
-		       "Node %d Slab:         %8lu kB\n"
-		       "Node %d SReclaimable: %8lu kB\n"
-		       "Node %d SUnreclaim:   %8lu kB\n",
+		       "Node %d Dirty:          %8lu kB\n"
+		       "Node %d Writeback:      %8lu kB\n"
+		       "Node %d FilePages:      %8lu kB\n"
+		       "Node %d Mapped:         %8lu kB\n"
+		       "Node %d AnonPages:      %8lu kB\n"
+		       "Node %d PageTables:     %8lu kB\n"
+		       "Node %d NFS_Unstable:   %8lu kB\n"
+		       "Node %d Bounce:         %8lu kB\n"
+		       "Node %d Slab:           %8lu kB\n"
+		       "Node %d SReclaimable:   %8lu kB\n"
+		       "Node %d SUnreclaim:     %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, node_page_state(nid, NR_ACTIVE),
-		       nid, node_page_state(nid, NR_INACTIVE),
+		       nid, node_page_state(nid, NR_ACTIVE_ANON),
+		       nid, node_page_state(nid, NR_INACTIVE_ANON),
+		       nid, node_page_state(nid, NR_ACTIVE_FILE),
+		       nid, node_page_state(nid, NR_INACTIVE_FILE),
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
 		       nid, K(i.freehigh),
Index: linux-2.6.24-rc3-mm2/mm/memory.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/memory.c
+++ linux-2.6.24-rc3-mm2/mm/memory.c
@@ -1665,7 +1665,7 @@ gotten:
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		SetPageSwapBacked(new_page);
-		lru_cache_add_active(new_page);
+		lru_cache_add_active_anon(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
 		/* Free the old page.. */
@@ -2134,7 +2134,7 @@ static int do_anonymous_page(struct mm_s
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	SetPageSwapBacked(page);
-	lru_cache_add_active(page);
+	lru_cache_add_active_anon(page);
 	page_add_new_anon_rmap(page, vma, address);
 	set_pte_at(mm, address, page_table, entry);
 
@@ -2288,7 +2288,7 @@ static int __do_fault(struct mm_struct *
 		if (anon) {
                         inc_mm_counter(mm, anon_rss);
 			SetPageSwapBacked(page);
-                        lru_cache_add_active(page);
+                        lru_cache_add_active_anon(page);
                         page_add_new_anon_rmap(page, vma, address);
 		} else {
 			inc_mm_counter(mm, file_rss);
Index: linux-2.6.24-rc3-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/page_alloc.c
+++ linux-2.6.24-rc3-mm2/mm/page_alloc.c
@@ -1879,10 +1879,13 @@ void show_free_areas(void)
 		}
 	}
 
-	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+	printk("Active_anon:%lu active_file:%lu inactive_anon:%lu\n"
+		" inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
-		global_page_state(NR_ACTIVE),
-		global_page_state(NR_INACTIVE),
+		global_page_state(NR_ACTIVE_ANON),
+		global_page_state(NR_ACTIVE_FILE),
+		global_page_state(NR_INACTIVE_ANON),
+		global_page_state(NR_INACTIVE_FILE),
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1905,8 +1908,10 @@ void show_free_areas(void)
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
-			" active:%lukB"
-			" inactive:%lukB"
+			" active_anon:%lukB"
+			" inactive_anon:%lukB"
+			" active_file:%lukB"
+			" inactive_file:%lukB"
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1916,8 +1921,10 @@ void show_free_areas(void)
 			K(zone->pages_min),
 			K(zone->pages_low),
 			K(zone->pages_high),
-			K(zone_page_state(zone, NR_ACTIVE)),
-			K(zone_page_state(zone, NR_INACTIVE)),
+			K(zone_page_state(zone, NR_ACTIVE_ANON)),
+			K(zone_page_state(zone, NR_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ACTIVE_FILE)),
+			K(zone_page_state(zone, NR_INACTIVE_FILE)),
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone_is_all_unreclaimable(zone) ? "yes" : "no")
@@ -3458,6 +3465,9 @@ static void __meminit free_area_init_cor
 			INIT_LIST_HEAD(&zone->list[l]);
 			zone->nr_scan[l] = 0;
 		}
+		zone->recent_rotated_anon = 0;
+		zone->recent_rotated_file = 0;
+//TODO recent_scanned_* ???
 		zap_zone_vm_stats(zone);
 		zone->flags = 0;
 		if (!size)
Index: linux-2.6.24-rc3-mm2/mm/swap.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/swap.c
+++ linux-2.6.24-rc3-mm2/mm/swap.c
@@ -34,8 +34,10 @@
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
-static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
-static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_file_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_active_file_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_anon_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_active_anon_pvecs) = { 0, };
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs) = { 0, };
 
 /*
@@ -118,7 +120,13 @@ static void pagevec_move_tail(struct pag
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page)) {
-			list_move_tail(&page->lru, &zone->list[LRU_INACTIVE]);
+			if (page_file_cache(page)) {
+				list_move_tail(&page->lru,
+						&zone->list[LRU_INACTIVE_FILE]);
+			} else {
+				list_move_tail(&page->lru,
+						&zone->list[LRU_INACTIVE_ANON]);
+			}
 			pgmoved++;
 		}
 	}
@@ -172,9 +180,13 @@ void fastcall activate_page(struct page 
 
 	spin_lock_irq(&zone->lru_lock);
 	if (PageLRU(page) && !PageActive(page)) {
-		del_page_from_inactive_list(zone, page);
+		int lru = LRU_BASE;
+		lru += page_file_cache(page);
+		del_page_from_lru_list(zone, page, lru);
+
 		SetPageActive(page);
-		add_page_to_active_list(zone, page);
+		lru += LRU_ACTIVE;
+		add_page_to_lru_list(zone, page, lru);
 		__count_vm_event(PGACTIVATE);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
 	}
@@ -204,26 +216,46 @@ EXPORT_SYMBOL(mark_page_accessed);
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
  */
-void fastcall lru_cache_add(struct page *page)
+void fastcall lru_cache_add_anon(struct page *page)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+	struct pagevec *pvec = &get_cpu_var(lru_add_anon_pvecs);
 
 	page_cache_get(page);
 	if (!pagevec_add(pvec, page))
-		__pagevec_lru_add(pvec);
-	put_cpu_var(lru_add_pvecs);
+		__pagevec_lru_add_anon(pvec);
+	put_cpu_var(lru_add_anon_pvecs);
 }
 
-void fastcall lru_cache_add_active(struct page *page)
+void fastcall lru_cache_add_file(struct page *page)
 {
-	struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs);
+	struct pagevec *pvec = &get_cpu_var(lru_add_file_pvecs);
 
 	page_cache_get(page);
 	if (!pagevec_add(pvec, page))
-		__pagevec_lru_add_active(pvec);
+		__pagevec_lru_add_file(pvec);
+	put_cpu_var(lru_add_file_pvecs);
+}
+
+void fastcall lru_cache_add_active_anon(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_active_anon_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_active_anon(pvec);
-	put_cpu_var(lru_add_active_pvecs);
+	put_cpu_var(lru_add_active_anon_pvecs);
 }
 
+void fastcall lru_cache_add_active_file(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_active_file_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_active_file(pvec);
+	put_cpu_var(lru_add_active_file_pvecs);
+}
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -233,13 +265,21 @@ static void drain_cpu_pagevecs(int cpu)
 {
 	struct pagevec *pvec;
 
-	pvec = &per_cpu(lru_add_pvecs, cpu);
+	pvec = &per_cpu(lru_add_file_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_file(pvec);
+
+	pvec = &per_cpu(lru_add_anon_pvecs, cpu);
 	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add_anon(pvec);
 
-	pvec = &per_cpu(lru_add_active_pvecs, cpu);
+	pvec = &per_cpu(lru_add_active_file_pvecs, cpu);
 	if (pagevec_count(pvec))
-		__pagevec_lru_add_active(pvec);
+		__pagevec_lru_add_active_file(pvec);
+
+	pvec = &per_cpu(lru_add_active_anon_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_active_anon(pvec);
 
 	pvec = &per_cpu(lru_rotate_pvecs, cpu);
 	if (pagevec_count(pvec)) {
@@ -393,7 +433,7 @@ void __pagevec_release_nonlru(struct pag
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
  */
-void __pagevec_lru_add(struct pagevec *pvec)
+void __pagevec_lru_add_file(struct pagevec *pvec)
 {
 	int i;
 	struct zone *zone = NULL;
@@ -410,7 +450,7 @@ void __pagevec_lru_add(struct pagevec *p
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
-		add_page_to_inactive_list(zone, page);
+		add_page_to_inactive_file_list(zone, page);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
@@ -418,9 +458,60 @@ void __pagevec_lru_add(struct pagevec *p
 	pagevec_reinit(pvec);
 }
 
-EXPORT_SYMBOL(__pagevec_lru_add);
+EXPORT_SYMBOL(__pagevec_lru_add_file);
+void __pagevec_lru_add_active_file(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(PageActive(page));
+		SetPageActive(page);
+		add_page_to_active_file_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+void __pagevec_lru_add_anon(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		add_page_to_inactive_anon_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
 
-void __pagevec_lru_add_active(struct pagevec *pvec)
+void __pagevec_lru_add_active_anon(struct pagevec *pvec)
 {
 	int i;
 	struct zone *zone = NULL;
@@ -439,7 +530,7 @@ void __pagevec_lru_add_active(struct pag
 		SetPageLRU(page);
 		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
-		add_page_to_active_list(zone, page);
+		add_page_to_active_anon_list(zone, page);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
Index: linux-2.6.24-rc3-mm2/mm/migrate.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/migrate.c
+++ linux-2.6.24-rc3-mm2/mm/migrate.c
@@ -60,9 +60,15 @@ static inline void move_to_lru(struct pa
 		 * the PG_active bit is off.
 		 */
 		ClearPageActive(page);
-		lru_cache_add_active(page);
+		if (page_file_cache(page))
+			lru_cache_add_active_file(page);
+		else
+			lru_cache_add_active_anon(page);
 	} else {
-		lru_cache_add(page);
+		if (page_file_cache(page))
+			lru_cache_add_file(page);
+		else
+			lru_cache_add_anon(page);
 	}
 	put_page(page);
 }
Index: linux-2.6.24-rc3-mm2/mm/readahead.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/readahead.c
+++ linux-2.6.24-rc3-mm2/mm/readahead.c
@@ -229,7 +229,7 @@ int do_page_cache_readahead(struct addre
  */
 unsigned long max_sane_readahead(unsigned long nr)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
+	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
Index: linux-2.6.24-rc3-mm2/mm/filemap.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/filemap.c
+++ linux-2.6.24-rc3-mm2/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/mm_inline.h> /* for page_file_cache() */
 #include "internal.h"
 
 /*
@@ -481,8 +482,12 @@ int add_to_page_cache_lru(struct page *p
 				pgoff_t offset, gfp_t gfp_mask)
 {
 	int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
-	if (ret == 0)
-		lru_cache_add(page);
+	if (ret == 0) {
+		if (page_file_cache(page))
+			lru_cache_add_file(page);
+		else
+			lru_cache_add_active_anon(page);
+	}
 	return ret;
 }
 
Index: linux-2.6.24-rc3-mm2/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmstat.c
+++ linux-2.6.24-rc3-mm2/mm/vmstat.c
@@ -693,8 +693,10 @@ const struct seq_operations pagetypeinfo
 static const char * const vmstat_text[] = {
 	/* Zoned VM counters */
 	"nr_free_pages",
-	"nr_inactive",
-	"nr_active",
+	"nr_inactive_anon",
+	"nr_active_anon",
+	"nr_inactive_file",
+	"nr_active_file",
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
@@ -757,7 +759,7 @@ static void zoneinfo_show_print(struct s
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n        scanned  %lu (a: %lu i: %lu)"
+		   "\n        scanned  %lu (aa: %lu ia: %lu af: %lu if: %lu)"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
@@ -765,8 +767,10 @@ static void zoneinfo_show_print(struct s
 		   zone->pages_low,
 		   zone->pages_high,
 		   zone->pages_scanned,
-		   zone->nr_scan[LRU_ACTIVE],
-		   zone->nr_scan[LRU_INACTIVE],
+		   zone->nr_scan[LRU_ACTIVE_ANON],
+		   zone->nr_scan[LRU_INACTIVE_ANON],
+		   zone->nr_scan[LRU_ACTIVE_FILE],
+		   zone->nr_scan[LRU_INACTIVE_FILE],
 		   zone->spanned_pages,
 		   zone->present_pages);
 
Index: linux-2.6.24-rc3-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmscan.c
+++ linux-2.6.24-rc3-mm2/mm/vmscan.c
@@ -71,6 +71,9 @@ struct scan_control {
 
 	int order;
 
+	/* The number of pages moved to the active list this pass. */
+	int activated;
+
 	/*
 	 * Pages that have (or should have) IO pending.  If we run into
 	 * a lot of these, we're better off waiting a little for IO to
@@ -85,7 +88,7 @@ struct scan_control {
 	unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
 			unsigned long *scanned, int order, int mode,
 			struct zone *z, struct mem_cgroup *mem_cont,
-			int active);
+			int active, int file);
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -243,27 +246,6 @@ unsigned long shrink_slab(unsigned long 
 	return ret;
 }
 
-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
-	struct address_space *mapping;
-
-	/* Page is in somebody's page tables. */
-	if (page_mapped(page))
-		return 1;
-
-	/* Be more reluctant to reclaim swapcache than pagecache */
-	if (PageSwapCache(page))
-		return 1;
-
-	mapping = page_mapping(page);
-	if (!mapping)
-		return 0;
-
-	/* File is mmap'd by somebody? */
-	return mapping_mapped(mapping);
-}
-
 static inline int is_page_cache_freeable(struct page *page)
 {
 	return page_count(page) - !!PagePrivate(page) == 2;
@@ -527,8 +509,7 @@ static unsigned long shrink_page_list(st
 
 		referenced = page_referenced(page, 1, sc->mem_cgroup);
 		/* In active use or really unfreeable?  Activate it. */
-		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
-					referenced && page_mapping_inuse(page))
+		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
 			goto activate_locked;
 
 #ifdef CONFIG_SWAP
@@ -559,8 +540,6 @@ static unsigned long shrink_page_list(st
 		}
 
 		if (PageDirty(page)) {
-			if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
-				goto keep_locked;
 			if (!may_enter_fs) {
 				sc->nr_io_pages++;
 				goto keep_locked;
@@ -647,6 +626,7 @@ keep:
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
+	sc->activated = pgactivate;
 	return nr_reclaimed;
 }
 
@@ -665,7 +645,7 @@ keep:
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, int mode)
+int __isolate_lru_page(struct page *page, int mode, int file)
 {
 	int ret = -EINVAL;
 
@@ -681,6 +661,9 @@ int __isolate_lru_page(struct page *page
 	if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
 		return ret;
 
+	if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
+		return ret;
+
 	ret = -EBUSY;
 	if (likely(get_page_unless_zero(page))) {
 		/*
@@ -711,12 +694,13 @@ int __isolate_lru_page(struct page *page
  * @scanned:	The number of pages that were scanned.
  * @order:	The caller's attempted allocation order
  * @mode:	One of the LRU isolation modes
+ * @file:	True [1] if isolating file [!anon] pages
  *
  * returns how many pages were moved onto *@dst.
  */
 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		struct list_head *src, struct list_head *dst,
-		unsigned long *scanned, int order, int mode)
+		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
 	unsigned long scan;
@@ -733,7 +717,7 @@ static unsigned long isolate_lru_pages(u
 
 		VM_BUG_ON(!PageLRU(page));
 
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page(page, mode, file)) {
 		case 0:
 			list_move(&page->lru, dst);
 			nr_taken++;
@@ -776,10 +760,11 @@ static unsigned long isolate_lru_pages(u
 				break;
 
 			cursor_page = pfn_to_page(pfn);
+
 			/* Check that we have not crossed a zone boundary. */
 			if (unlikely(page_zone_id(cursor_page) != zone_id))
 				continue;
-			switch (__isolate_lru_page(cursor_page, mode)) {
+			switch (__isolate_lru_page(cursor_page, mode, file)) {
 			case 0:
 				list_move(&cursor_page->lru, dst);
 				nr_taken++;
@@ -804,30 +789,37 @@ static unsigned long isolate_pages_globa
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active)
+					int active, int file)
 {
+	int lru = LRU_BASE;
 	if (active)
-		return isolate_lru_pages(nr, &z->list[LRU_ACTIVE], dst,
-						scanned, order, mode);
-	else
-		return isolate_lru_pages(nr, &z->list[LRU_INACTIVE], dst,
-						scanned, order, mode);
+		lru += LRU_ACTIVE;
+	if (file)
+		lru += LRU_FILE;
+	return isolate_lru_pages(nr, &z->list[lru], dst, scanned, order,
+								mode, !!file);
 }
 
 /*
  * clear_active_flags() is a helper for shrink_active_list(), clearing
  * any active bits from the pages in the list.
  */
-static unsigned long clear_active_flags(struct list_head *page_list)
+static unsigned long clear_active_flags(struct list_head *page_list,
+					unsigned int *count)
 {
 	int nr_active = 0;
+	int lru;
 	struct page *page;
 
-	list_for_each_entry(page, page_list, lru)
+	list_for_each_entry(page, page_list, lru) {
+		lru = page_file_cache(page);
 		if (PageActive(page)) {
+			lru += LRU_ACTIVE;
 			ClearPageActive(page);
 			nr_active++;
 		}
+		count[lru]++;
+	}
 
 	return nr_active;
 }
@@ -861,12 +853,12 @@ int isolate_lru_page(struct page *page)
 
 		spin_lock_irq(&zone->lru_lock);
 		if (PageLRU(page) && get_page_unless_zero(page)) {
+			int lru = LRU_BASE;
 			ret = 0;
 			ClearPageLRU(page);
-			if (PageActive(page))
-				del_page_from_active_list(zone, page);
-			else
-				del_page_from_inactive_list(zone, page);
+
+			lru += page_file_cache(page) + !!PageActive(page);
+			del_page_from_lru_list(zone, page, lru);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -878,7 +870,7 @@ int isolate_lru_page(struct page *page)
  * of reclaimed pages
  */
 static unsigned long shrink_inactive_list(unsigned long max_scan,
-				struct zone *zone, struct scan_control *sc)
+			struct zone *zone, struct scan_control *sc, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -895,18 +887,25 @@ static unsigned long shrink_inactive_lis
 		unsigned long nr_scan;
 		unsigned long nr_freed;
 		unsigned long nr_active;
+		unsigned int count[NR_LRU_LISTS] = { 0, };
+		int mode = (sc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+					ISOLATE_BOTH : ISOLATE_INACTIVE;
 
 		nr_taken = sc->isolate_pages(sc->swap_cluster_max,
-			     &page_list, &nr_scan, sc->order,
-			     (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
-					     ISOLATE_BOTH : ISOLATE_INACTIVE,
-				zone, sc->mem_cgroup, 0);
-		nr_active = clear_active_flags(&page_list);
+			     &page_list, &nr_scan, sc->order, mode,
+				zone, sc->mem_cgroup, 0, file);
+		nr_active = clear_active_flags(&page_list, count);
 		__count_vm_events(PGDEACTIVATE, nr_active);
 
-		__mod_zone_page_state(zone, NR_ACTIVE, -nr_active);
-		__mod_zone_page_state(zone, NR_INACTIVE,
-						-(nr_taken - nr_active));
+		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+						-count[LRU_ACTIVE_FILE]);
+		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+						-count[LRU_INACTIVE_FILE]);
+		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+						-count[LRU_ACTIVE_ANON]);
+		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+						-count[LRU_INACTIVE_ANON]);
+
 		if (scan_global_lru(sc))
 			zone->pages_scanned += nr_scan;
 		spin_unlock_irq(&zone->lru_lock);
@@ -928,7 +927,7 @@ static unsigned long shrink_inactive_lis
 			 * The attempt at page out may have made some
 			 * of the pages active, mark them inactive again.
 			 */
-			nr_active = clear_active_flags(&page_list);
+			nr_active = clear_active_flags(&page_list, count);
 			count_vm_events(PGDEACTIVATE, nr_active);
 
 			nr_freed += shrink_page_list(&page_list, sc,
@@ -952,11 +951,20 @@ static unsigned long shrink_inactive_lis
 		 * Put back any unfreeable pages.
 		 */
 		while (!list_empty(&page_list)) {
+			int lru = LRU_BASE;
 			page = lru_to_page(&page_list);
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			add_page_to_lru_list(zone, page, PageActive(page));
+			if (page_file_cache(page)) {
+				lru += LRU_FILE;
+				zone->recent_rotated_file++;
+			} else {
+				zone->recent_rotated_anon++;
+			}
+			if (PageActive(page))
+				lru += LRU_ACTIVE;
+			add_page_to_lru_list(zone, page, lru);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
 				__pagevec_release(&pvec);
@@ -987,115 +995,7 @@ static inline void note_zone_scanning_pr
 
 static inline int zone_is_near_oom(struct zone *zone)
 {
-	return zone->pages_scanned >= (zone_page_state(zone, NR_ACTIVE)
-				+ zone_page_state(zone, NR_INACTIVE))*3;
-}
-
-/*
- * Determine we should try to reclaim mapped pages.
- * This is called only when sc->mem_cgroup is NULL.
- */
-static int calc_reclaim_mapped(struct scan_control *sc, struct zone *zone,
-				int priority)
-{
-	long mapped_ratio;
-	long distress;
-	long swap_tendency;
-	long imbalance;
-	int reclaim_mapped = 0;
-	int prev_priority;
-
-	if (scan_global_lru(sc) && zone_is_near_oom(zone))
-		return 1;
-	/*
-	 * `distress' is a measure of how much trouble we're having
-	 * reclaiming pages.  0 -> no problems.  100 -> great trouble.
-	 */
-	if (scan_global_lru(sc))
-		prev_priority = zone->prev_priority;
-	else
-		prev_priority = mem_cgroup_get_reclaim_priority(sc->mem_cgroup);
-
-	distress = 100 >> min(prev_priority, priority);
-
-	/*
-	 * The point of this algorithm is to decide when to start
-	 * reclaiming mapped memory instead of just pagecache.  Work out
-	 * how much memory
-	 * is mapped.
-	 */
-	if (scan_global_lru(sc))
-		mapped_ratio = ((global_page_state(NR_FILE_MAPPED) +
-				global_page_state(NR_ANON_PAGES)) * 100) /
-					vm_total_pages;
-	else
-		mapped_ratio = mem_cgroup_calc_mapped_ratio(sc->mem_cgroup);
-
-	/*
-	 * Now decide how much we really want to unmap some pages.  The
-	 * mapped ratio is downgraded - just because there's a lot of
-	 * mapped memory doesn't necessarily mean that page reclaim
-	 * isn't succeeding.
-	 *
-	 * The distress ratio is important - we don't want to start
-	 * going oom.
-	 *
-	 * A 100% value of vm_swappiness overrides this algorithm
-	 * altogether.
-	 */
-	swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
-
-	/*
-	 * If there's huge imbalance between active and inactive
-	 * (think active 100 times larger than inactive) we should
-	 * become more permissive, or the system will take too much
-	 * cpu before it start swapping during memory pressure.
-	 * Distress is about avoiding early-oom, this is about
-	 * making swappiness graceful despite setting it to low
-	 * values.
-	 *
-	 * Avoid div by zero with nr_inactive+1, and max resulting
-	 * value is vm_total_pages.
-	 */
-	if (scan_global_lru(sc)) {
-		imbalance  = zone_page_state(zone, NR_ACTIVE);
-		imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;
-	} else
-		imbalance = mem_cgroup_reclaim_imbalance(sc->mem_cgroup);
-
-	/*
-	 * Reduce the effect of imbalance if swappiness is low,
-	 * this means for a swappiness very low, the imbalance
-	 * must be much higher than 100 for this logic to make
-	 * the difference.
-	 *
-	 * Max temporary value is vm_total_pages*100.
-	 */
-	imbalance *= (vm_swappiness + 1);
-	imbalance /= 100;
-
-	/*
-	 * If not much of the ram is mapped, makes the imbalance
-	 * less relevant, it's high priority we refill the inactive
-	 * list with mapped pages only in presence of high ratio of
-	 * mapped pages.
-	 *
-	 * Max temporary value is vm_total_pages*100.
-	 */
-	imbalance *= mapped_ratio;
-	imbalance /= 100;
-
-	/* apply imbalance feedback to swap_tendency */
-	swap_tendency += imbalance;
-
-	/*
-	 * Now use this metric to decide whether to start moving mapped
-	 * memory onto the inactive list.
-	 */
-	if (swap_tendency >= 100)
-		reclaim_mapped = 1;
-
-	return reclaim_mapped;
+	return zone->pages_scanned >= (zone_lru_pages(zone) * 3);
 }
 
 /*
@@ -1115,10 +1015,8 @@ static int calc_reclaim_mapped(struct sc
  * The downside is that we have to touch page->_count against each page.
  * But we had to alter page->flags anyway.
  */
-
-
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
-				struct scan_control *sc, int priority)
+				struct scan_control *sc, int priority, int file)
 {
 	unsigned long pgmoved;
 	int pgdeactivate = 0;
@@ -1127,64 +1025,65 @@ static void shrink_active_list(unsigned 
 	struct list_head list[NR_LRU_LISTS];
 	struct page *page;
 	struct pagevec pvec;
-	int reclaim_mapped = 0;
-	enum lru_list l;
+	enum lru_list lru;
 
-	for_each_lru(l)
-		INIT_LIST_HEAD(&list[l]);
-
-	if (sc->may_swap)
-		reclaim_mapped = calc_reclaim_mapped(sc, zone, priority);
+	for_each_lru(lru)
+		INIT_LIST_HEAD(&list[lru]);
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 	pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
 					ISOLATE_ACTIVE, zone,
-					sc->mem_cgroup, 1);
+					sc->mem_cgroup, 1, file);
 	/*
 	 * zone->pages_scanned is used for detect zone's oom
 	 * mem_cgroup remembers nr_scan by itself.
 	 */
 	if (scan_global_lru(sc))
 		zone->pages_scanned += pgscanned;
-
-	__mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
+	if (file)
+		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
+	else
+		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
 	spin_unlock_irq(&zone->lru_lock);
 
+	/*
+	 * For sorting active vs inactive pages, we'll use the 'anon'
+	 * elements of the local list[] array and sort out the file vs
+	 * anon pages below.
+	 */
 	while (!list_empty(&l_hold)) {
+		lru = LRU_INACTIVE_ANON;
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
-		if (page_mapped(page)) {
-			if (!reclaim_mapped ||
-			    (total_swap_pages == 0 && PageAnon(page)) ||
-			    page_referenced(page, 0, sc->mem_cgroup)) {
-				list_add(&page->lru, &list[LRU_ACTIVE]);
-				continue;
-			}
-		} else if (TestClearPageReferenced(page)) {
-			list_add(&page->lru, &list[LRU_ACTIVE]);
-			continue;
-		}
-		list_add(&page->lru, &list[LRU_INACTIVE]);
+		if (page_referenced(page, 0, sc->mem_cgroup))
+			lru = LRU_ACTIVE_ANON;
+		list_add(&page->lru, &list[lru]);
 	}
 
+	/*
+	 * Now put the pages back to the appropriate [file or anon] inactive
+	 * and active lists.
+	 */
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
+	lru = LRU_BASE + file * LRU_FILE;
 	spin_lock_irq(&zone->lru_lock);
-	while (!list_empty(&list[LRU_INACTIVE])) {
-		page = lru_to_page(&list[LRU_INACTIVE]);
-		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE], flags);
+	while (!list_empty(&list[LRU_INACTIVE_ANON])) {
+		page = lru_to_page(&list[LRU_INACTIVE_ANON]);
+		prefetchw_prev_lru_page(page, &list[LRU_INACTIVE_ANON], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		ClearPageActive(page);
 
-		list_move(&page->lru, &zone->list[LRU_INACTIVE]);
+		list_move(&page->lru, &zone->list[lru]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), false);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
-			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+			__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+								pgmoved);
 			spin_unlock_irq(&zone->lru_lock);
 			pgdeactivate += pgmoved;
 			pgmoved = 0;
@@ -1194,7 +1093,7 @@ static void shrink_active_list(unsigned 
 			spin_lock_irq(&zone->lru_lock);
 		}
 	}
-	__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
 	pgdeactivate += pgmoved;
 	if (buffer_heads_over_limit) {
 		spin_unlock_irq(&zone->lru_lock);
@@ -1203,17 +1102,19 @@ static void shrink_active_list(unsigned 
 	}
 
 	pgmoved = 0;
-	while (!list_empty(&list[LRU_ACTIVE])) {
-		page = lru_to_page(&list[LRU_ACTIVE]);
-		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE], flags);
+	lru = LRU_ACTIVE + file * LRU_FILE;
+	while (!list_empty(&list[LRU_ACTIVE_ANON])) {
+		page = lru_to_page(&list[LRU_ACTIVE_ANON]);
+		prefetchw_prev_lru_page(page, &list[LRU_ACTIVE_ANON], flags);
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
-		list_move(&page->lru, &zone->list[LRU_ACTIVE]);
+		list_move(&page->lru, &zone->list[lru]);
 		mem_cgroup_move_lists(page_get_page_cgroup(page), true);
 		pgmoved++;
 		if (!pagevec_add(&pvec, page)) {
-			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+			__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru,
+								pgmoved);
 			pgmoved = 0;
 			spin_unlock_irq(&zone->lru_lock);
 			if (vm_swap_full())
@@ -1222,7 +1123,12 @@ static void shrink_active_list(unsigned 
 			spin_lock_irq(&zone->lru_lock);
 		}
 	}
-	__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON + lru, pgmoved);
+	if (file) {
+		zone->recent_rotated_file += pgmoved;
+	} else {
+		zone->recent_rotated_anon += pgmoved;
+	}
 
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
@@ -1233,17 +1139,83 @@ static void shrink_active_list(unsigned 
 	pagevec_release(&pvec);
 }
 
-static unsigned long shrink_list(enum lru_list l, unsigned long nr_to_scan,
+static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 	struct zone *zone, struct scan_control *sc, int priority)
 {
-	if (l == LRU_ACTIVE) {
-		shrink_active_list(nr_to_scan, zone, sc, priority);
+	int file = is_file_lru(lru);
+
+	if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
-	return shrink_inactive_list(nr_to_scan, zone, sc);
+	return shrink_inactive_list(nr_to_scan, zone, sc, file);
 }
 
 /*
+ * The utility of the anon and file memory corresponds to the fraction
+ * of pages that were recently referenced in each category.  Pageout
+ * pressure is distributed according to the size of each set, the fraction
+ * of recently referenced pages (except used-once file pages) and the
+ * swappiness parameter.
+ *
+ * We return the relative pressures as percentages so shrink_zone can
+ * easily use them.
+ */
+static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
+					unsigned long *percent)
+{
+	unsigned long anon, file;
+	unsigned long anon_prio, file_prio;
+	unsigned long rotate_sum;
+	unsigned long ap, fp;
+
+	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
+		zone_page_state(zone, NR_INACTIVE_ANON);
+	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
+		zone_page_state(zone, NR_INACTIVE_FILE);
+
+	rotate_sum = zone->recent_rotated_file + zone->recent_rotated_anon;
+
+	/* Keep a floating average of RECENT references. */
+	if (unlikely(rotate_sum > min(anon, file))) {
+		spin_lock_irq(&zone->lru_lock);
+		zone->recent_rotated_file /= 2;
+		zone->recent_rotated_anon /= 2;
+		spin_unlock_irq(&zone->lru_lock);
+		rotate_sum /= 2;
+	}
+
+	/*
+	 * With swappiness at 100, anonymous and file have the same priority.
+	 * This scanning priority is essentially the inverse of IO cost.
+	 */
+	anon_prio = sc->swappiness;
+	file_prio = 200 - sc->swappiness;
+
+	/*
+	 *                  anon       recent_rotated_anon
+	 * %anon = 100 * ----------- / ------------------- * IO cost
+	 *               anon + file       rotate_sum
+	 */
+	ap = (anon_prio * anon) / (anon + file + 1);
+	ap *= rotate_sum / (zone->recent_rotated_anon + 1);
+	if (ap == 0)
+		ap = 1;
+	else if (ap > 100)
+		ap = 100;
+	percent[0] = ap;
+
+	fp = (file_prio * file) / (anon + file + 1);
+	fp *= rotate_sum / (zone->recent_rotated_file + 1);
+	if (fp == 0)
+		fp = 1;
+	else if (fp > 100)
+		fp = 100;
+	percent[1] = fp;
+}
+
+
+/*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
 static unsigned long shrink_zone(int priority, struct zone *zone,
@@ -1252,17 +1224,21 @@ static unsigned long shrink_zone(int pri
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
+	unsigned long percent[2];       /* anon @ 0; file @ 1 */
 	enum lru_list l;
 
+	get_scan_ratio(zone, sc, percent);
+
 	if (scan_global_lru(sc)) {
 		/*
 		 * Add one to nr_to_scan just to make sure that the kernel
 		 * will slowly sift through the active list.
 		 */
 		for_each_lru(l) {
+			int file = is_file_lru(l);
 			zone->nr_scan[l] += (zone_page_state(zone,
-					NR_INACTIVE + l)  >> priority) + 1;
-			nr[l] = zone->nr_scan[l];
+				NR_INACTIVE_ANON + l) >> priority) + 1;
+			nr[l] = zone->nr_scan[l] * percent[file] / 100;
 			if (nr[l] >= sc->swap_cluster_max)
 				zone->nr_scan[l] = 0;
 			else
@@ -1281,7 +1257,8 @@ static unsigned long shrink_zone(int pri
 					zone, priority);
 	}
 
-	while (nr[LRU_ACTIVE] || nr[LRU_INACTIVE]) {
+	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
+			nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
 		for_each_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
@@ -1355,7 +1332,7 @@ static unsigned long shrink_zones(int pr
 
 	return nr_reclaimed;
 }
- 
+
 /*
  * This is the main entry point to direct page reclaim.
  *
@@ -1391,8 +1368,7 @@ static unsigned long do_try_to_free_page
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 
-			lru_pages += zone_page_state(zone, NR_ACTIVE)
-					+ zone_page_state(zone, NR_INACTIVE);
+			lru_pages += zone_lru_pages(zone);
 		}
 	}
 
@@ -1597,8 +1573,7 @@ loop_again:
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
 
-			lru_pages += zone_page_state(zone, NR_ACTIVE)
-					+ zone_page_state(zone, NR_INACTIVE);
+			lru_pages += zone_lru_pages(zone);
 		}
 
 		/*
@@ -1642,8 +1617,7 @@ loop_again:
 			if (zone_is_all_unreclaimable(zone))
 				continue;
 			if (nr_slab == 0 && zone->pages_scanned >=
-				(zone_page_state(zone, NR_ACTIVE)
-				+ zone_page_state(zone, NR_INACTIVE)) * 6)
+						(zone_lru_pages(zone) * 6))
 					zone_set_flag(zone,
 						      ZONE_ALL_UNRECLAIMABLE);
 			/*
@@ -1698,7 +1672,7 @@ out:
 
 /*
  * The background pageout daemon, started as a kernel thread
- * from the init process. 
+ * from the init process.
  *
  * This basically trickles out pages so that we have _some_
  * free memory available even if there is no other activity
@@ -1818,17 +1792,18 @@ static unsigned long shrink_all_zones(un
 
 		for_each_lru(l) {
 			/* For pass = 0 we don't shrink the active list */
-			if (pass == 0 && l == LRU_ACTIVE)
+			if (pass == 0 &&
+				(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
 				continue;
 
 			zone->nr_scan[l] +=
-				(zone_page_state(zone, NR_INACTIVE + l)
+				(zone_page_state(zone, NR_INACTIVE_ANON + l)
 								>> prio) + 1;
 			if (zone->nr_scan[l] >= nr_pages || pass > 3) {
 				zone->nr_scan[l] = 0;
 				nr_to_scan = min(nr_pages,
 					zone_page_state(zone,
-							NR_INACTIVE + l));
+							NR_INACTIVE_ANON + l));
 				ret += shrink_list(l, nr_to_scan, zone,
 								sc, prio);
 				if (ret >= nr_pages)
@@ -1840,9 +1815,12 @@ static unsigned long shrink_all_zones(un
 	return ret;
 }
 
-static unsigned long count_lru_pages(void)
+unsigned long global_lru_pages(void)
 {
-	return global_page_state(NR_ACTIVE) + global_page_state(NR_INACTIVE);
+	return global_page_state(NR_ACTIVE_ANON)
+		+ global_page_state(NR_ACTIVE_FILE)
+		+ global_page_state(NR_INACTIVE_ANON)
+		+ global_page_state(NR_INACTIVE_FILE);
 }
 
 /*
@@ -1870,7 +1848,7 @@ unsigned long shrink_all_memory(unsigned
 
 	current->reclaim_state = &reclaim_state;
 
-	lru_pages = count_lru_pages();
+	lru_pages = global_lru_pages();
 	nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
 	/* If slab caches are huge, it's better to hit them first */
 	while (nr_slab >= lru_pages) {
@@ -1913,7 +1891,7 @@ unsigned long shrink_all_memory(unsigned
 
 			reclaim_state.reclaimed_slab = 0;
 			shrink_slab(sc.nr_scanned, sc.gfp_mask,
-					count_lru_pages());
+					global_lru_pages());
 			ret += reclaim_state.reclaimed_slab;
 			if (ret >= nr_pages)
 				goto out;
@@ -1930,7 +1908,7 @@ unsigned long shrink_all_memory(unsigned
 	if (!ret) {
 		do {
 			reclaim_state.reclaimed_slab = 0;
-			shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
+			shrink_slab(nr_pages, sc.gfp_mask, global_lru_pages());
 			ret += reclaim_state.reclaimed_slab;
 		} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
 	}
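
To make the arithmetic in get_scan_ratio() above concrete, here is a
stand-alone sketch (not kernel code) that repeats the same integer math
outside the kernel.  The zone counters and the swappiness value are made-up
numbers chosen only for illustration; scan_ratio_sketch() is a hypothetical
helper that mirrors the ap/fp computation in the hunk above.

#include <stdio.h>

static void scan_ratio_sketch(unsigned long anon, unsigned long file,
			      unsigned long rotated_anon,
			      unsigned long rotated_file,
			      unsigned long swappiness,
			      unsigned long percent[2])
{
	unsigned long anon_prio = swappiness;
	unsigned long file_prio = 200 - swappiness;
	unsigned long rotate_sum = rotated_anon + rotated_file;
	unsigned long ap, fp;

	/* size share scaled by scanning priority, then by how often this
	 * set was recently rotated (integer division, as in the patch) */
	ap = (anon_prio * anon) / (anon + file + 1);
	ap *= rotate_sum / (rotated_anon + 1);
	percent[0] = ap == 0 ? 1 : (ap > 100 ? 100 : ap);

	fp = (file_prio * file) / (anon + file + 1);
	fp *= rotate_sum / (rotated_file + 1);
	percent[1] = fp == 0 ? 1 : (fp > 100 ? 100 : fp);
}

int main(void)
{
	unsigned long percent[2];

	/* hypothetical zone: 30000 anon pages, 70000 file pages, 1000
	 * recently rotated anon and 4000 recently rotated file pages,
	 * default swappiness of 60 */
	scan_ratio_sketch(30000, 70000, 1000, 4000, 60, percent);

	/* prints anon=68 file=97: shrink_zone() would then scan 68% of the
	 * anon scan target and 97% of the file scan target this pass */
	printf("anon=%lu file=%lu\n", percent[0], percent[1]);
	return 0;
}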
Index: linux-2.6.24-rc3-mm2/mm/swap_state.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/swap_state.c
+++ linux-2.6.24-rc3-mm2/mm/swap_state.c
@@ -360,7 +360,7 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active(new_page);
+			lru_cache_add_active_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
Index: linux-2.6.24-rc3-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.24-rc3-mm2/include/linux/mmzone.h
@@ -80,21 +80,23 @@ struct zone_padding {
 enum zone_stat_item {
 	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
-	NR_INACTIVE,	/* must match order of LRU_[IN]ACTIVE */
-	NR_ACTIVE,	/*  "     "     "   "       "         */
+	NR_INACTIVE_ANON,	/* must match order of LRU_[IN]ACTIVE_* */
+	NR_ACTIVE_ANON,		/*  "     "     "   "       "           */
+	NR_INACTIVE_FILE,	/*  "     "     "   "       "           */
+	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
-	/* Second 128 byte cacheline */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
+	/* Second 128 byte cacheline */
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
@@ -105,13 +107,32 @@ enum zone_stat_item {
 #endif
 	NR_VM_ZONE_STAT_ITEMS };
 
+/*
+ * We do arithmetic on the LRU lists in various places in the code,
+ * so it is important to keep the active lists LRU_ACTIVE higher in
+ * the array than the corresponding inactive lists, and to keep
+ * the *_FILE lists LRU_FILE higher than the corresponding _ANON lists.
+ */
+#define LRU_BASE 0
+#define LRU_ANON LRU_BASE
+#define LRU_ACTIVE 1
+#define LRU_FILE 2
+
 enum lru_list {
-	LRU_INACTIVE,	/* must match order of NR_[IN]ACTIVE */
-	LRU_ACTIVE,	/*  "     "     "   "       "        */
+	LRU_INACTIVE_ANON = LRU_BASE,
+	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
+	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
+	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	NR_LRU_LISTS };
 
 #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
 
+static inline int is_file_lru(enum lru_list l)
+{
+	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
+	return (l/2 == 1);
+}
+
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
@@ -267,6 +288,10 @@ struct zone {
 	spinlock_t		lru_lock;	
 	struct list_head	list[NR_LRU_LISTS];
 	unsigned long		nr_scan[NR_LRU_LISTS];
+
+	unsigned long		recent_rotated_anon;
+	unsigned long		recent_rotated_file;
+
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	unsigned long		flags;		   /* zone flags, see below */
 
Index: linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mm_inline.h
+++ linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
@@ -26,59 +26,84 @@ static inline int page_file_cache(struct
 	WARN_ON(mapping && mapping->a_ops && mapping->a_ops == &shmem_aops);
 
 	/* The page is page cache backed by a normal filesystem. */
-	return 2;
+	return LRU_FILE;
 }
 
 static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_add(&page->lru, &zone->list[l]);
-	__inc_zone_state(zone, NR_INACTIVE + l);
+	__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
 static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE + l);
+	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
+//TODO:  eventually these can all go away?  just use above 2 fcns?
+static inline void
+add_page_to_active_anon_list(struct zone *zone, struct page *page)
+{
+	add_page_to_lru_list(zone, page, LRU_ACTIVE_ANON);
+}
+
+static inline void
+add_page_to_inactive_anon_list(struct zone *zone, struct page *page)
+{
+	add_page_to_lru_list(zone, page, LRU_INACTIVE_ANON);
+}
+
+static inline void
+del_page_from_active_anon_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_ACTIVE_ANON);
+}
+
+static inline void
+del_page_from_inactive_anon_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_INACTIVE_ANON);
+}
 
 static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_active_file_list(struct zone *zone, struct page *page)
 {
-	add_page_to_lru_list(zone, page, LRU_ACTIVE);
+	add_page_to_lru_list(zone, page, LRU_ACTIVE_FILE);
 }
 
 static inline void
-add_page_to_inactive_list(struct zone *zone, struct page *page)
+add_page_to_inactive_file_list(struct zone *zone, struct page *page)
 {
-	add_page_to_lru_list(zone, page, LRU_INACTIVE);
+	add_page_to_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
 static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+del_page_from_active_file_list(struct zone *zone, struct page *page)
 {
-	del_page_from_lru_list(zone, page, LRU_ACTIVE);
+	del_page_from_lru_list(zone, page, LRU_ACTIVE_FILE);
 }
 
 static inline void
-del_page_from_inactive_list(struct zone *zone, struct page *page)
+del_page_from_inactive_file_list(struct zone *zone, struct page *page)
 {
-	del_page_from_lru_list(zone, page, LRU_INACTIVE);
+	del_page_from_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
-	enum lru_list l = LRU_INACTIVE;
+	enum lru_list l = LRU_INACTIVE_ANON;
 
 	list_del(&page->lru);
 	if (PageActive(page)) {
 		__ClearPageActive(page);
-		l = LRU_ACTIVE;
+		l = LRU_ACTIVE_ANON;
 	}
-	__dec_zone_state(zone, NR_INACTIVE + l);
+	l += page_file_cache(page);
+	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
 #endif
Index: linux-2.6.24-rc3-mm2/include/linux/pagevec.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/pagevec.h
+++ linux-2.6.24-rc3-mm2/include/linux/pagevec.h
@@ -23,8 +23,10 @@ struct pagevec {
 void __pagevec_release(struct pagevec *pvec);
 void __pagevec_release_nonlru(struct pagevec *pvec);
 void __pagevec_free(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_add_active(struct pagevec *pvec);
+void __pagevec_lru_add_file(struct pagevec *pvec);
+void __pagevec_lru_add_active_file(struct pagevec *pvec);
+void __pagevec_lru_add_anon(struct pagevec *pvec);
+void __pagevec_lru_add_active_anon(struct pagevec *pvec);
 void pagevec_strip(struct pagevec *pvec);
 void pagevec_swap_free(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
@@ -82,10 +84,16 @@ static inline void pagevec_free(struct p
 		__pagevec_free(pvec);
 }
 
-static inline void pagevec_lru_add(struct pagevec *pvec)
+static inline void pagevec_lru_add_file(struct pagevec *pvec)
 {
 	if (pagevec_count(pvec))
-		__pagevec_lru_add(pvec);
+		__pagevec_lru_add_file(pvec);
+}
+
+static inline void pagevec_lru_add_anon(struct pagevec *pvec)
+{
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_anon(pvec);
 }
 
 #endif /* _LINUX_PAGEVEC_H */
Index: linux-2.6.24-rc3-mm2/include/linux/vmstat.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/vmstat.h
+++ linux-2.6.24-rc3-mm2/include/linux/vmstat.h
@@ -149,6 +149,16 @@ static inline unsigned long zone_page_st
 	return x;
 }
 
+extern unsigned long global_lru_pages(void);
+
+static inline unsigned long zone_lru_pages(struct zone *zone)
+{
+	return (zone_page_state(zone, NR_ACTIVE_ANON)
+		+ zone_page_state(zone, NR_ACTIVE_FILE)
+		+ zone_page_state(zone, NR_INACTIVE_ANON)
+		+ zone_page_state(zone, NR_INACTIVE_FILE));
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Determine the per node value of a stat item. This function
Index: linux-2.6.24-rc3-mm2/mm/page-writeback.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/page-writeback.c
+++ linux-2.6.24-rc3-mm2/mm/page-writeback.c
@@ -270,9 +270,7 @@ static unsigned long highmem_dirtyable_m
 		struct zone *z =
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
-		x += zone_page_state(z, NR_FREE_PAGES)
-			+ zone_page_state(z, NR_INACTIVE)
-			+ zone_page_state(z, NR_ACTIVE);
+		x += zone_page_state(z, NR_FREE_PAGES) + zone_lru_pages(z);
 	}
 	/*
 	 * Make sure that the number of highmem pages is never larger
@@ -290,9 +288,7 @@ static unsigned long determine_dirtyable
 {
 	unsigned long x;
 
-	x = global_page_state(NR_FREE_PAGES)
-		+ global_page_state(NR_INACTIVE)
-		+ global_page_state(NR_ACTIVE);
+	x = global_page_state(NR_FREE_PAGES) + global_lru_pages();
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
Index: linux-2.6.24-rc3-mm2/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/swap.h
+++ linux-2.6.24-rc3-mm2/include/linux/swap.h
@@ -171,8 +171,10 @@ extern unsigned int nr_free_pagecache_pa
 
 
 /* linux/mm/swap.c */
-extern void FASTCALL(lru_cache_add(struct page *));
-extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_cache_add_file(struct page *));
+extern void FASTCALL(lru_cache_add_anon(struct page *));
+extern void FASTCALL(lru_cache_add_active_file(struct page *));
+extern void FASTCALL(lru_cache_add_active_anon(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -185,7 +187,7 @@ extern unsigned long try_to_free_pages(s
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 							gfp_t gfp_mask);
-extern int __isolate_lru_page(struct page *page, int mode);
+extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: linux-2.6.24-rc3-mm2/include/linux/memcontrol.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/memcontrol.h
+++ linux-2.6.24-rc3-mm2/include/linux/memcontrol.h
@@ -41,7 +41,7 @@ extern unsigned long mem_cgroup_isolate_
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active);
+					int active, int file);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
Index: linux-2.6.24-rc3-mm2/mm/memcontrol.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/memcontrol.c
+++ linux-2.6.24-rc3-mm2/mm/memcontrol.c
@@ -30,6 +30,7 @@
 #include <linux/spinlock.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/mm_inline.h>
 
 #include <asm/uaccess.h>
 
@@ -511,7 +512,7 @@ unsigned long mem_cgroup_isolate_pages(u
 					unsigned long *scanned, int order,
 					int mode, struct zone *z,
 					struct mem_cgroup *mem_cont,
-					int active)
+					int active, int file)
 {
 	unsigned long nr_taken = 0;
 	struct page *page;
@@ -523,6 +524,7 @@ unsigned long mem_cgroup_isolate_pages(u
 	int zid = zone_idx(z);
 	struct mem_cgroup_per_zone *mz;
 
+	/* TODO: split file and anon LRUs - Rik */
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
 	if (active)
 		src = &mz->active_list;
@@ -541,6 +543,9 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (unlikely(!PageLRU(page)))
 			continue;
 
+		/*
+		 * TODO: play better with lumpy reclaim, grabbing anything.
+		 */
 		if (PageActive(page) && !active) {
 			__mem_cgroup_move_lists(pc, true);
 			continue;
@@ -553,7 +558,7 @@ unsigned long mem_cgroup_isolate_pages(u
 		scan++;
 		list_move(&pc->lru, &pc_list);
 
-		if (__isolate_lru_page(page, mode) == 0) {
+		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
 			nr_taken++;
 		}

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 09/20] split anon & file LRUs for memcontrol code
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (7 preceding siblings ...)
  2007-12-18 21:15 ` [patch 08/20] split LRU lists into anon & file sets Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 10/20] SEQ replacement for anonymous pages Rik van Riel
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn

[-- Attachment #1: rvr-03-linux-2.6-memcontrol-lrus.patch --]
[-- Type: text/plain, Size: 12009 bytes --]

Update the split anon & file LRU code to deal with the recent
memory controller changes.
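
The conversion repeatedly computes an LRU index directly from the
per-page "active" and "file" state, relying on the enum layout from
the earlier mmzone.h change (LRU_ACTIVE == 1, LRU_FILE == 2).  As a
purely illustrative sketch -- this helper is not part of the patch --
the arithmetic boils down to:

	/* illustration only: map (active, file) state to an lru index */
	static inline enum lru_list lru_index(int active, int file)
	{
		return LRU_FILE * !!file + LRU_ACTIVE * !!active;
	}

so (inactive, anon) maps to LRU_INACTIVE_ANON, (active, anon) to
LRU_ACTIVE_ANON, (inactive, file) to LRU_INACTIVE_FILE and
(active, file) to LRU_ACTIVE_FILE, matching the order of the
NR_*_ANON/NR_*_FILE zone statistics.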

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc3-mm2/include/linux/memcontrol.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/memcontrol.h
+++ linux-2.6.24-rc3-mm2/include/linux/memcontrol.h
@@ -73,10 +73,8 @@ extern void mem_cgroup_note_reclaim_prio
 extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
 							int priority);
 
-extern long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-				struct zone *zone, int priority);
-extern long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-				struct zone *zone, int priority);
+extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+					int priority, int active, int file);
 
 #else /* CONFIG_CGROUP_MEM_CONT */
 static inline void mm_init_cgroup(struct mm_struct *mm,
@@ -174,14 +172,9 @@ static inline void mem_cgroup_record_rec
 {
 }
 
-static inline long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
-{
-	return 0;
-}
-
-static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
+static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
+					struct zone *zone, int priority,
+					int active, int file)
 {
 	return 0;
 }
Index: linux-2.6.24-rc3-mm2/mm/memcontrol.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/memcontrol.c
+++ linux-2.6.24-rc3-mm2/mm/memcontrol.c
@@ -81,22 +81,13 @@ static s64 mem_cgroup_read_stat(struct m
 /*
  * per-zone information in memory controller.
  */
-
-enum mem_cgroup_zstat_index {
-	MEM_CGROUP_ZSTAT_ACTIVE,
-	MEM_CGROUP_ZSTAT_INACTIVE,
-
-	NR_MEM_CGROUP_ZSTAT,
-};
-
 struct mem_cgroup_per_zone {
 	/*
 	 * spin_lock to protect the per cgroup LRU
 	 */
 	spinlock_t		lru_lock;
-	struct list_head	active_list;
-	struct list_head	inactive_list;
-	unsigned long count[NR_MEM_CGROUP_ZSTAT];
+	struct list_head	lists[NR_LRU_LISTS];
+	unsigned long		count[NR_LRU_LISTS];
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -162,6 +153,7 @@ struct page_cgroup {
 };
 #define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* charged as cache */
 #define PAGE_CGROUP_FLAG_ACTIVE (0x2)	/* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE	(0x4)	/* page is file system backed */
 
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
@@ -223,7 +215,7 @@ page_cgroup_zoneinfo(struct page_cgroup 
 }
 
 static unsigned long mem_cgroup_get_all_zonestat(struct mem_cgroup *mem,
-					enum mem_cgroup_zstat_index idx)
+					enum lru_list idx)
 {
 	int nid, zid;
 	struct mem_cgroup_per_zone *mz;
@@ -349,13 +341,15 @@ static struct page_cgroup *clear_page_cg
 
 static void __mem_cgroup_remove_list(struct page_cgroup *pc)
 {
-	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+	int lru = LRU_BASE;
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
 
-	if (from)
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
-	else
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		lru += LRU_ACTIVE;
+	if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		lru += LRU_FILE;
+
+	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
 	list_del_init(&pc->lru);
@@ -363,38 +357,37 @@ static void __mem_cgroup_remove_list(str
 
 static void __mem_cgroup_add_list(struct page_cgroup *pc)
 {
-	int to = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	int lru = LRU_BASE;
+
+	if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		lru += LRU_ACTIVE;
+	if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		lru += LRU_FILE;
+
+	MEM_CGROUP_ZSTAT(mz, lru) += 1;
+	list_add(&pc->lru, &mz->lists[lru]);
 
-	if (!to) {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
-		list_add(&pc->lru, &mz->inactive_list);
-	} else {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
-		list_add(&pc->lru, &mz->active_list);
-	}
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
 }
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, bool active)
 {
 	int from = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
+	int file = pc->flags & PAGE_CGROUP_FLAG_FILE;
+	int lru = LRU_FILE * !!file + !!from;
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
 
-	if (from)
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) -= 1;
-	else
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) -= 1;
+	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
-	if (active) {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE) += 1;
+	if (active)
 		pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
-		list_move(&pc->lru, &mz->active_list);
-	} else {
-		MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE) += 1;
+	else
 		pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-		list_move(&pc->lru, &mz->inactive_list);
-	}
+
+	lru = LRU_FILE * !!file + !!active;
+	MEM_CGROUP_ZSTAT(mz, lru) += 1;
+	list_move(&pc->lru, &mz->lists[lru]);
 }
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
@@ -440,20 +433,6 @@ int mem_cgroup_calc_mapped_ratio(struct 
 	rss = (long)mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
 	return (int)((rss * 100L) / total);
 }
-/*
- * This function is called from vmscan.c. In page reclaiming loop. balance
- * between active and inactive list is calculated. For memory controller
- * page reclaiming, we should use using mem_cgroup's imbalance rather than
- * zone's global lru imbalance.
- */
-long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem)
-{
-	unsigned long active, inactive;
-	/* active and inactive are the number of pages. 'long' is ok.*/
-	active = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_ACTIVE);
-	inactive = mem_cgroup_get_all_zonestat(mem, MEM_CGROUP_ZSTAT_INACTIVE);
-	return (long) (active / (inactive + 1));
-}
 
 /*
  * prev_priority control...this will be used in memory reclaim path.
@@ -482,29 +461,16 @@ void mem_cgroup_record_reclaim_priority(
  * (see include/linux/mmzone.h)
  */
 
-long mem_cgroup_calc_reclaim_active(struct mem_cgroup *mem,
-				   struct zone *zone, int priority)
-{
-	long nr_active;
-	int nid = zone->zone_pgdat->node_id;
-	int zid = zone_idx(zone);
-	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
-
-	nr_active = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_ACTIVE);
-	return (nr_active >> priority);
-}
-
-long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
-					struct zone *zone, int priority)
+long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
+				int priority, int active, int file)
 {
-	long nr_inactive;
+	long nr_pages;
 	int nid = zone->zone_pgdat->node_id;
 	int zid = zone_idx(zone);
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
-	nr_inactive = MEM_CGROUP_ZSTAT(mz, MEM_CGROUP_ZSTAT_INACTIVE);
-
-	return (nr_inactive >> priority);
+	nr_pages = MEM_CGROUP_ZSTAT(mz, LRU_FILE * !!file + !!active);
+	return (nr_pages >> priority);
 }
 
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -522,14 +488,12 @@ unsigned long mem_cgroup_isolate_pages(u
 	struct page_cgroup *pc, *tmp;
 	int nid = z->zone_pgdat->node_id;
 	int zid = zone_idx(z);
+	int lru = LRU_FILE * !!file + !!active;
 	struct mem_cgroup_per_zone *mz;
 
 	/* TODO: split file and anon LRUs - Rik */
 	mz = mem_cgroup_zoneinfo(mem_cont, nid, zid);
-	if (active)
-		src = &mz->active_list;
-	else
-		src = &mz->inactive_list;
+	src = &mz->lists[lru];
 
 
 	spin_lock(&mz->lru_lock);
@@ -684,6 +648,8 @@ noreclaim:
 	pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
 	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE)
 		pc->flags |= PAGE_CGROUP_FLAG_CACHE;
+	if (page_file_cache(page))
+		pc->flags |= PAGE_CGROUP_FLAG_FILE;
 	if (page_cgroup_assign_new_page_cgroup(page, pc)) {
 		/*
 		 * an another charge is added to this page already.
@@ -840,18 +806,17 @@ retry:
 static void
 mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
-			    int active)
+			    int active, int file)
 {
 	struct page_cgroup *pc;
 	struct page *page;
 	int count;
 	unsigned long flags;
 	struct list_head *list;
+	int lru;
 
-	if (active)
-		list = &mz->active_list;
-	else
-		list = &mz->inactive_list;
+	lru = LRU_FILE * !!file + !!active;
+	list = &mz->lists[lru];
 
 	if (list_empty(list))
 		return;
@@ -904,10 +869,14 @@ int mem_cgroup_force_empty(struct mem_cg
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
 				mz = mem_cgroup_zoneinfo(mem, node, zid);
-				/* drop all page_cgroup in active_list */
-				mem_cgroup_force_empty_list(mem, mz, 1);
-				/* drop all page_cgroup in inactive_list */
-				mem_cgroup_force_empty_list(mem, mz, 0);
+				/* drop all page_cgroup in ACTIVE_ANON */
+				mem_cgroup_force_empty_list(mem, mz, 1, 0);
+				/* drop all page_cgroup in INACTIVE_ANON */
+				mem_cgroup_force_empty_list(mem, mz, 0, 0);
+				/* drop all page_cgroup in ACTIVE_FILE */
+				mem_cgroup_force_empty_list(mem, mz, 1, 1);
+				/* drop all page_cgroup in INACTIVE_FILE */
+				mem_cgroup_force_empty_list(mem, mz, 0, 1);
 			}
 	}
 	ret = 0;
@@ -1055,14 +1024,21 @@ static int mem_control_stat_show(struct 
 	}
 	/* showing # of active pages */
 	{
-		unsigned long active, inactive;
+		unsigned long active_anon, inactive_anon;
+		unsigned long active_file, inactive_file;
 
-		inactive = mem_cgroup_get_all_zonestat(mem_cont,
-						MEM_CGROUP_ZSTAT_INACTIVE);
-		active = mem_cgroup_get_all_zonestat(mem_cont,
-						MEM_CGROUP_ZSTAT_ACTIVE);
-		seq_printf(m, "active %ld\n", (active) * PAGE_SIZE);
-		seq_printf(m, "inactive %ld\n", (inactive) * PAGE_SIZE);
+		inactive_anon = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_INACTIVE_ANON);
+		active_anon = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_ACTIVE_ANON);
+		inactive_file = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_INACTIVE_FILE);
+		active_file = mem_cgroup_get_all_zonestat(mem_cont,
+						LRU_ACTIVE_FILE);
+		seq_printf(m, "active_anon %ld\n", (active_anon) * PAGE_SIZE);
+		seq_printf(m, "inactive_anon %ld\n", (inactive_anon) * PAGE_SIZE);
+		seq_printf(m, "active_file %ld\n", (active_file) * PAGE_SIZE);
+		seq_printf(m, "inactive_file %ld\n", (inactive_file) * PAGE_SIZE);
 	}
 	return 0;
 }
@@ -1121,6 +1097,7 @@ static int alloc_mem_cgroup_per_zone_inf
 {
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup_per_zone *mz;
+	int i;
 	int zone;
 
 	pn = kmalloc_node(sizeof(*pn), GFP_KERNEL, node);
@@ -1132,8 +1109,8 @@ static int alloc_mem_cgroup_per_zone_inf
 
 	for (zone = 0; zone < MAX_NR_ZONES; zone++) {
 		mz = &pn->zoneinfo[zone];
-		INIT_LIST_HEAD(&mz->active_list);
-		INIT_LIST_HEAD(&mz->inactive_list);
+		for (i = 0; i < NR_LRU_LISTS ; i++)
+			INIT_LIST_HEAD(&mz->lists[i]);
 		spin_lock_init(&mz->lru_lock);
 	}
 	return 0;
Index: linux-2.6.24-rc3-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmscan.c
+++ linux-2.6.24-rc3-mm2/mm/vmscan.c
@@ -1250,11 +1250,15 @@ static unsigned long shrink_zone(int pri
 		 * because memory controller hits its limit.
 		 * Then, don't modify zone reclaim related data.
 		 */
-		nr[LRU_ACTIVE] = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
-					zone, priority);
-
-		nr[LRU_INACTIVE] = mem_cgroup_calc_reclaim_inactive(sc->mem_cgroup,
-					zone, priority);
+		nr[LRU_ACTIVE_FILE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+					zone, priority, 1, 1);
+		nr[LRU_INACTIVE_FILE] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+					zone, priority, 0, 1);
+
+		nr[LRU_ACTIVE_ANON] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+					zone, priority, 1, 0);
+		nr[LRU_INACTIVE_ANON] = mem_cgroup_calc_reclaim(sc->mem_cgroup,
+					zone, priority, 0, 0);
 	}
 
 	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 10/20] SEQ replacement for anonymous pages
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (8 preceding siblings ...)
  2007-12-18 21:15 ` [patch 09/20] split anon & file LRUs for memcontrol code Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-19  5:17   ` KOSAKI Motohiro
  2007-12-18 21:15 ` [patch 11/20] add newly swapped in pages to the inactive list Rik van Riel
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn

[-- Attachment #1: rvr-03-linux-2.6-vm-anon-seq.patch --]
[-- Type: text/plain, Size: 6528 bytes --]

We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages.  At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.

We can reduce the maximum number of pages that need to be scanned by
not taking the referenced state into account when deactivating an
anonymous page.  After all, every anonymous page starts out referenced,
so why check?

If an anonymous page gets referenced again before it reaches the end
of the inactive list, we move it back to the active list.

To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).
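
As a worked example of that formula (matching the int_sqrt() based
setup_per_zone_inactive_ratio() added to page_alloc.c below): a 1GB
zone gets inactive_ratio = int_sqrt(10 * 1) = 3, so the inactive anon
list is allowed to shrink to roughly a quarter of the anonymous pages,
while a 64GB zone gets int_sqrt(10 * 64) = 25, keeping only about 4%
of the anonymous pages on the inactive list.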

Kswapd CPU use now seems to scale with the amount of pageout bandwidth,
rather than with the amount of memory present in the system.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mm_inline.h
+++ linux-2.6.24-rc3-mm2/include/linux/mm_inline.h
@@ -106,4 +106,16 @@ del_page_from_lru(struct zone *zone, str
 	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
 }
 
+static inline int inactive_anon_low(struct zone *zone)
+{
+	unsigned long active, inactive;
+
+	active = zone_page_state(zone, NR_ACTIVE_ANON);
+	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+
+	if (inactive * zone->inactive_ratio < active)
+		return 1;
+
+	return 0;
+}
 #endif
Index: linux-2.6.24-rc3-mm2/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc3-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.24-rc3-mm2/include/linux/mmzone.h
@@ -313,6 +313,11 @@ struct zone {
 	 */
 	int prev_priority;
 
+	/*
+	 * The ratio of active to inactive pages.
+	 */
+	unsigned int inactive_ratio;
+
 
 	ZONE_PADDING(_pad2_)
 	/* Rarely used or read-mostly fields */
Index: linux-2.6.24-rc3-mm2/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/page_alloc.c
+++ linux-2.6.24-rc3-mm2/mm/page_alloc.c
@@ -4221,6 +4221,34 @@ void setup_per_zone_pages_min(void)
 	calculate_totalreserve_pages();
 }
 
+/**
+ * setup_per_zone_inactive_ratio - called when min_free_kbytes changes.
+ *
+ * The inactive anon list should be small enough that the VM never has to
+ * do too much work, but large enough that each inactive page has a chance
+ * to be referenced again before it is swapped out.
+ *
+ * The inactive_anon ratio is the ratio of active to inactive anonymous
+ * pages.  Ie. a ratio of 3 means 3:1 or 25% of the anonymous pages are
+ * on the inactive list.
+ */
+void setup_per_zone_inactive_ratio(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		unsigned int gb, ratio;
+
+		/* Zone size in gigabytes */
+		gb = zone->present_pages >> (30 - PAGE_SHIFT);
+		ratio = int_sqrt(10 * gb);
+		if (!ratio)
+			ratio = 1;
+
+		zone->inactive_ratio = ratio;
+	}
+}
+
 /*
  * Initialise min_free_kbytes.
  *
@@ -4258,6 +4286,7 @@ static int __init init_per_zone_pages_mi
 		min_free_kbytes = 65536;
 	setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
+	setup_per_zone_inactive_ratio();
 	return 0;
 }
 module_init(init_per_zone_pages_min)
Index: linux-2.6.24-rc3-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmscan.c
+++ linux-2.6.24-rc3-mm2/mm/vmscan.c
@@ -1018,7 +1018,7 @@ static inline int zone_is_near_oom(struc
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 				struct scan_control *sc, int priority, int file)
 {
-	unsigned long pgmoved;
+	unsigned long pgmoved = 0;
 	int pgdeactivate = 0;
 	unsigned long pgscanned;
 	LIST_HEAD(l_hold);	/* The pages which were snipped off */
@@ -1057,12 +1057,25 @@ static void shrink_active_list(unsigned 
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
-		if (page_referenced(page, 0, sc->mem_cgroup))
-			lru = LRU_ACTIVE_ANON;
+		if (page_referenced(page, 0, sc->mem_cgroup)) {
+			if (file)
+				/* Referenced file pages stay active. */
+				lru = LRU_ACTIVE_ANON;
+			else
+				/* Anonymous pages always get deactivated. */
+				pgmoved++;
+		}
 		list_add(&page->lru, &list[lru]);
 	}
 
 	/*
+	 * Count the referenced anon pages as rotated, to balance pageout
+	 * scan pressure between file and anonymous pages in get_scan_ratio.
+	 */
+	if (!file)
+		zone->recent_rotated_anon += pgmoved;
+
+	/*
 	 * Now put the pages back to the appropriate [file or anon] inactive
 	 * and active lists.
 	 */
@@ -1144,7 +1157,11 @@ static unsigned long shrink_list(enum lr
 {
 	int file = is_file_lru(lru);
 
-	if (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE) {
+	if (lru == LRU_ACTIVE_FILE) {
+		shrink_active_list(nr_to_scan, zone, sc, priority, file);
+		return 0;
+	}
+	if (lru == LRU_ACTIVE_ANON && inactive_anon_low(zone)) {
 		shrink_active_list(nr_to_scan, zone, sc, priority, file);
 		return 0;
 	}
@@ -1261,6 +1278,9 @@ static unsigned long shrink_zone(int pri
 					zone, priority, 0, 0);
 	}
 
+	if (!inactive_anon_low(zone))
+		nr[LRU_ACTIVE_ANON] = 0;
+
 	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
 			nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
 		for_each_lru(l) {
@@ -1565,6 +1585,14 @@ loop_again:
 			    priority != DEF_PRIORITY)
 				continue;
 
+			/*
+			 * Do some background aging of the anon list, to give
+			 * pages a chance to be referenced before reclaiming.
+			 */
+			if (inactive_anon_low(zone))
+				shrink_active_list(SWAP_CLUSTER_MAX, zone,
+							&sc, priority, 0);
+
 			if (!zone_watermark_ok(zone, order, zone->pages_high,
 					       0, 0)) {
 				end_zone = i;
Index: linux-2.6.24-rc3-mm2/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc3-mm2.orig/mm/vmstat.c
+++ linux-2.6.24-rc3-mm2/mm/vmstat.c
@@ -807,10 +807,12 @@ static void zoneinfo_show_print(struct s
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
 		   "\n  prev_priority:     %i"
-		   "\n  start_pfn:         %lu",
+		   "\n  start_pfn:         %lu"
+		   "\n  inactive_ratio:    %u",
 			   zone_is_all_unreclaimable(zone),
 		   zone->prev_priority,
-		   zone->zone_start_pfn);
+		   zone->zone_start_pfn,
+		   zone->inactive_ratio);
 	seq_putc(m, '\n');
 }
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 11/20] add newly swapped in pages to the inactive list
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (9 preceding siblings ...)
  2007-12-18 21:15 ` [patch 10/20] SEQ replacement for anonymous pages Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 12/20] No Reclaim LRU Infrastructure Rik van Riel
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn

[-- Attachment #1: rvr-swapin-inactive.patch --]
[-- Type: text/plain, Size: 912 bytes --]

Swapin_readahead can read in a lot of data that the processes in
memory never need.  Adding swap cache pages to the inactive list
prevents them from putting too much pressure on the working set.

This has the potential to help the programs that are already in
memory, but it could also be a disadvantage to processes that
are trying to get swapped in.

In short, this patch needs testing.

Signed-off-by: Rik van Riel <riel@redhat.com>

Index: linux-2.6.23-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.23-mm1.orig/mm/swap_state.c
+++ linux-2.6.23-mm1/mm/swap_state.c
@@ -370,7 +370,7 @@ struct page *read_swap_cache_async(swp_e
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active_anon(new_page);
+			lru_cache_add_anon(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 12/20] No Reclaim LRU Infrastructure
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (10 preceding siblings ...)
  2007-12-18 21:15 ` [patch 11/20] add newly swapped in pages to the inactive list Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 13/20] Non-reclaimable page statistics Rik van Riel
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-01.1-no-reclaim-infrastructure.patch --]
[-- Type: text/plain, Size: 25841 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series
+ define NR_NORECLAIM and LRU_NORECLAIM to avoid errors when not
  configured.

V1 -> V2:
+  handle review comments -- various typos and errors.
+  extract "putback_all_noreclaim_pages()" into a separate patch
   and rework as "scan_all_zones_noreclaim_pages()".

Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan.  Based on a patch by Larry Woodman of Red Hat, reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
hiding them from vmscan.  A separate noreclaim pagevec is provided so
that shrink_active_list() can move nonreclaimable pages to the
noreclaim list without overburdening the zone lru_lock.

Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.  
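
In other words, for any page with PG_lru set, the list it lives on can
be derived from its flags alone.  A hypothetical helper -- not part of
this patch, just a restatement of how del_page_from_lru() and
isolate_lru_page() pick the list in the hunks below -- would be:

	/* illustration only: which LRU list is an on-LRU page on? */
	static inline enum lru_list page_lru(struct page *page)
	{
		if (PageNoreclaim(page))
			return LRU_NORECLAIM;
		return page_file_cache(page) + !!PageActive(page);
	}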

The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM.

A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable.  Subsequent patches will add the various
!reclaimable tests.  We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.
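
The version added at the bottom of this patch is a stub that always
returns 1.  Purely as an illustration of the kind of light-weight test
the later patches add -- for example, anonymous pages with no swap
space to evict them to, one of the cases listed in the Kconfig help
text -- the function might eventually look something like:

	/* illustration only -- the real tests arrive in later patches */
	int page_reclaimable(struct page *page, struct vm_area_struct *vma)
	{
		VM_BUG_ON(PageNoreclaim(page));

		/* anon page, but no swap space to evict it to */
		if (PageAnon(page) && !total_swap_pages)
			return 0;

		return 1;
	}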

Notes:

1.  For now, use bit 30 in page flags.  This restricts the no reclaim
    infrastructure to 64-bit systems.  [The mlock patch, later in this
    series, uses another of these 64-bit-system-only flags.]

    Rationale:  32-bit systems have no free page flags and are less
    likely to have the large amounts of memory that exhibit the problems
    this series attempts to solve.  [I'm sure someone will disabuse me
    of this notion.]

    Thus, NORECLAIM currently depends on [CONFIG_]64BIT.

2.  The pagevec to move pages to the noreclaim list results in another
    loop at the end of shrink_active_list().  If we ultimately adopt Rik
    van Riel's split lru approach, I think we'll need to find a way to
    factor all of these loops into some common code.

3.  TODO:  Memory Controllers maintain separate active and inactive lists.
    Need to consider whether they should also maintain a noreclaim list.  
    Also, convert to use Christoph's array of indexed lru variables?

    See //TODO note in mm/memcontrol.c re:  isolating non-reclaimable
    pages. 

4.  TODO:  more factoring of lru list handling.  But, I want to get this
    as close to functionally correct as possible before introducing those
    perturbations.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.24-rc4-mm1/mm/Kconfig
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/Kconfig
+++ linux-2.6.24-rc4-mm1/mm/Kconfig
@@ -194,3 +194,13 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config NORECLAIM
+	bool "Track non-reclaimable pages (EXPERIMENTAL; 64BIT only)"
+	depends on EXPERIMENTAL && 64BIT
+	help
+	  Supports tracking of non-reclaimable pages off the [in]active lists
+	  to avoid excessive reclaim overhead on large memory systems.  Pages
+	  may be non-reclaimable because:  they are locked into memory, they
+	  are anonymous pages for which no swap space exists, or they are anon
+	  pages that are expensive to unmap [long anon_vma "related vma" list.]
Index: linux-2.6.24-rc4-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.24-rc4-mm1/include/linux/page-flags.h
@@ -94,6 +94,7 @@
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
 
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 #define PG_pinned		PG_owner_priv_1	/* Xen pinned pagetable */
@@ -107,6 +108,8 @@
  *         63                            32                              0
  */
 #define PG_uncached		31	/* Page has been mapped as uncached */
+
+#define PG_noreclaim		30	/* Page is "non-reclaimable"  */
 #endif
 
 /*
@@ -160,6 +163,7 @@ static inline void SetPageUptodate(struc
 #define SetPageActive(page)	set_bit(PG_active, &(page)->flags)
 #define ClearPageActive(page)	clear_bit(PG_active, &(page)->flags)
 #define __ClearPageActive(page)	__clear_bit(PG_active, &(page)->flags)
+#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
 
 #define PageSlab(page)		test_bit(PG_slab, &(page)->flags)
 #define __SetPageSlab(page)	__set_bit(PG_slab, &(page)->flags)
@@ -261,6 +265,21 @@ static inline void __ClearPageTail(struc
 #define PageSwapCache(page)	0
 #endif
 
+#ifdef CONFIG_NORECLAIM
+#define PageNoreclaim(page)	test_bit(PG_noreclaim, &(page)->flags)
+#define SetPageNoreclaim(page)	set_bit(PG_noreclaim, &(page)->flags)
+#define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags)
+#define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags)
+#define TestClearPageNoreclaim(page) test_and_clear_bit(PG_noreclaim, \
+							 &(page)->flags)
+#else
+#define PageNoreclaim(page)	0
+#define SetPageNoreclaim(page)
+#define ClearPageNoreclaim(page)
+#define __ClearPageNoreclaim(page)
+#define TestClearPageNoreclaim(page) 0
+#endif
+
 #define PageUncached(page)	test_bit(PG_uncached, &(page)->flags)
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
Index: linux-2.6.24-rc4-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/mmzone.h
+++ linux-2.6.24-rc4-mm1/include/linux/mmzone.h
@@ -84,6 +84,11 @@ enum zone_stat_item {
 	NR_ACTIVE_ANON,		/*  "     "     "   "       "           */
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "           */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
+#ifdef CONFIG_NORECLAIM
+	NR_NORECLAIM,	/*  "     "     "   "       "         */
+#else
+	NR_NORECLAIM=NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+#endif
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
@@ -123,10 +128,18 @@ enum lru_list {
 	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
-	NR_LRU_LISTS };
+#ifdef CONFIG_NORECLAIM
+	LRU_NORECLAIM,
+#else
+	LRU_NORECLAIM=LRU_ACTIVE_FILE,	/* avoid compiler errors in dead code */
+#endif
+	NR_LRU_LISTS
+};
 
 #define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)
 
+#define for_each_reclaimable_lru(l) for (l = 0; l <= LRU_ACTIVE_FILE; l++)
+
 static inline int is_file_lru(enum lru_list l)
 {
 	BUILD_BUG_ON(LRU_INACTIVE_FILE != 2 || LRU_ACTIVE_FILE != 3);
Index: linux-2.6.24-rc4-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/page_alloc.c
+++ linux-2.6.24-rc4-mm1/mm/page_alloc.c
@@ -248,6 +248,7 @@ static void bad_page(struct page *page)
 			1 << PG_private |
 			1 << PG_locked	|
 			1 << PG_active	|
+			1 << PG_noreclaim	|
 			1 << PG_dirty	|
 			1 << PG_reclaim |
 			1 << PG_slab    |
@@ -482,6 +483,7 @@ static inline int free_pages_check(struc
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
+			1 << PG_noreclaim |
 			1 << PG_buddy ))))
 		bad_page(page);
 	if (PageDirty(page))
@@ -629,6 +631,7 @@ static int prep_new_page(struct page *pa
 			1 << PG_private	|
 			1 << PG_locked	|
 			1 << PG_active	|
+			1 << PG_noreclaim	|
 			1 << PG_dirty	|
 			1 << PG_slab    |
 			1 << PG_swapcache |
Index: linux-2.6.24-rc4-mm1/include/linux/mm_inline.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/mm_inline.h
+++ linux-2.6.24-rc4-mm1/include/linux/mm_inline.h
@@ -92,13 +92,36 @@ del_page_from_inactive_file_list(struct 
 	del_page_from_lru_list(zone, page, LRU_INACTIVE_FILE);
 }
 
+#ifdef CONFIG_NORECLAIM
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page)
+{
+	add_page_to_lru_list(zone, page, LRU_NORECLAIM);
+}
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page)
+{
+	del_page_from_lru_list(zone, page, LRU_NORECLAIM);
+}
+#else
+static inline void
+add_page_to_noreclaim_list(struct zone *zone, struct page *page) { }
+
+static inline void
+del_page_from_noreclaim_list(struct zone *zone, struct page *page) { }
+#endif
+
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
 	enum lru_list l = LRU_INACTIVE_ANON;
 
 	list_del(&page->lru);
-	if (PageActive(page)) {
+	if (PageNoreclaim(page)) {
+		__ClearPageNoreclaim(page);
+		l = LRU_NORECLAIM;
+	} else if (PageActive(page)) {
 		__ClearPageActive(page);
 		l = LRU_ACTIVE_ANON;
 	}
Index: linux-2.6.24-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/swap.h
+++ linux-2.6.24-rc4-mm1/include/linux/swap.h
@@ -175,6 +175,13 @@ extern void FASTCALL(lru_cache_add_file(
 extern void FASTCALL(lru_cache_add_anon(struct page *));
 extern void FASTCALL(lru_cache_add_active_file(struct page *));
 extern void FASTCALL(lru_cache_add_active_anon(struct page *));
+extern void FASTCALL(lru_cache_add_active_or_noreclaim(struct page *page,
+						struct vm_area_struct *vma));
+#ifdef CONFIG_NORECLAIM
+extern void FASTCALL(lru_cache_add_noreclaim(struct page *page));
+#else
+static inline void lru_cache_add_noreclaim(struct page *page) { }
+#endif
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
@@ -206,6 +213,16 @@ static inline int zone_reclaim(struct zo
 }
 #endif
 
+#ifdef CONFIG_NORECLAIM
+extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+#else
+static inline int page_reclaimable(struct page *page,
+						struct vm_area_struct *vma)
+{
+	return 1;
+}
+#endif
+
 extern int kswapd_run(int nid);
 
 #ifdef CONFIG_MMU
Index: linux-2.6.24-rc4-mm1/include/linux/pagevec.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/pagevec.h
+++ linux-2.6.24-rc4-mm1/include/linux/pagevec.h
@@ -27,6 +27,11 @@ void __pagevec_lru_add_file(struct pagev
 void __pagevec_lru_add_active_file(struct pagevec *pvec);
 void __pagevec_lru_add_anon(struct pagevec *pvec);
 void __pagevec_lru_add_active_anon(struct pagevec *pvec);
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec);
+#else
+static inline void __pagevec_lru_add_noreclaim(struct pagevec *pvec) { }
+#endif
 void pagevec_strip(struct pagevec *pvec);
 void pagevec_swap_free(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
Index: linux-2.6.24-rc4-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/swap.c
+++ linux-2.6.24-rc4-mm1/mm/swap.c
@@ -119,7 +119,8 @@ static void pagevec_move_tail(struct pag
 			zone = pagezone;
 			spin_lock(&zone->lru_lock);
 		}
-		if (PageLRU(page) && !PageActive(page)) {
+	 	if (PageLRU(page) && !PageActive(page) && \
+					!PageNoreclaim(page)) {
 			if (page_file_cache(page)) {
 				list_move_tail(&page->lru,
 						&zone->list[LRU_INACTIVE_FILE]);
@@ -153,7 +154,7 @@ int rotate_reclaimable_page(struct page 
 		return 1;
 	if (PageDirty(page))
 		return 1;
-	if (PageActive(page))
+	if (PageActive(page) || PageNoreclaim(page))
 		return 1;
 	if (!PageLRU(page))
 		return 1;
@@ -179,7 +180,7 @@ void fastcall activate_page(struct page 
 	struct zone *zone = page_zone(page);
 
 	spin_lock_irq(&zone->lru_lock);
-	if (PageLRU(page) && !PageActive(page)) {
+	if (PageLRU(page) && !PageActive(page) && !PageNoreclaim(page)) {
 		int lru = LRU_BASE;
 		lru += page_file_cache(page);
 		del_page_from_lru_list(zone, page, lru);
@@ -202,7 +203,8 @@ void fastcall activate_page(struct page 
  */
 void fastcall mark_page_accessed(struct page *page)
 {
-	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+	if (!PageActive(page) && !PageNoreclaim(page) &&
+			PageReferenced(page) && PageLRU(page)) {
 		activate_page(page);
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
@@ -256,6 +258,50 @@ void fastcall lru_cache_add_active_file(
 	put_cpu_var(lru_add_active_file_pvecs);
 }
 
+#ifdef CONFIG_NORECLAIM
+static DEFINE_PER_CPU(struct pagevec, lru_add_noreclaim_pvecs) = { 0, };
+
+void fastcall lru_cache_add_noreclaim(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_noreclaim_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_noreclaim(pvec);
+	put_cpu_var(lru_add_noreclaim_pvecs);
+}
+
+void fastcall lru_cache_add_active_or_noreclaim(struct page *page,
+					struct vm_area_struct *vma)
+{
+	if (page_reclaimable(page, vma)) {
+		if (page_file_cache(page))
+			lru_cache_add_active_file(page);
+		else
+			lru_cache_add_active_anon(page);
+	} else
+		lru_cache_add_noreclaim(page);
+}
+
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu)
+{
+	*pvec = &per_cpu(lru_add_noreclaim_pvecs, cpu);
+	if (pagevec_count(*pvec))
+		__pagevec_lru_add_noreclaim(*pvec);
+}
+#else
+void fastcall lru_cache_add_active_or_noreclaim(struct page *page,
+					struct vm_area_struct *vma)
+{
+	if (page_file_cache(page))
+		lru_cache_add_active_file(page);
+	else
+		lru_cache_add_active_anon(page);
+}
+
+static inline void __drain_noreclaim_pvec(struct pagevec **pvec, int cpu) { }
+#endif
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -290,6 +336,8 @@ static void drain_cpu_pagevecs(int cpu)
 		pagevec_move_tail(pvec);
 		local_irq_restore(flags);
 	}
+
+	__drain_noreclaim_pvec(&pvec, cpu);
 }
 
 void lru_add_drain(void)
@@ -361,6 +409,8 @@ void release_pages(struct page **pages, 
 
 		if (PageLRU(page)) {
 			struct zone *pagezone = page_zone(page);
+			int is_lru_page;
+
 			if (pagezone != zone) {
 				if (zone)
 					spin_unlock_irqrestore(&zone->lru_lock,
@@ -368,8 +418,10 @@ void release_pages(struct page **pages, 
 				zone = pagezone;
 				spin_lock_irqsave(&zone->lru_lock, flags);
 			}
-			VM_BUG_ON(!PageLRU(page));
-			__ClearPageLRU(page);
+			is_lru_page = PageLRU(page);
+			VM_BUG_ON(!(is_lru_page));
+			if (is_lru_page)
+				__ClearPageLRU(page);
 			del_page_from_lru(zone, page);
 		}
 
@@ -448,6 +500,7 @@ void __pagevec_lru_add_file(struct pagev
 			zone = pagezone;
 			spin_lock_irq(&zone->lru_lock);
 		}
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		add_page_to_inactive_file_list(zone, page);
@@ -476,7 +529,7 @@ void __pagevec_lru_add_active_file(struc
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
-		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
 		SetPageActive(page);
 		add_page_to_active_file_list(zone, page);
 	}
@@ -538,6 +591,35 @@ void __pagevec_lru_add_active_anon(struc
 	pagevec_reinit(pvec);
 }
 
+#ifdef CONFIG_NORECLAIM
+void __pagevec_lru_add_noreclaim(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(PageActive(page) || PageNoreclaim(page));
+		SetPageNoreclaim(page);
+		add_page_to_noreclaim_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+#endif
+
 /*
  * Try to drop buffers from the pages in a pagevec
  */
Index: linux-2.6.24-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/migrate.c
+++ linux-2.6.24-rc4-mm1/mm/migrate.c
@@ -52,9 +52,18 @@ int migrate_prep(void)
 	return 0;
 }
 
+/*
+ * move_to_lru() - place @page onto appropriate lru list
+ * based on preserved page flags:  active, noreclaim, none
+ */
 static inline void move_to_lru(struct page *page)
 {
-	if (PageActive(page)) {
+	if (PageNoreclaim(page)) {
+		VM_BUG_ON(PageActive(page));
+		ClearPageNoreclaim(page);
+		lru_cache_add_noreclaim(page);
+	} else if (PageActive(page)) {
+		VM_BUG_ON(PageNoreclaim(page));	/* race ? */
 		/*
 		 * lru_cache_add_active checks that
 		 * the PG_active bit is off.
@@ -65,6 +74,7 @@ static inline void move_to_lru(struct pa
 		else
 			lru_cache_add_active_anon(page);
 	} else {
+		VM_BUG_ON(PageNoreclaim(page));	/* race ? */
 		if (page_file_cache(page))
 			lru_cache_add_file(page);
 		else
@@ -341,8 +351,11 @@ static void migrate_page_copy(struct pag
 		SetPageReferenced(newpage);
 	if (PageUptodate(page))
 		SetPageUptodate(newpage);
-	if (PageActive(page))
+	if (TestClearPageActive(page)) {
+		VM_BUG_ON(PageNoreclaim(page));
 		SetPageActive(newpage);
+	} else if (TestClearPageNoreclaim(page))
+		SetPageNoreclaim(newpage);
 	if (PageChecked(page))
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
@@ -356,7 +369,6 @@ static void migrate_page_copy(struct pag
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);
 #endif
-	ClearPageActive(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);
 	page->mapping = NULL;
Index: linux-2.6.24-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c
+++ linux-2.6.24-rc4-mm1/mm/vmscan.c
@@ -480,6 +480,11 @@ static unsigned long shrink_page_list(st
 
 		sc->nr_scanned++;
 
+		if (!page_reclaimable(page, NULL)) {
+			SetPageNoreclaim(page);
+			goto keep_locked;
+		}
+
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 
@@ -582,7 +587,7 @@ static unsigned long shrink_page_list(st
 		 * possible for a page to have PageDirty set, but it is actually
 		 * clean (all its buffers are clean).  This happens if the
 		 * buffers were written out directly, with submit_bh(). ext3
-		 * will do this, as well as the blockdev mapping. 
+		 * will do this, as well as the blockdev mapping.
 		 * try_to_release_page() will discover that cleanness and will
 		 * drop the buffers and mark the page clean - it can be freed.
 		 *
@@ -614,6 +619,7 @@ activate_locked:
 		/* Not a candidate for swapping, so reclaim swap space. */
 		if (PageSwapCache(page) && vm_swap_full())
 			remove_exclusive_swap_page(page);
+		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
@@ -664,6 +670,14 @@ int __isolate_lru_page(struct page *page
 	if (mode != ISOLATE_BOTH && (!page_file_cache(page) != !file))
 		return ret;
 
+	/*
+	 * Non-reclaimable pages shouldn't make it onto either the active
+	 * nor the inactive list. However, when doing lumpy reclaim of
+	 * higher order pages we can still run into them.
+	 */
+	if (PageNoreclaim(page))
+		return ret;
+
 	ret = -EBUSY;
 	if (likely(get_page_unless_zero(page))) {
 		/*
@@ -775,7 +789,7 @@ static unsigned long isolate_lru_pages(u
 				/* else it is being freed elsewhere */
 				list_move(&cursor_page->lru, src);
 			default:
-				break;
+				break;	/* ! on LRU or wrong list */
 			}
 		}
 	}
@@ -831,9 +845,10 @@ static unsigned long clear_active_flags(
  * refcount on the page, which is a fundamentnal difference from
  * isolate_lru_pages (which is called without a stable reference).
  *
- * The returned page will have PageLru() cleared, and PageActive set,
- * if it was found on the active list. This flag generally will need to be
- * cleared by the caller before letting the page go.
+ * The returned page will have the PageLru() cleared, and the PageActive or
+ * PageNoreclaim will be set, if it was found on the active or noreclaim list,
+ * respectively. This flag generally will need to be cleared by the caller
+ * before letting the page go.
  *
  * The vmstat page counts corresponding to the list on which the page was
  * found will be decremented.
@@ -858,6 +873,11 @@ int isolate_lru_page(struct page *page)
 			ClearPageLRU(page);
 
 			lru += page_file_cache(page) + !!PageActive(page);
+
+			/* NoReclaim pages have their own list. */
+			if (PageNoreclaim(page))
+				lru = LRU_NORECLAIM;
+
 			del_page_from_lru_list(zone, page, lru);
 		}
 		spin_unlock_irq(&zone->lru_lock);
@@ -956,14 +976,19 @@ static unsigned long shrink_inactive_lis
 			VM_BUG_ON(PageLRU(page));
 			SetPageLRU(page);
 			list_del(&page->lru);
-			if (page_file_cache(page)) {
-				lru += LRU_FILE;
-				zone->recent_rotated_file++;
+			if (PageNoreclaim(page)) {
+				VM_BUG_ON(PageActive(page));
+				lru = LRU_NORECLAIM;
 			} else {
-				zone->recent_rotated_anon++;
+				if (page_file_cache(page)) {
+					lru += LRU_FILE;
+					zone->recent_rotated_file++;
+				} else {
+					zone->recent_rotated_anon++;
+				}
+				if (PageActive(page))
+					lru += LRU_ACTIVE;
 			}
-			if (PageActive(page))
-				lru += LRU_ACTIVE;
 			add_page_to_lru_list(zone, page, lru);
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1057,6 +1082,13 @@ static void shrink_active_list(unsigned 
 		cond_resched();
 		page = lru_to_page(&l_hold);
 		list_del(&page->lru);
+
+		if (!page_reclaimable(page, NULL)) {
+			/* Non-reclaimable pages go onto their own list. */
+			list_add(&page->lru, &list[LRU_NORECLAIM]);
+			continue;
+		}
+
 		if (page_referenced(page, 0, sc->mem_cgroup)) {
 			if (file)
 				/* Referenced file pages stay active. */
@@ -1143,6 +1175,33 @@ static void shrink_active_list(unsigned 
 		zone->recent_rotated_anon += pgmoved;
 	}
 
+#ifdef CONFIG_NORECLAIM
+	pgmoved = 0;
+	while (!list_empty(&list[LRU_NORECLAIM])) {
+		page = lru_to_page(&list[LRU_NORECLAIM]);
+		prefetchw_prev_lru_page(page, &list[LRU_NORECLAIM], flags);
+
+		VM_BUG_ON(PageLRU(page));
+		SetPageLRU(page);
+		VM_BUG_ON(!PageActive(page));
+		ClearPageActive(page);
+		VM_BUG_ON(PageNoreclaim(page));
+		SetPageNoreclaim(page);
+
+		list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+		pgmoved++;
+		if (!pagevec_add(&pvec, page)) {
+			__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+//TODO:  count these as deactivations?
+			pgmoved = 0;
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_NORECLAIM, pgmoved);
+#endif
+
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
@@ -1251,7 +1310,7 @@ static unsigned long shrink_zone(int pri
 		 * Add one to nr_to_scan just to make sure that the kernel
 		 * will slowly sift through the active list.
 		 */
-		for_each_lru(l) {
+		for_each_reclaimable_lru(l) {
 			int file = is_file_lru(l);
 			zone->nr_scan[l] += (zone_page_state(zone,
 				NR_INACTIVE_ANON + l) >> priority) + 1;
@@ -1283,7 +1342,7 @@ static unsigned long shrink_zone(int pri
 
 	while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
 			nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
-		for_each_lru(l) {
+		for_each_reclaimable_lru(l) {
 			if (nr[l]) {
 				nr_to_scan = min(nr[l],
 					(unsigned long)sc->swap_cluster_max);
@@ -1822,8 +1881,8 @@ static unsigned long shrink_all_zones(un
 		if (zone_is_all_unreclaimable(zone) && prio != DEF_PRIORITY)
 			continue;
 
-		for_each_lru(l) {
-			/* For pass = 0 we don't shrink the active list */
+		for_each_reclaimable_lru(l) {
+			/* For pass = 0, we don't shrink the active list */
 			if (pass == 0 &&
 				(l == LRU_ACTIVE_ANON || l == LRU_ACTIVE_FILE))
 				continue;
@@ -2169,3 +2228,29 @@ int zone_reclaim(struct zone *zone, gfp_
 	return ret;
 }
 #endif
+
+#ifdef CONFIG_NORECLAIM
+/*
+ * page_reclaimable(struct page *page, struct vm_area_struct *vma)
+ * Test whether page is reclaimable--i.e., should be placed on active/inactive
+ * lists vs noreclaim list.
+ *
+ * @page       - page to test
+ * @vma        - vm area in which page is/will be mapped.  May be NULL.
+ *               If !NULL, called from fault path.
+ *
+ * Reasons page might not be reclaimable:
+ * TODO - later patches
+ *
+ * TODO:  specify locking assumptions
+ */
+int page_reclaimable(struct page *page, struct vm_area_struct *vma)
+{
+
+	VM_BUG_ON(PageNoreclaim(page));
+
+	/* TODO:  test page [!]reclaimable conditions */
+
+	return 1;
+}
+#endif
Index: linux-2.6.24-rc4-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/mempolicy.c
+++ linux-2.6.24-rc4-mm1/mm/mempolicy.c
@@ -1912,7 +1912,7 @@ static void gather_stats(struct page *pa
 	if (PageSwapCache(page))
 		md->swapcache++;
 
-	if (PageActive(page))
+	if (PageActive(page) || PageNoreclaim(page))
 		md->active++;
 
 	if (PageWriteback(page))
Index: linux-2.6.24-rc4-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/memcontrol.c
+++ linux-2.6.24-rc4-mm1/mm/memcontrol.c
@@ -521,6 +521,10 @@ unsigned long mem_cgroup_isolate_pages(u
 		scan++;
 		list_move(&pc->lru, &pc_list);
 
+//TODO:  for now, don't isolate non-reclaimable pages.  When/if
+// mem controller supports a noreclaim list, we'll need to make
+// at least ISOLATE_ACTIVE visible outside of vm_scan and pass
+// the 'take_nonreclaimable' flag accordingly.
 		if (__isolate_lru_page(page, mode, file) == 0) {
 			list_move(&page->lru, dst);
 			nr_taken++;

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 13/20] Non-reclaimable page statistics
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (11 preceding siblings ...)
  2007-12-18 21:15 ` [patch 12/20] No Reclaim LRU Infrastructure Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 14/20] Scan noreclaim list for reclaimable pages Rik van Riel
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-01.2-report-nonreclaimable-memory.patch --]
[-- Type: text/plain, Size: 4706 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series

V1 -> V2:
	no changes

Report non-reclaimable pages per zone and system wide.

Note:  we may want to track/report specific reasons for
non-reclaimability, to decide when to splice the noreclaim
lists back onto the normal LRU.  That will be tricky,
especially in shrink_active_list(), where we'd need someplace
to save the per-page reason for non-reclaimability until the
pages are dumped back onto the noreclaim list from the pagevec.

Note:  my tests indicate that NR_NORECLAIM and probably the
other LRU stats aren't being maintained properly--especially
with large amounts of mlocked memory and the mlock patch from
this series applied.  I can't be sure of this, as I don't
know why the pages are on the noreclaim list.  Needs further
investigation.
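
For reference, a trivial userspace check [illustrative only, not part
of the patch] that dumps the new counters; the names "nr_noreclaim"
and "Noreclaim:" come from the vmstat and meminfo hunks below:

/* print the non-reclaimable page counters added by this patch */
#include <stdio.h>
#include <string.h>

static void grep_file(const char *path, const char *key)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (strstr(line, key))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	grep_file("/proc/vmstat", "nr_noreclaim");	/* pages, system wide */
	grep_file("/proc/meminfo", "Noreclaim");	/* kB, system wide */
	return 0;
}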

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.24-rc4-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/page_alloc.c
+++ linux-2.6.24-rc4-mm1/mm/page_alloc.c
@@ -1882,12 +1882,20 @@ void show_free_areas(void)
 	}
 
 	printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
-		" inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+		" inactive_file:%lu"
+//TODO:  check/adjust line lengths
+#ifdef CONFIG_NORECLAIM
+		" noreclaim:%lu"
+#endif
+		" dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
 		global_page_state(NR_ACTIVE_ANON),
 		global_page_state(NR_ACTIVE_FILE),
 		global_page_state(NR_INACTIVE_ANON),
 		global_page_state(NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM
+		global_page_state(NR_NORECLAIM),
+#endif
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1914,6 +1922,9 @@ void show_free_areas(void)
 			" inactive_anon:%lukB"
 			" active_file:%lukB"
 			" inactive_file:%lukB"
+#ifdef CONFIG_NORECLAIM
+			" noreclaim:%lukB"
+#endif
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1927,6 +1938,9 @@ void show_free_areas(void)
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_INACTIVE_FILE)),
+#ifdef CONFIG_NORECLAIM
+			K(zone_page_state(zone, NR_NORECLAIM)),
+#endif
 			K(zone->present_pages),
 			zone->pages_scanned,
 			(zone_is_all_unreclaimable(zone) ? "yes" : "no")
Index: linux-2.6.24-rc4-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/vmstat.c
+++ linux-2.6.24-rc4-mm1/mm/vmstat.c
@@ -690,6 +690,9 @@ static const char * const vmstat_text[] 
 	"nr_active_anon",
 	"nr_inactive_file",
 	"nr_active_file",
+#ifdef CONFIG_NORECLAIM
+	"nr_noreclaim",
+#endif
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
Index: linux-2.6.24-rc4-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/drivers/base/node.c
+++ linux-2.6.24-rc4-mm1/drivers/base/node.c
@@ -52,6 +52,9 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d Inactive(anon): %8lu kB\n"
 		       "Node %d Active(file):   %8lu kB\n"
 		       "Node %d Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+		       "Node %d Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		       "Node %d HighTotal:      %8lu kB\n"
 		       "Node %d HighFree:       %8lu kB\n"
@@ -76,6 +79,9 @@ static ssize_t node_read_meminfo(struct 
 		       nid, node_page_state(nid, NR_INACTIVE_ANON),
 		       nid, node_page_state(nid, NR_ACTIVE_FILE),
 		       nid, node_page_state(nid, NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM
+		       nid, node_page_state(nid, NR_NORECLAIM),
+#endif
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
 		       nid, K(i.freehigh),
Index: linux-2.6.24-rc4-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/fs/proc/proc_misc.c
+++ linux-2.6.24-rc4-mm1/fs/proc/proc_misc.c
@@ -162,6 +162,9 @@ static int meminfo_read_proc(char *page,
 		"Inactive(anon): %8lu kB\n"
 		"Active(file):   %8lu kB\n"
 		"Inactive(file): %8lu kB\n"
+#ifdef CONFIG_NORECLAIM
+		"Noreclaim:    %8lu kB\n"
+#endif
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
 		"HighFree:       %8lu kB\n"
@@ -194,6 +197,9 @@ static int meminfo_read_proc(char *page,
 		K(global_page_state(NR_INACTIVE_ANON)),
 		K(global_page_state(NR_ACTIVE_FILE)),
 		K(global_page_state(NR_INACTIVE_FILE)),
+#ifdef CONFIG_NORECLAIM
+		K(global_page_state(NR_NORECLAIM)),
+#endif
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
 		K(i.freehigh),

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 14/20] Scan noreclaim list for reclaimable pages
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (12 preceding siblings ...)
  2007-12-18 21:15 ` [patch 13/20] Non-reclaimable page statistics Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 15/20] ramfs pages are non-reclaimable Rik van Riel
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-01.3-scan-noreclaim-list-for-reclaimable-pages.patch --]
[-- Type: text/plain, Size: 8528 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series

New in V2

This patch adds a function to scan individual or all zones' noreclaim
lists and move any pages that have become reclaimable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.

This replaces the earlier approach of splicing the entire noreclaim list
onto the active list for rescan by shrink_active_list().  That method had
problems with vmstat accounting and it complicated
'[__]isolate_lru_pages()'.  Now, __isolate_lru_page() will never isolate
a non-reclaimable page.  The only time it should see one is when scanning
nearby pages for lumpy reclaim.

  TODO:  This approach may still need some refinement.
         E.g., put back to active list?
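
As a usage note:  once this series is applied, a full-system rescan
can also be triggered from userspace by writing a non-zero value to
the new vm sysctl added in the kernel/sysctl.c hunk below (a per-node
sysfs attribute is added as well).  Illustrative only, not part of
the patch:

/* kick off scan_all_zones_noreclaim_pages() via the new sysctl */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/scan_noreclaim_pages", "w");

	if (!f) {
		perror("scan_noreclaim_pages");
		return 1;
	}
	fputs("1\n", f);	/* any non-zero value triggers the scan */
	fclose(f);
	return 0;
}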

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>


Index: linux-2.6.24-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/swap.h
+++ linux-2.6.24-rc4-mm1/include/linux/swap.h
@@ -7,6 +7,7 @@
 #include <linux/list.h>
 #include <linux/sched.h>
 #include <linux/memcontrol.h>
+#include <linux/node.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -215,12 +216,26 @@ static inline int zone_reclaim(struct zo
 
 #ifdef CONFIG_NORECLAIM
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
+extern void scan_zone_noreclaim_pages(struct zone *);
+extern void scan_all_zones_noreclaim_pages(void);
+extern unsigned long scan_noreclaim_pages;
+extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
+					void __user *, size_t *, loff_t *);
+extern int scan_noreclaim_register_node(struct node *node);
+extern void scan_noreclaim_unregister_node(struct node *node);
 #else
 static inline int page_reclaimable(struct page *page,
 						struct vm_area_struct *vma)
 {
 	return 1;
 }
+static inline void scan_zone_noreclaim_pages(struct zone *z) { }
+static inline void scan_all_zones_noreclaim_pages(void) { }
+static inline int scan_noreclaim_register_node(struct node *node)
+{
+	return 0;
+}
+static inline void scan_noreclaim_unregister_node(struct node *node) { }
 #endif
 
 extern int kswapd_run(int nid);
Index: linux-2.6.24-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c
+++ linux-2.6.24-rc4-mm1/mm/vmscan.c
@@ -39,6 +39,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/memcontrol.h>
+#include <linux/sysctl.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2253,4 +2254,144 @@ int page_reclaimable(struct page *page, 
 
 	return 1;
 }
+
+/**
+ * scan_zone_noreclaim_pages(@zone)
+ * @zone - zone to scan
+ *
+ * Scan @zone's noreclaim LRU lists to check for pages that have become
+ * reclaimable.  Move those that have to @zone's inactive list where they
+ * become candidates for reclaim, unless shrink_inactive_zone() decides
+ * to reactivate them.  Pages that are still non-reclaimable are rotated
+ * back onto @zone's noreclaim list.
+ */
+#define SCAN_NORECLAIM_BATCH_SIZE 16UL	/* arbitrary lock hold batch size */
+void scan_zone_noreclaim_pages(struct zone *zone)
+{
+	struct list_head *l_noreclaim = &zone->list[LRU_NORECLAIM];
+	struct list_head *l_inactive_anon  = &zone->list[LRU_INACTIVE_ANON];
+	struct list_head *l_inactive_file  = &zone->list[LRU_INACTIVE_FILE];
+	unsigned long scan;
+	unsigned long nr_to_scan = zone_page_state(zone, NR_NORECLAIM);
+
+	while (nr_to_scan > 0) {
+		unsigned long batch_size = min(nr_to_scan,
+						SCAN_NORECLAIM_BATCH_SIZE);
+
+		spin_lock_irq(&zone->lru_lock);
+		for (scan = 0;  scan < batch_size; scan++) {
+			struct page* page = lru_to_page(l_noreclaim);
+
+			if (unlikely(!PageLRU(page) || !PageNoreclaim(page)))
+				continue;
+
+			prefetchw_prev_lru_page(page, l_noreclaim, flags);
+
+			ClearPageNoreclaim(page); /* for page_reclaimable() */
+			if(page_reclaimable(page, NULL)) {
+				__dec_zone_state(zone, NR_NORECLAIM);
+				if (page_file_cache(page)) {
+					list_move(&page->lru, l_inactive_file);
+					__inc_zone_state(zone, NR_INACTIVE_FILE);
+				} else {
+					list_move(&page->lru, l_inactive_anon);
+					__inc_zone_state(zone, NR_INACTIVE_ANON);
+				}
+			} else {
+				SetPageNoreclaim(page);
+				list_move(&page->lru, l_noreclaim);
+			}
+
+		}
+		spin_unlock_irq(&zone->lru_lock);
+
+		nr_to_scan -= batch_size;
+	}
+}
+
+
+/**
+ * scan_all_zones_noreclaim_pages()
+ *
+ * A really big hammer:  scan all zones' noreclaim LRU lists to check for
+ * pages that have become reclaimable.  Move those back to the zones'
+ * inactive list where they become candidates for reclaim.
+ * This occurs when, e.g., we have unswappable pages on the noreclaim lists,
+ * and we add swap to the system.  As such, it runs in the context of a task
+ * that has possibly/probably made some previously non-reclaimable pages
+ * reclaimable.
+//TODO:  or as a last resort under extreme memory pressure--before OOM?
+ */
+void scan_all_zones_noreclaim_pages(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		scan_zone_noreclaim_pages(zone);
+	}
+}
+
+/*
+ * scan_noreclaim_pages [vm] sysctl handler.  On demand re-scan of
+ * all nodes' noreclaim lists for reclaimable pages
+ */
+unsigned long scan_noreclaim_pages;
+
+int scan_noreclaim_handler( struct ctl_table *table, int write,
+			   struct file *file, void __user *buffer,
+			   size_t *length, loff_t *ppos)
+{
+	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+
+	if (write && *(unsigned long *)table->data)
+		scan_all_zones_noreclaim_pages();
+
+	scan_noreclaim_pages = 0;
+	return 0;
+}
+
+/*
+ * per node 'scan_noreclaim_pages' attribute.  On demand re-scan of
+ * a specified node's per zone noreclaim lists for reclaimable pages.
+ */
+
+static ssize_t read_scan_noreclaim_node(struct sys_device *dev, char *buf)
+{
+	return sprintf(buf, "0\n");	/* always zero; should fit... */
+}
+
+static ssize_t write_scan_noreclaim_node(struct sys_device *dev,
+                                       const char *buf, size_t count)
+{
+	struct zone *node_zones = NODE_DATA(dev->id)->node_zones;
+	struct zone *zone;
+	unsigned long req = simple_strtoul(buf, NULL, 10);
+
+	if (!req)
+		return 1;	/* zero is no-op */
+
+	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+		if (!populated_zone(zone))
+			continue;
+		scan_zone_noreclaim_pages(zone);
+	}
+	return 1;
+}
+
+
+static SYSDEV_ATTR(scan_noreclaim_pages, S_IRUGO | S_IWUSR,
+			read_scan_noreclaim_node,
+			write_scan_noreclaim_node);
+
+int scan_noreclaim_register_node(struct node *node)
+{
+	return sysdev_create_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
+void scan_noreclaim_unregister_node(struct node *node)
+{
+	sysdev_remove_file(&node->sysdev, &attr_scan_noreclaim_pages);
+}
+
+
 #endif
Index: linux-2.6.24-rc4-mm1/kernel/sysctl.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/kernel/sysctl.c
+++ linux-2.6.24-rc4-mm1/kernel/sysctl.c
@@ -1150,6 +1150,16 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_NORECLAIM
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "scan_noreclaim_pages",
+		.data		= &scan_noreclaim_pages,
+		.maxlen		= sizeof(scan_noreclaim_pages),
+		.mode		= 0644,
+		.proc_handler	= &scan_noreclaim_handler,
+	},
+#endif
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux-2.6.24-rc4-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/drivers/base/node.c
+++ linux-2.6.24-rc4-mm1/drivers/base/node.c
@@ -13,6 +13,7 @@
 #include <linux/nodemask.h>
 #include <linux/cpu.h>
 #include <linux/device.h>
+#include <linux/swap.h>
 
 static struct sysdev_class node_class = {
 	set_kset_name("node"),
@@ -162,6 +163,8 @@ int register_node(struct node *node, int
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
 		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
+
+		scan_noreclaim_register_node(node);
 	}
 	return error;
 }
@@ -180,6 +183,8 @@ void unregister_node(struct node *node)
 	sysdev_remove_file(&node->sysdev, &attr_numastat);
 	sysdev_remove_file(&node->sysdev, &attr_distance);
 
+	scan_noreclaim_unregister_node(node);
+
 	sysdev_unregister(&node->sysdev);
 }
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 15/20] ramfs pages are non-reclaimable
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (13 preceding siblings ...)
  2007-12-18 21:15 ` [patch 14/20] Scan noreclaim list for reclaimable pages Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 16/20] SHM_LOCKED pages are nonreclaimable Rik van Riel
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-02-ramdisk-and-ramfs-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 3550 bytes --]

V2 -> V3:
+  rebase to 23-mm1 atop RvR's split LRU series [no changes]

V1 -> V2:
+  add ramfs pages to this class of non-reclaimable pages by
   marking the ramfs address_space [mapping] as non-reclaimable.

Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists.  When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list.  Round and round she goes...

Define a new address_space flag [it shares the address_space flags
member with the mapping's gfp mask] to indicate that all pages in
the address space are non-reclaimable.  This provides for efficient
testing of ramdisk pages in page_reclaimable().

Also provide wrapper functions to set/test the noreclaim state, to
minimize #ifdefs in the ramdisk driver and any other users of this
facility.

Set the noreclaim state on address_space structures for new
ramdisk inodes.  Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.

Similarly, ramfs pages are non-reclaimable.  Set the 'noreclaim'
address_space flag for new ramfs inodes.

These changes depend on [CONFIG_]NORECLAIM.
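
To illustrate the "shares the address_space flags member with the
mapping's gfp mask" remark above--sketch only, not part of the patch,
and it assumes the stock AS_EIO/AS_ENOSPC assignments in pagemap.h:

/*
 * mapping->flags packs the mapping's gfp mask into the low
 * __GFP_BITS_SHIFT bits, with the AS_* state bits above them, so
 * AS_NORECLAIM simply takes the next free slot and
 * mapping_gfp_mask() keeps working unchanged.
 */
enum example_as_bits {			/* kernel context, needs gfp.h */
	EXAMPLE_AS_EIO		= __GFP_BITS_SHIFT + 0,	/* existing */
	EXAMPLE_AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* existing */
	EXAMPLE_AS_NORECLAIM	= __GFP_BITS_SHIFT + 2,	/* added below */
};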


Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc4-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/pagemap.h
+++ linux-2.6.24-rc4-mm1/include/linux/pagemap.h
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
 	}
 }
 
+#ifdef CONFIG_NORECLAIM
+#define AS_NORECLAIM	(__GFP_BITS_SHIFT + 2)	/* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+	set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+	if (mapping && (mapping->flags & AS_NORECLAIM))
+		return 1;
+	return 0;
+}
+#else
+static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline int mapping_non_reclaimable(struct address_space *mapping)
+{
+	return 0;
+}
+#endif
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
Index: linux-2.6.24-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c
+++ linux-2.6.24-rc4-mm1/mm/vmscan.c
@@ -2241,6 +2241,7 @@ int zone_reclaim(struct zone *zone, gfp_
  *               If !NULL, called from fault path.
  *
  * Reasons page might not be reclaimable:
+ * + page's mapping marked non-reclaimable
  * TODO - later patches
  *
  * TODO:  specify locking assumptions
@@ -2250,6 +2251,9 @@ int page_reclaimable(struct page *page, 
 
 	VM_BUG_ON(PageNoreclaim(page));
 
+	if (mapping_non_reclaimable(page_mapping(page)))
+		return 0;
+
 	/* TODO:  test page [!]reclaimable conditions */
 
 	return 1;
Index: linux-2.6.24-rc4-mm1/fs/ramfs/inode.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/fs/ramfs/inode.c
+++ linux-2.6.24-rc4-mm1/fs/ramfs/inode.c
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
 		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		mapping_set_noreclaim(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
 		default:

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 16/20] SHM_LOCKED pages are nonreclaimable
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (14 preceding siblings ...)
  2007-12-18 21:15 ` [patch 15/20] ramfs pages are non-reclaimable Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 17/20] non-reclaimable mlocked pages Rik van Riel
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-03-SHM_LOCKed-pages-are-nonreclaimable.patch --]
[-- Type: text/plain, Size: 7679 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series.
+ Use scan_mapping_noreclaim_pages() on unlock.  See below.

V1 -> V2:
+  modify to use the reworked 'scan_all_zones_noreclaim_pages()'.
   See 'TODO' below - still pending.

While working with Nick Piggin's mlock patches, I noticed that
shmem segments locked via shmctl(SHM_LOCK) were not being handled.
SHM_LOCKed pages work like ramdisk pages--the writeback function
just redirties the page so that it can't be reclaimed.  Deal with
these using the same approach as for ram disk pages.

Use the AS_NORECLAIM flag to mark the address_space of SHM_LOCKed
shared memory regions as non-reclaimable.  These pages will then
be culled off the normal LRU lists during vmscan.

Add a new wrapper function to clear the mapping's noreclaim state
when/if the shared memory segment is unlocked.

Add 'scan_mapping_noreclaim_pages()' to mm/vmscan.c to scan all
pages in the shmem segment's mapping [struct address_space] for
reclaimability now that they're no longer locked.  Pages that have
become reclaimable are moved to the appropriate zone LRU list.

Changes depend on [CONFIG_]NORECLAIM.
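
For testing, the path this patch changes can be exercised from
userspace with a plain SysV shm segment.  Illustrative only, not part
of the patch; needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK, and
the segment size is arbitrary:

/* lock, then unlock, a shm segment to drive shmem_lock() both ways */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	size_t sz = 16 << 20;
	int id = shmget(IPC_PRIVATE, sz, IPC_CREAT | 0600);
	char *p;

	if (id < 0) {
		perror("shmget");
		return 1;
	}
	p = shmat(id, NULL, 0);
	if (p == (void *)-1) {
		perror("shmat");
		return 1;
	}
	memset(p, 0, sz);			/* fault the pages in */

	/* marks the mapping AS_NORECLAIM; vmscan culls pages lazily */
	shmctl(id, SHM_LOCK, NULL);
	/* clears AS_NORECLAIM, runs scan_mapping_noreclaim_pages() */
	shmctl(id, SHM_UNLOCK, NULL);

	shmdt(p);
	shmctl(id, IPC_RMID, NULL);
	return 0;
}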

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: linux-2.6.24-rc4-mm1/mm/shmem.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/shmem.c
+++ linux-2.6.24-rc4-mm1/mm/shmem.c
@@ -1366,10 +1366,13 @@ int shmem_lock(struct file *file, int lo
 		if (!user_shm_lock(inode->i_size, user))
 			goto out_nomem;
 		info->flags |= VM_LOCKED;
+		mapping_set_noreclaim(file->f_mapping);
 	}
 	if (!lock && (info->flags & VM_LOCKED) && user) {
 		user_shm_unlock(inode->i_size, user);
 		info->flags &= ~VM_LOCKED;
+		mapping_clear_noreclaim(file->f_mapping);
+		scan_mapping_noreclaim_pages(file->f_mapping);
 	}
 	retval = 0;
 out_nomem:
Index: linux-2.6.24-rc4-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/pagemap.h
+++ linux-2.6.24-rc4-mm1/include/linux/pagemap.h
@@ -38,14 +38,20 @@ static inline void mapping_set_noreclaim
 	set_bit(AS_NORECLAIM, &mapping->flags);
 }
 
+static inline void mapping_clear_noreclaim(struct address_space *mapping)
+{
+	clear_bit(AS_NORECLAIM, &mapping->flags);
+}
+
 static inline int mapping_non_reclaimable(struct address_space *mapping)
 {
-	if (mapping && (mapping->flags & AS_NORECLAIM))
-		return 1;
+	if (mapping)
+		return test_bit(AS_NORECLAIM, &mapping->flags);
 	return 0;
 }
 #else
 static inline void mapping_set_noreclaim(struct address_space *mapping) { }
+static inline void mapping_clear_noreclaim(struct address_space *mapping) { }
 static inline int mapping_non_reclaimable(struct address_space *mapping)
 {
 	return 0;
Index: linux-2.6.24-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c
+++ linux-2.6.24-rc4-mm1/mm/vmscan.c
@@ -2259,6 +2259,29 @@ int page_reclaimable(struct page *page, 
 	return 1;
 }
 
+/*
+ * check_move_noreclaim_page() -- check page for reclaimability and move
+ * to appropriate zone lru list.
+ * zone->lru_lock held on entry/exit.
+ */
+static void check_move_noreclaim_page(struct page *page, struct zone* zone)
+{
+
+	ClearPageNoreclaim(page); /* for page_reclaimable() */
+	if(page_reclaimable(page, NULL)) {
+		enum lru_list l = LRU_INACTIVE_ANON + page_file_cache(page);
+		__dec_zone_state(zone, NR_NORECLAIM);
+		list_move(&page->lru, &zone->list[l]);
+		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
+	} else {
+		/*
+		 * rotate noreclaim list
+		 */
+		SetPageNoreclaim(page);
+		list_move(&page->lru, &zone->list[LRU_NORECLAIM]);
+	}
+}
+
 /**
  * scan_zone_noreclaim_pages(@zone)
  * @zone - zone to scan
@@ -2273,8 +2296,6 @@ int page_reclaimable(struct page *page, 
 void scan_zone_noreclaim_pages(struct zone *zone)
 {
 	struct list_head *l_noreclaim = &zone->list[LRU_NORECLAIM];
-	struct list_head *l_inactive_anon  = &zone->list[LRU_INACTIVE_ANON];
-	struct list_head *l_inactive_file  = &zone->list[LRU_INACTIVE_FILE];
 	unsigned long scan;
 	unsigned long nr_to_scan = zone_page_state(zone, NR_NORECLAIM);
 
@@ -2286,26 +2307,15 @@ void scan_zone_noreclaim_pages(struct zo
 		for (scan = 0;  scan < batch_size; scan++) {
 			struct page* page = lru_to_page(l_noreclaim);
 
-			if (unlikely(!PageLRU(page) || !PageNoreclaim(page)))
+			if (TestSetPageLocked(page))
 				continue;
 
 			prefetchw_prev_lru_page(page, l_noreclaim, flags);
 
-			ClearPageNoreclaim(page); /* for page_reclaimable() */
-			if(page_reclaimable(page, NULL)) {
-				__dec_zone_state(zone, NR_NORECLAIM);
-				if (page_file_cache(page)) {
-					list_move(&page->lru, l_inactive_file);
-					__inc_zone_state(zone, NR_INACTIVE_FILE);
-				} else {
-					list_move(&page->lru, l_inactive_anon);
-					__inc_zone_state(zone, NR_INACTIVE_ANON);
-				}
-			} else {
-				SetPageNoreclaim(page);
-				list_move(&page->lru, l_noreclaim);
-			}
+			if (likely(PageLRU(page) && PageNoreclaim(page)))
+				check_move_noreclaim_page(page, zone);
 
+			unlock_page(page);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 
@@ -2335,6 +2345,62 @@ void scan_all_zones_noreclaim_pages(void
 	}
 }
 
+/**
+ * scan_mapping_noreclaim_pages(mapping)
+ * @mapping - struct address_space to scan for reclaimable pages
+ *
+ * scan all pages in mapping.  check non-reclaimable pages for
+ * reclaimabililty and move them to the appropriate zone lru list.
+ */
+void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+	pgoff_t next = 0;
+	pgoff_t end   = i_size_read(mapping->host);
+	struct zone *zone;
+	struct pagevec pvec;
+
+	if (mapping->nrpages == 0)
+		return;
+
+	pagevec_init(&pvec, 0);
+	while (next < end &&
+		pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		int i;
+
+		zone = NULL;
+
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			pgoff_t page_index = page->index;
+			struct zone *pagezone = page_zone(page);
+
+			if (page_index > next)
+				next = page_index;
+			next++;
+
+			if (TestSetPageLocked(page))
+				continue;
+
+			if (pagezone != zone) {
+				if (zone)
+					spin_unlock(&zone->lru_lock);
+				zone = pagezone;
+				spin_lock(&zone->lru_lock);
+			}
+
+			if (PageLRU(page) && PageNoreclaim(page))
+				check_move_noreclaim_page(page, zone);
+
+			unlock_page(page);
+
+		}
+		if (zone)
+			spin_unlock(&zone->lru_lock);
+		pagevec_release(&pvec);
+	}
+
+}
+
 /*
  * scan_noreclaim_pages [vm] sysctl handler.  On demand re-scan of
  * all nodes' noreclaim lists for reclaimable pages
Index: linux-2.6.24-rc4-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/swap.h
+++ linux-2.6.24-rc4-mm1/include/linux/swap.h
@@ -218,6 +218,7 @@ static inline int zone_reclaim(struct zo
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
 extern void scan_zone_noreclaim_pages(struct zone *);
 extern void scan_all_zones_noreclaim_pages(void);
+extern void scan_mapping_noreclaim_pages(struct address_space *);
 extern unsigned long scan_noreclaim_pages;
 extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
 					void __user *, size_t *, loff_t *);
@@ -231,6 +232,9 @@ static inline int page_reclaimable(struc
 }
 static inline void scan_zone_noreclaim_pages(struct zone *z) { }
 static inline void scan_all_zones_noreclaim_pages(void) { }
+static inline void scan_mapping_noreclaim_pages(struct address_space *mapping)
+{
+}
 static inline int scan_noreclaim_register_node(struct node *node)
 {
 	return 0;

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 17/20] non-reclaimable mlocked pages
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (15 preceding siblings ...)
  2007-12-18 21:15 ` [patch 16/20] SHM_LOCKED pages are nonreclaimable Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-19  0:56   ` Nick Piggin
  2007-12-18 21:15 ` [patch 18/20] mlock vma pages under mmap_sem held for read Rik van Riel
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.1-prepare-for-mlocked-pages.patch --]
[-- Type: text/plain, Size: 30034 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix page flags macros for *PageMlocked() when not configured.
+ ensure lru_add_drain_all() runs on all cpus when NORECLAIM_MLOCK
  is configured.  Previously this was only done for NUMA.

V1 -> V2:
+ moved this patch [and related patches] up to right after
  ramdisk/ramfs and SHM_LOCKed patches.
+ add [back] missing put_page() in putback_lru_page().
  This solved page leakage as seen by stats in previous
  version.
+ fix up munlock_vma_page() to isolate page from lru
  before calling try_to_unlock().  Think I detected a
  race here.
+ use TestClearPageMlock() on old page in migrate.c's
  migrate_page_copy() to clean up old page.
+ live dangerously:  remove TestSetPageLocked() in
  is_mlocked_vma()--it should only be called on new pages in
  the fault path--iff we choose to cull there [later patch].
+ Add PG_mlocked to free_pages_check() etc to detect mlock
  state mismanagement.
  NOTE:  temporarily [???] commented out--tripping over it
  under load.  Why?

Rework of a patch by Nick Piggin -- part 1 of 2.

This patch:

1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
   stub version of the mlock/noreclaim APIs when it's
   not configured.  Depends on [CONFIG_]NORECLAIM.

2) add yet another page flag--PG_mlocked--to indicate that
   the page is mlocked, for efficient testing in vmscan and,
   optionally, the fault path.  This allows early culling of
   nonreclaimable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.

   Uses a bit available only to 64-bit systems.

   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_noreclaim
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.

3) add the mlock/noreclaim infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on the noreclaim
   LRU list.

4) update vmscan.c:page_reclaimable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull nonreclaimable pages in fault
   path" patch is included.

5) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.  
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.

mm/internal.h and mm/mlock.c changes:
Originally Signed-off-by: Nick Piggin <npiggin@suse.de>

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>


Index: linux-2.6.24-rc4-mm1/mm/Kconfig
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/Kconfig
+++ linux-2.6.24-rc4-mm1/mm/Kconfig
@@ -204,3 +204,17 @@ config NORECLAIM
 	  may be non-reclaimable because:  they are locked into memory, they
 	  are anonymous pages for which no swap space exists, or they are anon
 	  pages that are expensive to unmap [long anon_vma "related vma" list.]
+
+config NORECLAIM_MLOCK
+	bool "Exclude mlock'ed pages from reclaim"
+	depends on NORECLAIM
+	help
+	  Treats mlock'ed pages as non-reclaimable.  Removing
+	  these pages from the LRU [in]active lists avoids the
+	  overhead of attempting to reclaim them.
+
+	  Pages marked non-reclaimable for this reason will
+	  become reclaimable again when the last mlock on them
+	  is removed, at which point they are returned to the
+	  normal LRU lists.
+
Index: linux-2.6.24-rc4-mm1/mm/internal.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/internal.h
+++ linux-2.6.24-rc4-mm1/mm/internal.h
@@ -36,6 +36,60 @@ static inline void __put_page(struct pag
 
 extern int isolate_lru_page(struct page *page);
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * called only for new pages in fault path
+ */
+extern int is_mlocked_vma(struct vm_area_struct *, struct page *);
+
+/*
+ * must be called with vma's mmap_sem held for read, and page locked.
+ */
+extern void mlock_vma_page(struct page *page);
+
+extern int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, int lock);
+
+/*
+ * mlock all pages in this vma range.  For mmap()/mremap()/...
+ */
+static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	__mlock_vma_pages_range(vma, start, end, 1);
+}
+
+/*
+ * munlock range of pages.   For munmap() and exit().
+ * Always called to operate on a full vma that is being unmapped.
+ */
+static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+// TODO:  verify my assumption.  Should we just drop the start/end args?
+	VM_BUG_ON(start != vma->vm_start || end != vma->vm_end);
+
+	vma->vm_flags &= ~VM_LOCKED;	/* try_to_unlock() needs this */
+	__mlock_vma_pages_range(vma, start, end, 0);
+}
+
+extern void clear_page_mlock(struct page *page);
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+{
+	return 0;
+}
+static inline void clear_page_mlock(struct page *page) { }
+static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { }
+static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { }
+
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
+
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
Index: linux-2.6.24-rc4-mm1/mm/mlock.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/mlock.c
+++ linux-2.6.24-rc4-mm1/mm/mlock.c
@@ -8,10 +8,16 @@
 #include <linux/capability.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/rmap.h>
+#include <linux/mmzone.h>
+
+#include "internal.h"
 
 int can_do_mlock(void)
 {
@@ -23,19 +29,224 @@ int can_do_mlock(void)
 }
 EXPORT_SYMBOL(can_do_mlock);
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * Mlocked pages are marked with PageMlocked() flag for efficient testing
+ * in vmscan and, possibly, the fault path.
+ *
+ * An mlocked page [PageMlocked(page)] is non-reclaimable.  As such, it will
+ * be placed on the LRU "noreclaim" list, rather than the [in]active lists.
+ * The noreclaim list is an LRU sibling list to the [in]active lists.
+ * PageNoreclaim is set to indicate the non-reclaimable state.
+ *
+//TODO:  no longer counting, but does this still apply to lazy setting
+// of PageMlocked() ??
+ * When lazy incrementing via vmscan, it is important to ensure that the
+ * vma's VM_LOCKED status is not concurrently being modified, otherwise we
+ * may have elevated mlock_count of a page that is being munlocked. So lazy
+ * mlocked must take the mmap_sem for read, and verify that the vma really
+ * is locked (see mm/rmap.c).
+ */
+
+/*
+ * add isolated page to appropriate LRU list, adjusting stats as needed.
+ * Page may still be non-reclaimable for other reasons.
+//TODO:  move to vmscan.c as global along with isolate_lru_page()?
+ */
+static void putback_lru_page(struct page *page)
+{
+	VM_BUG_ON(PageLRU(page));
+
+	ClearPageNoreclaim(page);
+	ClearPageActive(page);
+	lru_cache_add_active_or_noreclaim(page, NULL);
+	put_page(page);		/* ref from isolate */
+}
+
+/*
+ * Clear the page's PageMlocked().  This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
+ */
+void clear_page_mlock(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (likely(!PageMlocked(page)))
+		return;
+	ClearPageMlocked(page);
+	if (!isolate_lru_page(page))
+		putback_lru_page(page);
+}
+
+/*
+ * Mark page as mlocked if not already.
+ * If page on LRU, isolate and putback to move to noreclaim list.
+ */
+void mlock_vma_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+			putback_lru_page(page);
+}
+
+/*
+ * called from munlock()/munmap() path with page supposedly on the LRU.
+ *
+ * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
+ * [in try_to_unlock()] and then attempt to isolate the page.  We must
+ * isolate the page() to keep others from messing with its noreclaim
+ * and mlocked state while trying to unlock.  However, we pre-clear the
+ * mlocked state anyway as we might lose the isolation race and we might
+ * not get another chance to clear PageMlocked.  If we successfully
+ * isolate the page and try_to_unlock() detects other VM_LOCKED vmas
+ * mapping the page, we just restore the PageMlocked state.  If we lose
+ * the isolation race, and the page is mapped by other VM_LOCKED vmas,
+ * we'll detect this in try_to_unmap() and we'll call mlock_vma_page()
+ * above, if/when we try to reclaim the page.
+ */
+static void munlock_vma_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
+		if (try_to_unlock(page) == SWAP_MLOCK)
+			SetPageMlocked(page);	/* still VM_LOCKED */
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * Called in fault path via page_reclaimable() for a new page
+ * to determine if it's being mapped into a LOCKED vma.
+ * If so, mark page as mlocked.
+ */
+int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+{
+	VM_BUG_ON(PageMlocked(page));	// TODO:  needed?
+	VM_BUG_ON(PageLRU(page));
+
+	if (likely(!(vma->vm_flags & VM_LOCKED)))
+		return 0;
+
+	SetPageMlocked(page);
+	return 1;
+}
+
+/*
+ * mlock or munlock a range of pages in the vma depending on whether
+ * @lock is 1 or 0, respectively.  @lock must match vm_flags VM_LOCKED
+ * state.
+TODO:   we don't really need @lock, as we can determine it from vm_flags
+ *
+ * This takes care of making the pages present too.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, int lock)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start;
+	struct page *pages[16]; /* 16 gives a reasonable batch */
+	int write = !!(vma->vm_flags & VM_WRITE);
+	int nr_pages;
+	int ret = 0;
+
+	BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+	VM_BUG_ON(lock != !!(vma->vm_flags & VM_LOCKED));
+
+	if (vma->vm_flags & VM_IO)
+		return ret;
+
+	lru_add_drain_all();	/* push cached pages to LRU */
+
+	nr_pages = (end - start) / PAGE_SIZE;
+
+	while (nr_pages > 0) {
+		int i;
+
+		cond_resched();
+
+		/*
+		 * get_user_pages makes pages present if we are
+		 * setting mlock.
+		 */
+		ret = get_user_pages(current, mm, addr,
+				min_t(int, nr_pages, ARRAY_SIZE(pages)),
+				write, 0, pages, NULL);
+		if (ret < 0)
+			break;
+		if (ret == 0) {
+			/*
+			 * We know the vma is there, so the only time
+			 * we cannot get a single page should be an
+			 * error (ret < 0) case.
+			 */
+			WARN_ON(1);
+			ret = -EFAULT;
+			break;
+		}
+
+		lru_add_drain();	/* push cached pages to LRU */
+
+		for (i = 0; i < ret; i++) {
+			struct page *page = pages[i];
+
+			lock_page(page);
+			if (lock)
+				mlock_vma_page(page);
+			else
+				munlock_vma_page(page);
+			unlock_page(page);
+			put_page(page);		/* ref from get_user_pages() */
+
+			addr += PAGE_SIZE;	/* for next get_user_pages() */
+			nr_pages--;
+		}
+	}
+
+	lru_add_drain_all();	/* to update stats */
+
+	return ret;
+}
+
+#else /* CONFIG_NORECLAIM_MLOCK */
+
+/*
+ * Just make pages present if @lock true.  No-op if unlocking.
+ */
+int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, int lock)
+{
+	int ret = 0;
+
+	if (!lock || vma->vm_flags & VM_IO)
+		return ret;
+
+	return make_pages_present(start, end);
+}
+#endif /* CONFIG_NORECLAIM_MLOCK */
+
 static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	unsigned long start, unsigned long end, unsigned int newflags)
 {
-	struct mm_struct * mm = vma->vm_mm;
+	struct mm_struct *mm = vma->vm_mm;
 	pgoff_t pgoff;
-	int pages;
+	int nr_pages;
 	int ret = 0;
+	int lock;
 
 	if (newflags == vma->vm_flags) {
 		*prev = vma;
 		goto out;
 	}
 
+//TODO:  linear_page_index() ?   non-linear pages?
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma));
@@ -59,24 +270,25 @@ static int mlock_fixup(struct vm_area_st
 	}
 
 success:
+	lock = !!(newflags & VM_LOCKED);
+
+	/*
+	 * Keep track of amount of locked VM.
+	 */
+	nr_pages = (end - start) >> PAGE_SHIFT;
+	if (!lock)
+		nr_pages = -nr_pages;
+	mm->locked_vm += nr_pages;
+
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
-	 * set VM_LOCKED, make_pages_present below will bring it back.
+	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 	vma->vm_flags = newflags;
 
-	/*
-	 * Keep track of amount of locked VM.
-	 */
-	pages = (end - start) >> PAGE_SHIFT;
-	if (newflags & VM_LOCKED) {
-		pages = -pages;
-		if (!(newflags & VM_IO))
-			ret = make_pages_present(start, end);
-	}
+	__mlock_vma_pages_range(vma, start, end, lock);
 
-	mm->locked_vm -= pages;
 out:
 	if (ret == -ENOMEM)
 		ret = -EAGAIN;
Index: linux-2.6.24-rc4-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c
+++ linux-2.6.24-rc4-mm1/mm/vmscan.c
@@ -2238,10 +2238,11 @@ int zone_reclaim(struct zone *zone, gfp_
  *
  * @page       - page to test
  * @vma        - vm area in which page is/will be mapped.  May be NULL.
- *               If !NULL, called from fault path.
+ *               If !NULL, called from fault path for a new page.
  *
  * Reasons page might not be reclaimable:
- * + page's mapping marked non-reclaimable
+ * 1) page's mapping marked non-reclaimable
+ * 2) page is mlock'ed into memory.
  * TODO - later patches
  *
  * TODO:  specify locking assumptions
@@ -2254,6 +2255,11 @@ int page_reclaimable(struct page *page, 
 	if (mapping_non_reclaimable(page_mapping(page)))
 		return 0;
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+	if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+		return 0;
+#endif
+
 	/* TODO:  test page [!]reclaimable conditions */
 
 	return 1;
Index: linux-2.6.24-rc4-mm1/include/linux/page-flags.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.24-rc4-mm1/include/linux/page-flags.h
@@ -110,6 +110,7 @@
 #define PG_uncached		31	/* Page has been mapped as uncached */
 
 #define PG_noreclaim		30	/* Page is "non-reclaimable"  */
+#define PG_mlocked		29	/* Page is vma mlocked */
 #endif
 
 /*
@@ -163,6 +164,7 @@ static inline void SetPageUptodate(struc
 #define SetPageActive(page)	set_bit(PG_active, &(page)->flags)
 #define ClearPageActive(page)	clear_bit(PG_active, &(page)->flags)
 #define __ClearPageActive(page)	__clear_bit(PG_active, &(page)->flags)
+#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)
 #define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
 
 #define PageSlab(page)		test_bit(PG_slab, &(page)->flags)
@@ -270,8 +272,17 @@ static inline void __ClearPageTail(struc
 #define SetPageNoreclaim(page)	set_bit(PG_noreclaim, &(page)->flags)
 #define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags)
 #define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags)
-#define TestClearPageNoreclaim(page) test_and_clear_bit(PG_noreclaim, \
-							 &(page)->flags)
+#define TestClearPageNoreclaim(page) \
+				test_and_clear_bit(PG_noreclaim, &(page)->flags)
+#ifdef CONFIG_NORECLAIM_MLOCK
+#define PageMlocked(page)	test_bit(PG_mlocked, &(page)->flags)
+#define SetPageMlocked(page)	set_bit(PG_mlocked, &(page)->flags)
+#define ClearPageMlocked(page) clear_bit(PG_mlocked, &(page)->flags)
+#define __ClearPageMlocked(page) __clear_bit(PG_mlocked, &(page)->flags)
+#define TestSetPageMlocked(page) test_and_set_bit(PG_mlocked, &(page)->flags)
+#define TestClearPageMlocked(page) \
+				test_and_clear_bit(PG_mlocked, &(page)->flags)
+#endif
 #else
 #define PageNoreclaim(page)	0
 #define SetPageNoreclaim(page)
@@ -279,6 +290,14 @@ static inline void __ClearPageTail(struc
 #define __ClearPageNoreclaim(page)
 #define TestClearPageNoreclaim(page) 0
 #endif
+#ifndef CONFIG_NORECLAIM_MLOCK
+#define PageMlocked(page)	0
+#define SetPageMlocked(page)
+#define ClearPageMlocked(page)
+#define __ClearPageMlocked(page)
+#define TestSetPageMlocked(page) 0
+#define TestClearPageMlocked(page) 0
+#endif
 
 #define PageUncached(page)	test_bit(PG_uncached, &(page)->flags)
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
Index: linux-2.6.24-rc4-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.24-rc4-mm1.orig/include/linux/rmap.h
+++ linux-2.6.24-rc4-mm1/include/linux/rmap.h
@@ -112,6 +112,17 @@ unsigned long page_address_in_vma(struct
  */
 int page_mkclean(struct page *);
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/*
+ * called in munlock()/munmap() path to check for other vmas holding
+ * the page mlocked.
+ */
+int try_to_unlock(struct page *);
+#define TRY_TO_UNLOCK 1
+#else
+#define TRY_TO_UNLOCK 0		/* for compiler -- dead code elimination */
+#endif
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
@@ -135,5 +146,6 @@ static inline int page_mkclean(struct pa
 #define SWAP_SUCCESS	0
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
+#define SWAP_MLOCK	3
 
 #endif	/* _LINUX_RMAP_H */
Index: linux-2.6.24-rc4-mm1/mm/rmap.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/rmap.c
+++ linux-2.6.24-rc4-mm1/mm/rmap.c
@@ -52,6 +52,8 @@
 
 #include <asm/tlbflush.h>
 
+#include "internal.h"
+
 struct kmem_cache *anon_vma_cachep;
 
 /* This must be called under the mmap_sem. */
@@ -284,10 +286,17 @@ static int page_referenced_one(struct pa
 	if (!pte)
 		goto out;
 
+	/*
+	 * Don't want to elevate referenced for mlocked page that gets this far,
+	 * in order that it progresses to try_to_unmap and is moved to the
+	 * noreclaim list.
+	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young(vma, address, pte))
+		goto out_unmap;
+	}
+
+	if (ptep_clear_flush_young(vma, address, pte))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -296,6 +305,7 @@ static int page_referenced_one(struct pa
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
+out_unmap:
 	(*mapcount)--;
 	pte_unmap_unlock(pte, ptl);
 out:
@@ -384,11 +394,6 @@ static int page_referenced_file(struct p
 		 */
 		if (mem_cont && (mm_cgroup(vma->vm_mm) != mem_cont))
 			continue;
-		if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
-				  == (VM_LOCKED|VM_MAYSHARE)) {
-			referenced++;
-			break;
-		}
 		referenced += page_referenced_one(page, vma, &mapcount);
 		if (!mapcount)
 			break;
@@ -712,10 +717,15 @@ static int try_to_unmap_one(struct page 
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
+	if (!migration) {
+		if (vma->vm_flags & VM_LOCKED) {
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
+		if (ptep_clear_flush_young(vma, address, pte)) {
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
 	}
 
 	/* Nuke the page table entry. */
@@ -797,6 +807,10 @@ out:
  * For very sparsely populated VMAs this is a little inefficient - chances are
  * there there won't be many ptes located within the scan cluster.  In this case
  * maybe we could scan further - to the end of the pte page, perhaps.
+ *
+TODO:  still accurate with noreclaim infrastructure?
+ * Mlocked pages also aren't handled very well at the moment: they aren't
+ * moved off the LRU like they are for linear pages.
  */
 #define CLUSTER_SIZE	min(32*PAGE_SIZE, PMD_SIZE)
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
@@ -868,10 +882,28 @@ static void try_to_unmap_cluster(unsigne
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static int try_to_unmap_anon(struct page *page, int migration)
+/**
+ * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock:  request for unlock rather than unmap [unlikely]
+ * @migration:  unmapping for migration - ignored if @unlock
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the anon_vma struct it points to.
+ *
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * anonymous pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write.  So, we won't recheck
+ * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
+ * 'LOCKED.
+ */
+static int try_to_unmap_anon(struct page *page, int unlock, int migration)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
+	unsigned int mlocked = 0;
 	int ret = SWAP_AGAIN;
 
 	anon_vma = page_lock_anon_vma(page);
@@ -879,25 +911,53 @@ static int try_to_unmap_anon(struct page
 		return ret;
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
-		ret = try_to_unmap_one(page, vma, migration);
+		if (TRY_TO_UNLOCK && unlikely(unlock)) {
+			if (!(vma->vm_flags & VM_LOCKED))
+				continue;	/* must visit all vmas */
+			mlocked++;
+			break;			/* no need to look further */
+		} else
+			ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			break;
+		if (ret == SWAP_MLOCK) {
+			if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+				if (vma->vm_flags & VM_LOCKED) {
+					mlock_vma_page(page);
+					mlocked++;
+				}
+				up_read(&vma->vm_mm->mmap_sem);
+			}
+		}
 	}
-
 	page_unlock_anon_vma(anon_vma);
+
+	if (mlocked)
+		ret = SWAP_MLOCK;
+	else if (ret == SWAP_MLOCK)
+		ret = SWAP_AGAIN;
+
 	return ret;
 }
 
 /**
- * try_to_unmap_file - unmap file page using the object-based rmap method
- * @page: the page to unmap
+ * try_to_unmap_file - unmap or unlock file page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock:  request for unlock rather than unmap [unlikely]
+ * @migration:  unmapping for migration - ignored if @unlock
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
  *
- * This function is only called from try_to_unmap for object-based pages.
+ * This function is only called from try_to_unmap/try_to_unlock for
+ * object-based pages.
+ * When called from try_to_unlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write.  So, we won't recheck
+ * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
+ * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int migration)
+static int try_to_unmap_file(struct page *page, int unlock, int migration)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -908,20 +968,47 @@ static int try_to_unmap_file(struct page
 	unsigned long max_nl_cursor = 0;
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
+	unsigned int mlocked = 0;
 
 	read_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
-		ret = try_to_unmap_one(page, vma, migration);
+		if (TRY_TO_UNLOCK && unlikely(unlock)) {
+			if (!(vma->vm_flags & VM_LOCKED))
+				continue;	/* must visit all vmas */
+			mlocked++;
+			break;			/* no need to look further */
+		} else
+			ret = try_to_unmap_one(page, vma, migration);
 		if (ret == SWAP_FAIL || !page_mapped(page))
 			goto out;
+		if (ret == SWAP_MLOCK) {
+			if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+				if (vma->vm_flags & VM_LOCKED) {
+					mlock_vma_page(page);
+					mlocked++;
+				}
+				up_read(&vma->vm_mm->mmap_sem);
+			}
+			if (unlikely(unlock))
+				break;	/* stop on 1st mlocked vma */
+		}
 	}
 
+	if (mlocked)
+		goto out;
+
 	if (list_empty(&mapping->i_mmap_nonlinear))
 		goto out;
 
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-		if ((vma->vm_flags & VM_LOCKED) && !migration)
+		if (TRY_TO_UNLOCK && unlikely(unlock)) {
+			if (!(vma->vm_flags & VM_LOCKED))
+				continue;	/* must visit all vmas */
+			mlocked++;
+			goto out;		/* no need to look further */
+		}
+		if (!migration && (vma->vm_flags & VM_LOCKED))
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
@@ -955,8 +1042,6 @@ static int try_to_unmap_file(struct page
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if ((vma->vm_flags & VM_LOCKED) && !migration)
-				continue;
 			cursor = (unsigned long) vma->vm_private_data;
 			while ( cursor < max_nl_cursor &&
 				cursor < vma->vm_end - vma->vm_start) {
@@ -981,6 +1066,10 @@ static int try_to_unmap_file(struct page
 		vma->vm_private_data = NULL;
 out:
 	read_unlock(&mapping->i_mmap_lock);
+	if (mlocked)
+		ret = SWAP_MLOCK;
+	else if (ret == SWAP_MLOCK)
+		ret = SWAP_AGAIN;
 	return ret;
 }
 
@@ -995,6 +1084,7 @@ out:
  * SWAP_SUCCESS	- we succeeded in removing all mappings
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
+ * SWAP_MLOCK	- page is mlocked.
  */
 int try_to_unmap(struct page *page, int migration)
 {
@@ -1003,12 +1093,32 @@ int try_to_unmap(struct page *page, int 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, migration);
+		ret = try_to_unmap_anon(page, 0, migration);
 	else
-		ret = try_to_unmap_file(page, migration);
-
-	if (!page_mapped(page))
+		ret = try_to_unmap_file(page, 0, migration);
+	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
 }
 
+#ifdef CONFIG_NORECLAIM_MLOCK
+/**
+ * try_to_unlock - Check page's rmap for other vma's holding page locked.
+ * @page: the page to be unlocked.   will be returned with PG_mlocked
+ * cleared if no vmas are VM_LOCKED.
+ *
+ * Return values are:
+ *
+ * SWAP_SUCCESS	- no vma's holding page locked.
+ * SWAP_MLOCK	- page is mlocked.
+ */
+int try_to_unlock(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
+
+	if (PageAnon(page))
+		return(try_to_unmap_anon(page, 1, 0));
+	else
+		return(try_to_unmap_file(page, 1, 0));
+}
+#endif
Index: linux-2.6.24-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/migrate.c
+++ linux-2.6.24-rc4-mm1/mm/migrate.c
@@ -366,6 +366,9 @@ static void migrate_page_copy(struct pag
 		set_page_dirty(newpage);
  	}
 
+	if (TestClearPageMlocked(page))
+		SetPageMlocked(newpage);
+
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);
 #endif
Index: linux-2.6.24-rc4-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/page_alloc.c
+++ linux-2.6.24-rc4-mm1/mm/page_alloc.c
@@ -255,6 +255,7 @@ static void bad_page(struct page *page)
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_swapbacked |
+			1 << PG_mlocked |
 			1 << PG_buddy );
 	set_page_count(page, 0);
 	reset_page_mapcount(page);
@@ -484,6 +485,9 @@ static inline int free_pages_check(struc
 			1 << PG_writeback |
 			1 << PG_reserved |
 			1 << PG_noreclaim |
+// TODO:  always trip this under heavy workloads.
+// Why isn't this being cleared on last unmap/unlock?
+//			1 << PG_mlocked |
 			1 << PG_buddy ))))
 		bad_page(page);
 	if (PageDirty(page))
@@ -638,6 +642,8 @@ static int prep_new_page(struct page *pa
 			1 << PG_writeback |
 			1 << PG_reserved |
 			1 << PG_swapbacked |
+//TODO:  why hitting this?
+//			1 << PG_mlocked |
 			1 << PG_buddy ))))
 		bad_page(page);
 
@@ -650,7 +656,9 @@ static int prep_new_page(struct page *pa
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
+			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk |
+//TODO take care of it here, for now.
+			1 << PG_mlocked );
 	set_page_private(page, 0);
 	set_page_refcounted(page);
 
Index: linux-2.6.24-rc4-mm1/mm/swap.c
===================================================================
--- linux-2.6.24-rc4-mm1.orig/mm/swap.c
+++ linux-2.6.24-rc4-mm1/mm/swap.c
@@ -346,7 +346,7 @@ void lru_add_drain(void)
 	put_cpu();
 }
 
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_NORECLAIM_MLOCK)
 static void lru_add_drain_per_cpu(struct work_struct *dummy)
 {
 	lru_add_drain();

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 18/20] mlock vma pages under mmap_sem held for read
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (16 preceding siblings ...)
  2007-12-18 21:15 ` [patch 17/20] non-reclaimable mlocked pages Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 19/20] handle mlocked pages during map/unmap and truncate Rik van Riel
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.1a-lock-vma-pages-under-read-lock.patch --]
[-- Type: text/plain, Size: 6588 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no change]
+ fix function return types [void -> int] to fix build when
  not configured.

New in V2.

We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to
very long lock hold times attempting to fault in a large memory region
to mlock it into memory.   This can hold off other faults against the
mm [multithreaded tasks] and other scans of the mm, such as via /proc.
To alleviate this, downgrade the mmap_sem to read mode during the 
population of the region for locking.  This is especially the case 
if we need to reclaim memory to lock down the region.  We [probably?]
don't need to do this for unlocking as all of the pages should be
resident--they're already mlocked.

Now, the callers of the mlock functions [mlock_fixup() and 
mlock_vma_pages_range()] expect the mmap_sem to be returned in write
mode.  Changing all callers appears to be way too much effort at this
point.  So, restore write mode before returning.  Note that this opens
a window where the mmap list could change in a multithreaded process.
So, at least for mlock_fixup(), where we could be called in a loop over
multiple vmas, we check that a vma still exists at the start address
and that vma still covers the page range [start,end).  If not, we return
an error, -EAGAIN, and let the caller deal with it.

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup()
if the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller
deal with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  
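
For reference, the protocol described above boils down to something like
the following condensed sketch (illustrative only -- the real code is in
the mlock.c hunks below, and the helper name here is made up):

	/*
	 * Sketch: mmap_sem is held for write on entry and on return,
	 * but only for read while the pages are faulted in.
	 */
	static int mlock_range_sketch(struct mm_struct *mm,
				      struct vm_area_struct *vma,
				      unsigned long start, unsigned long end)
	{
		downgrade_write(&mm->mmap_sem);	/* let readers in */

		__mlock_vma_pages_range(vma, start, end, 1);	/* slow part */

		up_read(&mm->mmap_sem);
		down_write(&mm->mmap_sem);	/* no atomic read->write upgrade */

		/* the map may have changed while mmap_sem was dropped */
		vma = find_vma(mm, start);
		if (!vma || end > vma->vm_end)
			return -EAGAIN;		/* caller must cope */
		return 0;
	}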

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: Linux/mm/mlock.c
===================================================================
--- Linux.orig/mm/mlock.c	2007-11-12 16:21:59.000000000 -0500
+++ Linux/mm/mlock.c	2007-11-12 16:22:01.000000000 -0500
@@ -215,6 +215,37 @@ int __mlock_vma_pages_range(struct vm_ar
 	return ret;
 }
 
+/**
+ * mlock_vma_pages_range
+ * @vma - vm area to mlock into memory
+ * @start - start address in @vma of range to mlock,
+ * @end   - end address in @vma of range
+ *
+ * Called with current->mm->mmap_sem held write locked.  Downgrade to read
+ * for faulting in pages.  This can take a looong time for large segments.
+ *
+ * We need to restore the mmap_sem to write locked because our callers'
+ * callers expect this.	 However, because the mmap could have changed
+ * [in a multi-threaded process], we need to recheck.
+ */
+int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	downgrade_write(&mm->mmap_sem);
+	__mlock_vma_pages_range(vma, start, end, 1);
+
+	up_read(&mm->mmap_sem);
+	/* vma can change or disappear */
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	/* non-NULL vma must contain @start, but need to check @end */
+	if (!vma ||  end > vma->vm_end)
+		return -EAGAIN;
+	return 0;
+}
+
 #else /* CONFIG_NORECLAIM_MLOCK */
 
 /*
@@ -281,14 +312,38 @@ success:
 	mm->locked_vm += nr_pages;
 
 	/*
-	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * vm_flags is protected by the mmap_sem held for write.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 	vma->vm_flags = newflags;
 
+	/*
+	 * mmap_sem is currently held for write.  If we're locking pages,
+	 * downgrade the write lock to a read lock so that other faults,
+	 * mmap scans, ... while we fault in all pages.
+	 */
+	if (lock)
+		downgrade_write(&mm->mmap_sem);
+
 	__mlock_vma_pages_range(vma, start, end, lock);
 
+	if (lock) {
+		/*
+		 * Need to reacquire mmap sem in write mode, as our callers
+		 * expect this.  We have no support for atomically upgrading
+		 * a sem to write, so we need to check for changes while sem
+		 * is unlocked.
+		 */
+		up_read(&mm->mmap_sem);
+		/* vma can change or disappear */
+		down_write(&mm->mmap_sem);
+		*prev = find_vma(mm, start);
+		/* non-NULL *prev must contain @start, but need to check @end */
+		if (!(*prev) || end > (*prev)->vm_end)
+			ret = -EAGAIN;
+	}
+
 out:
 	if (ret == -ENOMEM)
 		ret = -EAGAIN;
Index: Linux/mm/internal.h
===================================================================
--- Linux.orig/mm/internal.h	2007-11-12 16:21:59.000000000 -0500
+++ Linux/mm/internal.h	2007-11-12 16:22:01.000000000 -0500
@@ -53,24 +53,21 @@ extern int __mlock_vma_pages_range(struc
 /*
  * mlock all pages in this vma range.  For mmap()/mremap()/...
  */
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end)
-{
-	__mlock_vma_pages_range(vma, start, end, 1);
-}
+extern int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 
 /*
  * munlock range of pages.   For munmap() and exit().
  * Always called to operate on a full vma that is being unmapped.
  */
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end)
 {
 // TODO:  verify my assumption.  Should we just drop the start/end args?
 	VM_BUG_ON(start != vma->vm_start || end != vma->vm_end);
 
 	vma->vm_flags &= ~VM_LOCKED;	/* try_to_unlock() needs this */
-	__mlock_vma_pages_range(vma, start, end, 0);
+	return __mlock_vma_pages_range(vma, start, end, 0);
 }
 
 extern void clear_page_mlock(struct page *page);
@@ -82,10 +79,10 @@ static inline int is_mlocked_vma(struct 
 }
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
-static inline void munlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end) { }
+static inline int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
+static inline int munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end) { return 0; }
 
 #endif /* CONFIG_NORECLAIM_MLOCK */
 

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 19/20] handle mlocked pages during map/unmap and truncate
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (17 preceding siblings ...)
  2007-12-18 21:15 ` [patch 18/20] mlock vma pages under mmap_sem held for read Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-18 21:15 ` [patch 20/20] account mlocked pages Rik van Riel
  2007-12-22 20:27 ` [patch 00/20] VM pageout scalability improvements Balbir Singh
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.2-move-mlocked-pages-off-the-LRU.patch --]
[-- Type: text/plain, Size: 6747 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no changes]

V1 -> V2:
+  modified mmap.c:mmap_region() to return error if mlock_vma_pages_range()
   does.  This can only occur if the vma gets removed/changed while
   we're switching mmap_sem lock modes.   Most callers don't care, but
   sys_remap_file_pages() appears to.

Rework of Nick Piggin's "mm: move mlocked pages off the LRU" patch
-- part 2 of 2.

Remove mlocked pages from the LRU using the "NoReclaim infrastructure"
during mmap()/mremap().  Try to move them back to the normal LRU lists on
munmap() when the last locked mapping is removed.  Remove the PageMlocked()
status when the page is truncated from the file.


Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

Index: Linux/mm/mmap.c
===================================================================
--- Linux.orig/mm/mmap.c	2007-11-12 16:21:48.000000000 -0500
+++ Linux/mm/mmap.c	2007-11-12 16:22:03.000000000 -0500
@@ -32,6 +32,8 @@
 #include <asm/tlb.h>
 #include <asm/mmu_context.h>
 
+#include "internal.h"
+
 #ifndef arch_mmap_check
 #define arch_mmap_check(addr, len, flags)	(0)
 #endif
@@ -1211,9 +1213,13 @@ out:	
 	vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
-	}
-	if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
+		/*
+		 * makes pages present; downgrades, drops, requires mmap_sem
+		 */
+		error = mlock_vma_pages_range(vma, addr, addr + len);
+		if (error)
+			return error;	/* vma gone! */
+	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
 		make_pages_present(addr, addr + len);
 	return addr;
 
@@ -1892,6 +1898,19 @@ int do_munmap(struct mm_struct *mm, unsi
 	vma = prev? prev->vm_next: mm->mmap;
 
 	/*
+	 * unlock any mlock()ed ranges before detaching vmas
+	 */
+	if (mm->locked_vm) {
+		struct vm_area_struct *tmp = vma;
+		while (tmp && tmp->vm_start < end) {
+			if (tmp->vm_flags & VM_LOCKED)
+				munlock_vma_pages_range(tmp,
+						 tmp->vm_start, tmp->vm_end);
+			tmp = tmp->vm_next;
+		}
+	}
+
+	/*
 	 * Remove the vma's, and unmap the actual pages
 	 */
 	detach_vmas_to_be_unmapped(mm, vma, prev, end);
@@ -2024,7 +2043,7 @@ out:
 	mm->total_vm += len >> PAGE_SHIFT;
 	if (flags & VM_LOCKED) {
 		mm->locked_vm += len >> PAGE_SHIFT;
-		make_pages_present(addr, addr + len);
+		mlock_vma_pages_range(vma, addr, addr + len);
 	}
 	return addr;
 }
@@ -2035,13 +2054,26 @@ EXPORT_SYMBOL(do_brk);
 void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather *tlb;
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	arch_exit_mmap(mm);
 
+	if (mm->locked_vm) {
+		vma = mm->mmap;
+		while (vma) {
+			if (vma->vm_flags & VM_LOCKED)
+				munlock_vma_pages_range(vma,
+						vma->vm_start, vma->vm_end);
+			vma = vma->vm_next;
+		}
+	}
+
+	vma = mm->mmap;
+
+
 	lru_add_drain();
 	flush_cache_mm(mm);
 	tlb = tlb_gather_mmu(mm, 1);
Index: Linux/mm/mremap.c
===================================================================
--- Linux.orig/mm/mremap.c	2007-11-12 16:21:48.000000000 -0500
+++ Linux/mm/mremap.c	2007-11-12 16:22:03.000000000 -0500
@@ -23,6 +23,8 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
+#include "internal.h"
+
 static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
@@ -232,8 +234,8 @@ static unsigned long move_vma(struct vm_
 	if (vm_flags & VM_LOCKED) {
 		mm->locked_vm += new_len >> PAGE_SHIFT;
 		if (new_len > old_len)
-			make_pages_present(new_addr + old_len,
-					   new_addr + new_len);
+			mlock_vma_pages_range(vma, new_addr + old_len,
+						   new_addr + new_len);
 	}
 
 	return new_addr;
@@ -373,7 +375,7 @@ unsigned long do_mremap(unsigned long ad
 			vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages);
 			if (vma->vm_flags & VM_LOCKED) {
 				mm->locked_vm += pages;
-				make_pages_present(addr + old_len,
+				mlock_vma_pages_range(vma, addr + old_len,
 						   addr + new_len);
 			}
 			ret = addr;
Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c	2007-11-12 16:21:59.000000000 -0500
+++ Linux/mm/vmscan.c	2007-11-12 16:22:03.000000000 -0500
@@ -534,6 +534,10 @@ static unsigned long shrink_page_list(st
 				goto activate_locked;
 			case SWAP_AGAIN:
 				goto keep_locked;
+			case SWAP_MLOCK:
+				ClearPageActive(page);
+				SetPageNoreclaim(page);
+				goto keep_locked;	/* to noreclaim list */
 			case SWAP_SUCCESS:
 				; /* try to free the page below */
 			}
Index: Linux/mm/filemap.c
===================================================================
--- Linux.orig/mm/filemap.c	2007-11-12 16:21:48.000000000 -0500
+++ Linux/mm/filemap.c	2007-11-12 16:22:03.000000000 -0500
@@ -2504,8 +2504,16 @@ generic_file_direct_IO(int rw, struct ki
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
 		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-	       	if (mapping_mapped(mapping))
+		if (mapping_mapped(mapping)) {
+			/*
+			 * Calling unmap_mapping_range like this is wrong,
+			 * because it can lead to mlocked pages being
+			 * discarded (this is true even before the Noreclaim
+			 * mlock work). direct-IO vs pagecache is a load of
+			 * junk anyway, so who cares.
+			 */
 			unmap_mapping_range(mapping, offset, write_len, 0);
+		}
 	}
 
 	retval = filemap_write_and_wait(mapping);
Index: Linux/mm/truncate.c
===================================================================
--- Linux.orig/mm/truncate.c	2007-11-12 16:21:48.000000000 -0500
+++ Linux/mm/truncate.c	2007-11-12 16:22:03.000000000 -0500
@@ -18,6 +18,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
+#include "internal.h"
 
 
 /**
@@ -104,6 +105,7 @@ truncate_complete_page(struct address_sp
 		do_invalidatepage(page, 0);
 
 	remove_from_page_cache(page);
+	clear_page_mlock(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);
 	page_cache_release(page);	/* pagecache ref */
@@ -128,6 +130,7 @@ invalidate_complete_page(struct address_
 	if (PagePrivate(page) && !try_to_release_page(page, 0))
 		return 0;
 
+	clear_page_mlock(page);
 	ret = remove_mapping(mapping, page);
 
 	return ret;
@@ -354,6 +357,7 @@ invalidate_complete_page2(struct address
 	if (PageDirty(page))
 		goto failed;
 
+	clear_page_mlock(page);
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch 20/20] account mlocked pages
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (18 preceding siblings ...)
  2007-12-18 21:15 ` [patch 19/20] handle mlocked pages during map/unmap and truncate Rik van Riel
@ 2007-12-18 21:15 ` Rik van Riel
  2007-12-22 20:27 ` [patch 00/20] VM pageout scalability improvements Balbir Singh
  20 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-18 21:15 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, lee.shermerhorn, Nick Piggin, Lee Schermerhorn

[-- Attachment #1: noreclaim-04.3-account-mlocked-pages.patch --]
[-- Type: text/plain, Size: 6194 bytes --]

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix definitions of NR_MLOCK to fix build errors when not configured.

V1 -> V2:
+  new in V2 -- pulled in & reworked from Nick's previous series

  From: Nick Piggin <npiggin@suse.de>
  To: Linux Memory Management <linux-mm@kvack.org>
  Cc: Nick Piggin <npiggin@suse.de>, Andrew Morton <akpm@osdl.org>
  Subject: [patch 4/4] mm: account mlocked pages
  Date:	Mon, 12 Mar 2007 07:39:14 +0100 (CET)

Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.  I don't know whether we'll want to keep
these stats in the long run, but during testing of this series, I find
them useful.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>


Index: Linux/drivers/base/node.c
===================================================================
--- Linux.orig/drivers/base/node.c	2007-11-14 09:26:19.000000000 -0500
+++ Linux/drivers/base/node.c	2007-11-14 10:06:14.000000000 -0500
@@ -55,6 +55,9 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d Inactive(file): %8lu kB\n"
 #ifdef CONFIG_NORECLAIM
 		       "Node %d Noreclaim:    %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+		       "Node %d Mlocked:       %8lu kB\n"
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		       "Node %d HighTotal:      %8lu kB\n"
@@ -82,6 +85,9 @@ static ssize_t node_read_meminfo(struct 
 		       nid, node_page_state(nid, NR_INACTIVE_FILE),
 #ifdef CONFIG_NORECLAIM
 		       nid, node_page_state(nid, NR_NORECLAIM),
+#ifdef CONFIG_NORECLAIM_MLOCK
+		       nid, K(node_page_state(nid, NR_MLOCK)),
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		       nid, K(i.totalhigh),
Index: Linux/fs/proc/proc_misc.c
===================================================================
--- Linux.orig/fs/proc/proc_misc.c	2007-11-14 09:26:19.000000000 -0500
+++ Linux/fs/proc/proc_misc.c	2007-11-14 09:27:21.000000000 -0500
@@ -160,6 +160,9 @@ static int meminfo_read_proc(char *page,
 		"Inactive(file): %8lu kB\n"
 #ifdef CONFIG_NORECLAIM
 		"Noreclaim:    %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+		"Mlocked:      %8lu kB\n"
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		"HighTotal:      %8lu kB\n"
@@ -195,6 +198,9 @@ static int meminfo_read_proc(char *page,
 		K(global_page_state(NR_INACTIVE_FILE)),
 #ifdef CONFIG_NORECLAIM
 		K(global_page_state(NR_NORECLAIM)),
+#ifdef CONFIG_NORECLAIM_MLOCK
+		K(global_page_state(NR_MLOCK)),
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		K(i.totalhigh),
Index: Linux/include/linux/mmzone.h
===================================================================
--- Linux.orig/include/linux/mmzone.h	2007-11-14 09:26:19.000000000 -0500
+++ Linux/include/linux/mmzone.h	2007-11-14 09:54:53.000000000 -0500
@@ -86,8 +86,12 @@ enum zone_stat_item {
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "           */
 #ifdef CONFIG_NORECLAIM
 	NR_NORECLAIM,	/*  "     "     "   "       "         */
+#ifdef CONFIG_NORECLAIM_MLOCK
+	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
+#endif
 #else
-	NR_NORECLAIM=NR_ACTIVE_FILE, /* avoid compiler errors in dead code */
+	NR_NORECLAIM=NR_ACTIVE_FILE,	/* avoid compiler errors in dead code */
+	NR_MLOCK=NR_ACTIVE_FILE,	/* avoid compiler errors... */
 #endif
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
Index: Linux/mm/mlock.c
===================================================================
--- Linux.orig/mm/mlock.c	2007-11-14 09:27:18.000000000 -0500
+++ Linux/mm/mlock.c	2007-11-14 10:05:37.000000000 -0500
@@ -75,11 +75,11 @@ void clear_page_mlock(struct page *page)
 {
 	BUG_ON(!PageLocked(page));
 
-	if (likely(!PageMlocked(page)))
-		return;
-	ClearPageMlocked(page);
-	if (!isolate_lru_page(page))
-		putback_lru_page(page);
+	if (unlikely(TestClearPageMlocked(page))) {
+		dec_zone_page_state(page, NR_MLOCK);
+		if (!isolate_lru_page(page))
+			putback_lru_page(page);
+	}
 }
 
 /*
@@ -90,8 +90,11 @@ void mlock_vma_page(struct page *page)
 {
 	BUG_ON(!PageLocked(page));
 
-	if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+	if (!TestSetPageMlocked(page)) {
+		inc_zone_page_state(page, NR_MLOCK);
+		if (!isolate_lru_page(page))
 			putback_lru_page(page);
+	}
 }
 
 /*
@@ -113,10 +116,22 @@ static void munlock_vma_page(struct page
 {
 	BUG_ON(!PageLocked(page));
 
-	if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
-		if (try_to_unlock(page) == SWAP_MLOCK)
-			SetPageMlocked(page);	/* still VM_LOCKED */
-		putback_lru_page(page);
+	if (TestClearPageMlocked(page)) {
+		dec_zone_page_state(page, NR_MLOCK);
+		if (!isolate_lru_page(page)) {
+			if (try_to_unlock(page) == SWAP_MLOCK) {
+				SetPageMlocked(page);	/* still VM_LOCKED */
+				inc_zone_page_state(page, NR_MLOCK);
+			}
+			putback_lru_page(page);
+		}
+		/*
+		 * Else we lost the race.  let try_to_unmap() deal with it.
+		 * At least we get the page state and mlock stats right.
+		 * However, page is still on the noreclaim list.  We'll fix
+		 * that up when the page is eventually freed or we scan the
+		 * noreclaim list.
+		 */
 	}
 }
 
@@ -133,7 +148,8 @@ int is_mlocked_vma(struct vm_area_struct
 	if (likely(!(vma->vm_flags & VM_LOCKED)))
 		return 0;
 
-	SetPageMlocked(page);
+	if (!TestSetPageMlocked(page))
+		inc_zone_page_state(page, NR_MLOCK);
 	return 1;
 }
 
Index: Linux/mm/migrate.c
===================================================================
--- Linux.orig/mm/migrate.c	2007-11-14 09:27:17.000000000 -0500
+++ Linux/mm/migrate.c	2007-11-14 09:27:21.000000000 -0500
@@ -371,8 +371,15 @@ static void migrate_page_copy(struct pag
 		set_page_dirty(newpage);
  	}
 
-	if (TestClearPageMlocked(page))
+	if (TestClearPageMlocked(page)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		__dec_zone_page_state(page, NR_MLOCK);
 		SetPageMlocked(newpage);
+		__inc_zone_page_state(newpage, NR_MLOCK);
+		local_irq_restore(flags);
+	}
 
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);

-- 
All Rights Reversed


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-18 21:15 ` [patch 02/20] make the inode i_mmap_lock a reader/writer lock Rik van Riel
@ 2007-12-19  0:48   ` Nick Piggin
  2007-12-19  4:09     ` KOSAKI Motohiro
  2007-12-19 15:52     ` Lee Schermerhorn
  0 siblings, 2 replies; 59+ messages in thread
From: Nick Piggin @ 2007-12-19  0:48 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, lee.shermerhorn, Lee Schermerhorn

On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> I have seen soft cpu lockups in page_referenced_file() due to
> contention on i_mmap_lock() for different pages.  Making the
> i_mmap_lock a reader/writer lock should increase parallelism
> in vmscan for file back pages mapped into many address spaces.
>
> Read lock the i_mmap_lock for all usage except:
>
> 1) mmap/munmap:  linking vma into i_mmap prio_tree or removing
> 2) unmap_mapping_range:   protecting vm_truncate_count
>
> rmap:  try_to_unmap_file() required new cond_resched_rwlock().
> To reduce code duplication, I recast cond_resched_lock() as a
> [static inline] wrapper around reworked cond_sched_lock() =>
> __cond_resched_lock(void *lock, int type).
> New cond_resched_rwlock() implemented as another wrapper.

Reader/writer locks really suck in terms of fairness and starvation,
especially when the read-side is common and frequent. (also, single
threaded performance of the read-side is worse).

I know Lee saw some big latencies on the anon_vma list lock when
running (IIRC) a large benchmark... but are there more realistic
situations where this is a problem?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-18 21:15 ` [patch 17/20] non-reclaimable mlocked pages Rik van Riel
@ 2007-12-19  0:56   ` Nick Piggin
  2007-12-19 13:45     ` Rik van Riel
  2007-12-20  7:19     ` Christoph Lameter
  0 siblings, 2 replies; 59+ messages in thread
From: Nick Piggin @ 2007-12-19  0:56 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, lee.shermerhorn, Lee Schermerhorn

On Wednesday 19 December 2007 08:15, Rik van Riel wrote:

> Rework of a patch by Nick Piggin -- part 1 of 2.
>
> This patch:
>
> 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
>    stub version of the mlock/noreclaim APIs when it's
>    not configured.  Depends on [CONFIG_]NORECLAIM.
>
> 2) add yet another page flag--PG_mlocked--to indicate that
>    the page is locked for efficient testing in vmscan and,
>    optionally, fault path.  This allows early culling of
>    nonreclaimable pages, preventing them from getting to
>    page_referenced()/try_to_unmap().  Also allows separate
>    accounting of mlock'd pages, as Nick's original patch
>    did.
>
>    Uses a bit available only to 64-bit systems.
>
>    Note:  Nick's original mlock patch used a PG_mlocked
>    flag.  I had removed this in favor of the PG_noreclaim
>    flag + an mlock_count [new page struct member].  I
>    restored the PG_mlocked flag to eliminate the new
>    count field.
>
> 3) add the mlock/noreclaim infrastructure to mm/mlock.c,
>    with internal APIs in mm/internal.h.  This is a rework
>    of Nick's original patch to these files, taking into
>    account that mlocked pages are now kept on noreclaim
>    LRU list.
>
> 4) update vmscan.c:page_reclaimable() to check PageMlocked()
>    and, if vma passed in, the vm_flags.  Note that the vma
>    will only be passed in for new pages in the fault path;
>    and then only if the "cull nonreclaimable pages in fault
>    path" patch is included.
>
> 5) add try_to_unlock() to rmap.c to walk a page's rmap and
>    ClearPageMlocked() if no other vmas have it mlocked.
>    Reuses as much of try_to_unmap() as possible.  This
>    effectively replaces the use of one of the lru list links
>    as an mlock count.  If this mechanism let's pages in mlocked
>    vmas leak through w/o PG_mlocked set [I don't know that it
>    does], we should catch them later in try_to_unmap().  One
>    hopes this will be rare, as it will be relatively expensive.

Hmm, I still don't know (or forgot) why you don't just use the
old scheme of having an mlock count in the LRU bit, and removing
the mlocked page from the LRU completely.

These mlocked pages don't need to be on a non-reclaimable list,
because we can find them again via the ptes when they become
unlocked, and there is no point background scanning them, because
they're always going to be locked while they're mlocked.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-19  0:48   ` Nick Piggin
@ 2007-12-19  4:09     ` KOSAKI Motohiro
  2007-12-19 15:52     ` Lee Schermerhorn
  1 sibling, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2007-12-19  4:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Rik van Riel, linux-mm, linux-kernel,
	lee.shermerhorn, Lee Schermerhorn

Hi

> > rmap:  try_to_unmap_file() required new cond_resched_rwlock().
> > To reduce code duplication, I recast cond_resched_lock() as a
> > [static inline] wrapper around reworked cond_sched_lock() =>
> > __cond_resched_lock(void *lock, int type).
> > New cond_resched_rwlock() implemented as another wrapper.
> 
> Reader/writer locks really suck in terms of fairness and starvation,
> especially when the read-side is common and frequent. (also, single
> threaded performance of the read-side is worse).

Agreed.

rwlocks can give bad performance in some cases (especially on machines
with many cpus).

If many cpus keep taking and dropping the read lock on a large system,
then at least one cpu always holds the read lock and a cpu waiting for
the write lock may never get it.

Therefore, rwlocks often become a performance weakness under stress.


I would like to know the testcase for this patch and run it.
Do you have one?


/kosaki



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 10/20] SEQ replacement for anonymous pages
  2007-12-18 21:15 ` [patch 10/20] SEQ replacement for anonymous pages Rik van Riel
@ 2007-12-19  5:17   ` KOSAKI Motohiro
  2007-12-19 13:40     ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2007-12-19  5:17 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, linux-mm, linux-kernel, lee.shermerhorn

Hi Rik-san,

> To keep the maximum amount of necessary work reasonable, we scale the
> active to inactive ratio with the size of memory, using the formula
> active:inactive ratio = sqrt(memory in GB * 10).

Great.

why do you think the best formula is sqrt(GB*10)?
please tell me if you don't mind.

and I worry a bit about whether it works well on small systems,
because it indicates a 1:1 ratio on systems with less than 100MB of memory.
What do you think about this?


/kosaki


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 10/20] SEQ replacement for anonymous pages
  2007-12-19  5:17   ` KOSAKI Motohiro
@ 2007-12-19 13:40     ` Rik van Riel
  2007-12-20  2:04       ` KOSAKI Motohiro
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-19 13:40 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: kosaki.motohiro, linux-mm, linux-kernel, lee.shermerhorn

On Wed, 19 Dec 2007 14:17:53 +0900
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi Rik-san,
> 
> > To keep the maximum amount of necessary work reasonable, we scale the
> > active to inactive ratio with the size of memory, using the formula
> > active:inactive ratio = sqrt(memory in GB * 10).

> why do you think the best formula is sqrt(GB*10)?
> please tell me if you don't mind.

On a 1GB system, this leads to a ratio of 3 active anon
pages to 1 inactive anon page, and a maximum inactive
anon list size of 250MB.
 
On a 1TB system, this leads to a ratio of 100 active anon
pages to 1 inactive anon page, and a maximum inactive
anon list size of 10GB.

The numbers in-between looked reasonable :)

Basically the requirement is that the inactive anon list 
is large enough that pages get a chance to be referenced
again, but small enough that the maximum amount of work
the VM needs to do is bounded to something reasonable.

> and I worry a bit about whether it works well on small systems,
> because it indicates a 1:1 ratio on systems with less than 100MB of memory.
> What do you think about this?

A 1:1 ratio simply means that the inactive anon list is
the same size as the active anon list. Page replacement
should still work fine that way.
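
As a rough userspace illustration of how the formula scales (just a sketch
for the numbers above, not the kernel code, which would use int_sqrt() on
page counts rather than floating point on gigabytes):

	#include <stdio.h>
	#include <math.h>		/* link with -lm */

	static unsigned long inactive_ratio(unsigned long mem_gb)
	{
		unsigned long ratio = (unsigned long)sqrt((double)mem_gb * 10);
		return ratio ? ratio : 1;	/* never below 1:1 */
	}

	int main(void)
	{
		unsigned long gb[] = { 1, 10, 100, 1024, 10240 };
		int i;

		for (i = 0; i < 5; i++)
			printf("%6lu GB -> active:inactive = %lu:1\n",
			       gb[i], inactive_ratio(gb[i]));
		return 0;
	}

which prints 3:1 for 1GB, 10:1 for 10GB, 31:1 for 100GB and roughly
100:1 (101) for 1TB, matching the numbers above.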

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19  0:56   ` Nick Piggin
@ 2007-12-19 13:45     ` Rik van Riel
  2007-12-19 14:24       ` Peter Zijlstra
                         ` (2 more replies)
  2007-12-20  7:19     ` Christoph Lameter
  1 sibling, 3 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-19 13:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-mm, linux-kernel, Lee Schermerhorn

On Wed, 19 Dec 2007 11:56:48 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> 
> > Rework of a patch by Nick Piggin -- part 1 of 2.
> >
> > This patch:
> >
> > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> >    stub version of the mlock/noreclaim APIs when it's
> >    not configured.  Depends on [CONFIG_]NORECLAIM.

> Hmm, I still don't know (or forgot) why you don't just use the
> old scheme of having an mlock count in the LRU bit, and removing
> the mlocked page from the LRU completely.

How do we detect those pages reliably in the lumpy reclaim code?
 
> These mlocked pages don't need to be on a non-reclaimable list,
> because we can find them again via the ptes when they become
> unlocked, and there is no point background scanning them, because
> they're always going to be locked while they're mlocked.

Agreed.

The main reason I sent out these patches now is that I just
wanted to get some comments from other upstream developers.

I have gotten distracted by other work so much that I spent
most of my time forward porting the patch set, and not enough
time working with the rest of the upstream community to get
the code moving forward.

To be honest, I have only briefly looked at the non-reclaimable
code.  I would be more than happy to merge any improvements to
that code.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19 13:45     ` Rik van Riel
@ 2007-12-19 14:24       ` Peter Zijlstra
  2007-12-19 14:53         ` Rik van Riel
  2007-12-19 16:04       ` Lee Schermerhorn
  2007-12-19 23:34       ` Nick Piggin
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2007-12-19 14:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Nick Piggin, linux-mm, linux-kernel, Lee Schermerhorn


On Wed, 2007-12-19 at 08:45 -0500, Rik van Riel wrote:
> On Wed, 19 Dec 2007 11:56:48 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> > 
> > > Rework of a patch by Nick Piggin -- part 1 of 2.
> > >
> > > This patch:
> > >
> > > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> > >    stub version of the mlock/noreclaim APIs when it's
> > >    not configured.  Depends on [CONFIG_]NORECLAIM.
> 
> > Hmm, I still don't know (or forgot) why you don't just use the
> > old scheme of having an mlock count in the LRU bit, and removing
> > the mlocked page from the LRU completely.
> 
> How do we detect those pages reliably in the lumpy reclaim code?
>  
> > These mlocked pages don't need to be on a non-reclaimable list,
> > because we can find them again via the ptes when they become
> > unlocked, and there is no point background scanning them, because
> > they're always going to be locked while they're mlocked.

I thought Lee had patches that moved pages with long rmap chains (both
anon and file) out onto the non-reclaim list, for those a slow
background scan does make sense.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19 14:24       ` Peter Zijlstra
@ 2007-12-19 14:53         ` Rik van Riel
  2007-12-19 16:08           ` Lee Schermerhorn
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-19 14:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Nick Piggin, linux-mm, linux-kernel, Lee Schermerhorn

On Wed, 19 Dec 2007 15:24:07 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> I thought Lee had patches that moved pages with long rmap chains (both
> anon and file) out onto the non-reclaim list, for those a slow
> background scan does make sense.

I suspect we won't be needing that code.  The SEQ replacement for
swap backed pages might reduce the number of pages that need to
be scanned to a reasonable number.

Remember, steady states are not a big problem with the current VM.
It's the sudden burst of scanning that happens when the VM decides
that it should start swapping (and every anonymous page is referenced)
that kills large systems.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-19  0:48   ` Nick Piggin
  2007-12-19  4:09     ` KOSAKI Motohiro
@ 2007-12-19 15:52     ` Lee Schermerhorn
  2007-12-19 16:31       ` Rik van Riel
  1 sibling, 1 reply; 59+ messages in thread
From: Lee Schermerhorn @ 2007-12-19 15:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rik van Riel, linux-mm, linux-kernel, lee.shermerhorn

On Wed, 2007-12-19 at 11:48 +1100, Nick Piggin wrote:
> On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> > I have seen soft cpu lockups in page_referenced_file() due to
> > contention on i_mmap_lock() for different pages.  Making the
> > i_mmap_lock a reader/writer lock should increase parallelism
> > in vmscan for file back pages mapped into many address spaces.
> >
> > Read lock the i_mmap_lock for all usage except:
> >
> > 1) mmap/munmap:  linking vma into i_mmap prio_tree or removing
> > 2) unmap_mapping_range:   protecting vm_truncate_count
> >
> > rmap:  try_to_unmap_file() required new cond_resched_rwlock().
> > To reduce code duplication, I recast cond_resched_lock() as a
> > [static inline] wrapper around reworked cond_sched_lock() =>
> > __cond_resched_lock(void *lock, int type).
> > New cond_resched_rwlock() implemented as another wrapper.
> 
> Reader/writer locks really suck in terms of fairness and starvation,
> especially when the read-side is common and frequent. (also, single
> threaded performance of the read-side is worse).
> 
> I know Lee saw some big latencies on the anon_vma list lock when
> running (IIRC) a large benchmark... but are there more realistic
> situations where this is a problem?

Yes, we see the stall on the anon_vma lock most frequently running the
AIM benchmark with several tens of thousands of processes--all forked
from the same parent.  If we push the system into reclaim, all cpus end
up spinning on the lock in one of the anon_vma's shared by all the
tasks.  Quite easy to reproduce.  I have also seen this running stress
tests to force reclaim under Dave Anderson's "usex" exerciser--e.g.,
testing the split LRU and noreclaim patches--even with the reader-writer
lock patch. 

I've seen the lockups on the i_mmap_lock running Oracle workloads on our
large servers.  This is running an OLTP workload with only a thousand or
so "clients" all running the same application image.   Again, when the
system attempts to reclaim we end up spinning on the i_mmap_lock of one
of the files [possibly the shared global shmem segment] shared by all
the applications.  I also see it with the usex stress load--also, with
and without this patch.  I think this is a more probable
scenario--thousands of processes sharing a single file, such as
libc.so--than thousands of processes all descended from a single
ancestor w/o exec'ing.

I keep these patches up to date for testing.  I don't have conclusive
evidence whether they alleviate or exacerbate the problem nor by how
much.  

Lee


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19 13:45     ` Rik van Riel
  2007-12-19 14:24       ` Peter Zijlstra
@ 2007-12-19 16:04       ` Lee Schermerhorn
  2007-12-20 20:56         ` Rik van Riel
  2007-12-19 23:34       ` Nick Piggin
  2 siblings, 1 reply; 59+ messages in thread
From: Lee Schermerhorn @ 2007-12-19 16:04 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Nick Piggin, linux-mm, linux-kernel

On Wed, 2007-12-19 at 08:45 -0500, Rik van Riel wrote:
> On Wed, 19 Dec 2007 11:56:48 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> > 
> > > Rework of a patch by Nick Piggin -- part 1 of 2.
> > >
> > > This patch:
> > >
> > > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> > >    stub version of the mlock/noreclaim APIs when it's
> > >    not configured.  Depends on [CONFIG_]NORECLAIM.
> 
> > Hmm, I still don't know (or forgot) why you don't just use the
> > old scheme of having an mlock count in the LRU bit, and removing
> > the mlocked page from the LRU completely.
> 
> How do we detect those pages reliably in the lumpy reclaim code?

I wanted to try to treat nonreclaimable pages, whatever the reason,
uniformly.  Lumpy reclaim wasn't there when I started on this, but we've
been able to handle them.  I was more interested in page migration.  The
act of isolating the page from the LRU [under zone lru_lock] arbitrates
between racing tasks attempting to migrate the same page.  That and we
keep the isolated pages on a list using the LRU links.  We can't migrate
pages that we can't successfully isolate from the LRU list.  Nick had
addressed this by putting mlocked pages back on the lru for migration.
Seemed like unnecessary work to me.  And, as I recall, this lost the
mlock count and it had to be reestablished over time.
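
Roughly, the arbitration works like this (a simplified sketch, not the
exact isolate_lru_page() from this series; zone statistics updates and
error handling are omitted):

	static int isolate_for_migration(struct zone *zone, struct page *page,
					 struct list_head *pagelist)
	{
		int ret = -EBUSY;

		spin_lock_irq(&zone->lru_lock);
		if (PageLRU(page)) {		/* nobody isolated it yet */
			ClearPageLRU(page);
			get_page(page);
			/* reuse the lru links to hold the isolated page */
			list_move(&page->lru, pagelist);
			ret = 0;
		}
		spin_unlock_irq(&zone->lru_lock);
		return ret;	/* the losing racer sees -EBUSY and skips */
	}

Only one of two racing tasks can observe PG_lru set under the lru_lock,
so only one of them gets to migrate the page.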

>  
> > These mlocked pages don't need to be on a non-reclaimable list,
> > because we can find them again via the ptes when they become
> > unlocked, and there is no point background scanning them, because
> > they're always going to be locked while they're mlocked.
> 
> Agreed.

I also agree they don't need to be scanned.  And, altho' having them on
an LRU list has other uses, I suppose that having mlocked pages on the
noreclaim list could be considered "clutter" if we did want to scan the
noreclaim list for other types of non-reclaimable pages that might have
become reclaimable.  

Lee

> 
> The main reason I sent out these patches now is that I just
> wanted to get some comments from other upstream developers.
> 
> I have gotten distracted by other work so much that I spent
> most of my time forward porting the patch set, and not enough
> time working with the rest of the upstream community to get
> the code moving forward.
> 
> To be honest, I have only briefly looked at the non-reclaimable
> code.  I would be more than happy to merge any improvements to
> that code.
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19 14:53         ` Rik van Riel
@ 2007-12-19 16:08           ` Lee Schermerhorn
  0 siblings, 0 replies; 59+ messages in thread
From: Lee Schermerhorn @ 2007-12-19 16:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Peter Zijlstra, Nick Piggin, linux-mm, linux-kernel

On Wed, 2007-12-19 at 09:53 -0500, Rik van Riel wrote:
> On Wed, 19 Dec 2007 15:24:07 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > I thought Lee had patches that moved pages with long rmap chains (both
> > anon and file) out onto the non-reclaim list, for those a slow
> > background scan does make sense.
> 
> I suspect we won't be needing that code.  The SEQ replacement for
> swap backed pages might reduce the number of pages that need to
> be scanned to a reasonable number.
> 
> Remember, steady states are not a big problem with the current VM.
> It's the sudden burst of scanning that happens when the VM decides
> that it should start swapping (and every anonymous page is referenced)
> that kills large systems.

Yes, I still have the patch [for long anon_vma lists--not for
excessively mapped files, yet] and I'm keeping it up to date and tested.
I do see softlockups on the anon_vma and i_mmap_locks under stress, even
with the reader/writer lock patches.  I'll be trying the workloads on
Rik's latest patches to see if they address these lockups.

Lee


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-19 15:52     ` Lee Schermerhorn
@ 2007-12-19 16:31       ` Rik van Riel
  2007-12-19 16:53         ` Lee Schermerhorn
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-19 16:31 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Nick Piggin, linux-mm, linux-kernel, lee.shermerhorn

On Wed, 19 Dec 2007 10:52:09 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> I keep these patches up to date for testing.  I don't have conclusive
> evidence whether they alleviate or exacerbate the problem nor by how
> much.  

When the queued locking from Ingo's x86 tree hits mainline,
I suspect that spinlocks may end up behaving a lot nicer.

Should I drop the rwlock patches from my tree for now and
focus on just the page reclaim stuff?

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-19 16:31       ` Rik van Riel
@ 2007-12-19 16:53         ` Lee Schermerhorn
  2007-12-19 19:28           ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Lee Schermerhorn @ 2007-12-19 16:53 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Nick Piggin, linux-mm, linux-kernel, lee.schermerhorn

On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote:
> On Wed, 19 Dec 2007 10:52:09 -0500
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > I keep these patches up to date for testing.  I don't have conclusive
> > evidence whether they alleviate or exacerbate the problem nor by how
> > much.  
> 
> When the queued locking from Ingo's x86 tree hits mainline,
> I suspect that spinlocks may end up behaving a lot nicer.

That would be worth testing with our problematic workloads...

> 
> Should I drop the rwlock patches from my tree for now and
> focus on just the page reclaim stuff?

That's fine with me.  They're out there if anyone is interested.  I'll
keep them up to date in my tree [and hope they don't conflict with split
lru and noreclaim patches too much] for occasional testing.

Lee


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-19 16:53         ` Lee Schermerhorn
@ 2007-12-19 19:28           ` Peter Zijlstra
  2007-12-19 23:40             ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2007-12-19 19:28 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Rik van Riel, Nick Piggin, linux-mm, linux-kernel


On Wed, 2007-12-19 at 11:53 -0500, Lee Schermerhorn wrote:
> On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote:
> > On Wed, 19 Dec 2007 10:52:09 -0500
> > Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > 
> > > I keep these patches up to date for testing.  I don't have conclusive
> > > evidence whether they alleviate or exacerbate the problem nor by how
> > > much.  
> > 
> > When the queued locking from Ingo's x86 tree hits mainline,
> > I suspect that spinlocks may end up behaving a lot nicer.
> 
> That would be worth testing with our problematic workloads...
> 
> > 
> > Should I drop the rwlock patches from my tree for now and
> > focus on just the page reclaim stuff?
> 
> That's fine with me.  They're out there if anyone is interested.  I'll
> keep them up to date in my tree [and hope they don't conflict with split
> lru and noreclaim patches too much] for occasional testing.

Of course, someone would need to implement ticket locks for ia64 -
preferably without the 256 cpu limit.

Nick, growing spinlock_t to 64 bits would yield space for 64k cpus,
right? I'm guessing that would be enough for a while, even for SGI.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19 13:45     ` Rik van Riel
  2007-12-19 14:24       ` Peter Zijlstra
  2007-12-19 16:04       ` Lee Schermerhorn
@ 2007-12-19 23:34       ` Nick Piggin
  2 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2007-12-19 23:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, Lee Schermerhorn

On Thursday 20 December 2007 00:45, Rik van Riel wrote:
> On Wed, 19 Dec 2007 11:56:48 +1100
>
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> > > Rework of a patch by Nick Piggin -- part 1 of 2.
> > >
> > > This patch:
> > >
> > > 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the
> > >    stub version of the mlock/noreclaim APIs when it's
> > >    not configured.  Depends on [CONFIG_]NORECLAIM.
> >
> > Hmm, I still don't know (or forgot) why you don't just use the
> > old scheme of having an mlock count in the LRU bit, and removing
> > the mlocked page from the LRU completely.
>
> How do we detect those pages reliably in the lumpy reclaim code?

They will have PG_mlocked set.


> > These mlocked pages don't need to be on a non-reclaimable list,
> > because we can find them again via the ptes when they become
> > unlocked, and there is no point background scanning them, because
> > they're always going to be locked while they're mlocked.
>
> Agreed.
>
> The main reason I sent out these patches now is that I just
> wanted to get some comments from other upstream developers.
>
> I have gotten distracted by other work so much that I spent
> most of my time forward porting the patch set, and not enough
> time working with the rest of the upstream community to get
> the code moving forward.
>
> To be honest, I have only briefly looked at the non-reclaimable
> code.  I would be more than happy to merge any improvements to
> that code.

I haven't had too much time to look at it either, although it does
seem like a reasonable idea.

However the mlock code could be completely separate from the slow
scan pages (and not be on those LRUs at all).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-19 19:28           ` Peter Zijlstra
@ 2007-12-19 23:40             ` Nick Piggin
  2007-12-20  7:04               ` Christoph Lameter
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2007-12-19 23:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Lee Schermerhorn, Rik van Riel, linux-mm, linux-kernel

On Thursday 20 December 2007 06:28, Peter Zijlstra wrote:
> On Wed, 2007-12-19 at 11:53 -0500, Lee Schermerhorn wrote:
> > On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote:
> > > On Wed, 19 Dec 2007 10:52:09 -0500
> > >
> > > Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > > > I keep these patches up to date for testing.  I don't have conclusive
> > > > evidence whether they alleviate or exacerbate the problem nor by how
> > > > much.
> > >
> > > When the queued locking from Ingo's x86 tree hits mainline,
> > > I suspect that spinlocks may end up behaving a lot nicer.
> >
> > That would be worth testing with our problematic workloads...
> >
> > > Should I drop the rwlock patches from my tree for now and
> > > focus on just the page reclaim stuff?
> >
> > That's fine with me.  They're out there is anyone is interested.  I'll
> > keep them up to date in my tree [and hope they don't conflict with split
> > lru and noreclaim patches too much] for occasional testing.
>
> Of course, someone would need to implement ticket locks for ia64 -
> preferably without the 256 cpu limit.

Yep. Wouldn't be hard at all -- ia64 has a "fetchadd" with acquire
semantics.

The only reason the x86 ticket locks have the 256 CPU limit is that
if they go any bigger, we can't use the partial registers so would
have to have a few more instructions.


> Nick, growing spinlock_t to 64 bits would yield space for 64k cpus
> right? I'm guessing that would be enough for a while, even for SGI.

A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
each would be 16 bits). And it would actually shrink the spinlock in
the case of preempt kernels too (because it would no longer have the
lockbreak field).

And yes, I'll go out on a limb and say that 64k CPUs ought to be
enough for anyone ;)
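
As a rough userspace sketch of that layout (hypothetical C11-atomics code,
not the actual x86 or ia64 implementation -- real ticket locks use tuned
assembly, cpu_relax() and careful barriers):

	#include <stdatomic.h>

	struct ticket_lock {
		atomic_ushort next;	/* next ticket to hand out */
		atomic_ushort owner;	/* ticket currently being served */
	};

	static void ticket_lock(struct ticket_lock *lock)
	{
		/* grab a ticket; ia64's fetchadd or x86's lock xadd here */
		unsigned short me = atomic_fetch_add_explicit(&lock->next, 1,
						memory_order_relaxed);

		/* spin until "now serving" reaches our ticket */
		while (atomic_load_explicit(&lock->owner,
						memory_order_acquire) != me)
			;	/* cpu_relax() in kernel code */
	}

	static void ticket_unlock(struct ticket_lock *lock)
	{
		/* hand the lock to the next waiter; wraps at 16 bits */
		atomic_fetch_add_explicit(&lock->owner, 1,
						memory_order_release);
	}

Two 16-bit counters in a 32-bit word give the 64K-CPU headroom, and
acquisition order is strictly FIFO, which is the fairness that plain
spinlocks lack.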

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 10/20] SEQ replacement for anonymous pages
  2007-12-19 13:40     ` Rik van Riel
@ 2007-12-20  2:04       ` KOSAKI Motohiro
  0 siblings, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2007-12-20  2:04 UTC (permalink / raw)
  To: Rik van Riel; +Cc: kosaki.motohiro, linux-mm, linux-kernel, Lee Schermerhorn

Hi Rik-san

> > > To keep the maximum amount of necessary work reasonable, we scale the
> > > active to inactive ratio with the size of memory, using the formula
> > > active:inactive ratio = sqrt(memory in GB * 10).
> 
> > why do you think the best formula is sqrt(GB*10)?
> > please tell me if you don't mind.
> 
> On a 1GB system, this leads to a ratio of 3 active anon
> pages to 1 inactive anon page, and a maximum inactive
> anon list size of 250MB.
>  
> On a 1TB system, this leads to a ratio of 100 active anon
> pages to 1 inactive anon page, and a maximum inactive
> anon list size of 10GB.
> 
> The numbers in-between looked reasonable :)

thanks for your kind description.
I think it makes sense.

Also, please add a comment like the table below if you don't mind,
to make the description more intuitive.

total     return    max 
memory    value     inactive anon
-------------------------------------
 10MB       1         5MB
100MB       1        50MB
  1GB       3       250MB
 10GB      10       0.9GB
100GB      31         3GB
  1TB     101        10GB
 10TB     320        32GB


> Basically the requirement is that the inactive anon list 
> is large enough that pages get a chance to be referenced
> again, but small enough that the maximum amount of work
> the VM needs to do is bounded to something reasonable.
> 
> > and I worry a bit about whether it works well on small systems,
> > because it indicates a 1:1 ratio on systems with less than 100MB of memory.
> > What do you think about this?
> 
> A 1:1 ratio simply means that the inactive anon list is
> the same size as the active anon list. Page replacement
> should still work fine that way.

I'm sold. thanks.


/kosaki




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-19 23:40             ` Nick Piggin
@ 2007-12-20  7:04               ` Christoph Lameter
  2007-12-20  7:59                 ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Lameter @ 2007-12-20  7:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Lee Schermerhorn, Rik van Riel, linux-mm, linux-kernel



> The only reason the x86 ticket locks have the 256 CPU limit is that
> if they go any bigger, we can't use the partial registers so would
> have to have a few more instructions.

x86_64 is going up to 4k or 16k cpus soon for our new hardware.

> A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
> each would be 16 bits). And it would actually shrink the spinlock in
> the case of preempt kernels too (because it would no longer have the
> lockbreak field).
> 
> And yes, I'll go out on a limb and say that 64k CPUs ought to be
> enough for anyone ;)

I think those things need a timeframe applied to them.  That's likely
going to be true for the next 3 years (optimistic assessment ;-)).

Could you go to 32bit spinlock by default?

How about NUMA awareness for the spinlocks? Larger backoff periods for 
off node lock contentions please.



* Re: [patch 01/20] convert anon_vma list lock a read/write lock
  2007-12-18 21:15 ` [patch 01/20] convert anon_vma list lock a read/write lock Rik van Riel
@ 2007-12-20  7:07   ` Christoph Lameter
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Lameter @ 2007-12-20  7:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, lee.shermerhorn, Lee Schermerhorn

Reviewed-by: Christoph Lameter <clameter@sgi.com>

Note that this is a nice improvement also to page migration.

Another solution may be to use a single linked list and RCU.



* Re: [patch 03/20] move isolate_lru_page() to vmscan.c
  2007-12-18 21:15 ` [patch 03/20] move isolate_lru_page() to vmscan.c Rik van Riel
@ 2007-12-20  7:08   ` Christoph Lameter
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Lameter @ 2007-12-20  7:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, linux-kernel, lee.shermerhorn, Nick Piggin, Lee Schermerhorn

Reviewed-by: Christoph Lameter <clameter@sgi.com>




* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19  0:56   ` Nick Piggin
  2007-12-19 13:45     ` Rik van Riel
@ 2007-12-20  7:19     ` Christoph Lameter
  2007-12-20 15:33       ` Rik van Riel
  1 sibling, 1 reply; 59+ messages in thread
From: Christoph Lameter @ 2007-12-20  7:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, linux-mm, linux-kernel, lee.shermerhorn, Lee Schermerhorn

On Wed, 19 Dec 2007, Nick Piggin wrote:

> These mlocked pages don't need to be on a non-reclaimable list,
> because we can find them again via the ptes when they become
> unlocked, and there is no point background scanning them, because
> they're always going to be locked while they're mlocked.

But there is something to be said for having a consistent scheme. Here we 
already introduce address space flags for one kind of unreclaimability. 
Isn't it possible to come up with a way to categorize pages that works
(mostly) the same way for all types of pages with reclaim issues?



* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-20  7:04               ` Christoph Lameter
@ 2007-12-20  7:59                 ` Nick Piggin
  2008-01-02 23:35                   ` Mike Travis
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2007-12-20  7:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Lee Schermerhorn, Rik van Riel, linux-mm, linux-kernel

On Thursday 20 December 2007 18:04, Christoph Lameter wrote:
> > The only reason the x86 ticket locks have the 256 CPu limit is that
> > if they go any bigger, we can't use the partial registers so would
> > have to have a few more instructions.
>
> x86_64 is going up to 4k or 16k cpus soon for our new hardware.
>
> > A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
> > each would be 16 bits). And it would actually shrink the spinlock in
> > the case of preempt kernels too (because it would no longer have the
> > lockbreak field).
> >
> > And yes, I'll go out on a limb and say that 64k CPUs ought to be
> > enough for anyone ;)
>
> I think those things need a timeframe applied to it. Thats likely
> going to be true for the next 3 years (optimistic assessment ;-)).

Yeah, that was tongue in cheek ;)


> Could you go to 32bit spinlock by default?

On x86, the size of the ticket locks is 32 bit, simply because I didn't
want to risk possible alignment bugs (a subsequent patch cuts it down to
16 bits, but this is a much smaller win than 64->32 in general because
of natural alignment of types).

Note that the ticket locks still support twice the number as the old
spinlocks, so I'm not causing a regression here... but yes, increasing
the size further will require an extra instruction or two.

> How about NUMA awareness for the spinlocks? Larger backoff periods for
> off node lock contentions please.

ticket locks can naturally tell you how many waiters there are, and how
many waiters are in front of you, so it is really nice for doing backoff
(eg. you can adapt the backoff *very* nicely depending on how many are in
front of you, and how quickly you are moving toward the front).

Also, since I got rid of the ->break_lock field, you could use that space
perhaps to add a cpu # of the lock holder for even more backoff context
(if you find that helps).

Anyway, I didn't do any of that because it obviously needs someone with
real hardware in order to tune it properly.
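
For illustration only, the kind of distance-based backoff described here
could look roughly like the sketch below, assuming the 32-bit layout
discussed in this thread (low 16 bits "now serving", high 16 bits next
ticket).  BACKOFF_UNIT and the polling loop are made up, acquire ordering
is ignored, and unlock would simply increment the low half, as in the
code later in this thread.

#define BACKOFF_UNIT	64

static inline void ticket_lock_with_backoff(volatile unsigned int *lock)
{
	/* take a ticket: xadd returns the old value, high half is bumped */
	unsigned int me = __sync_fetch_and_add(lock, 0x00010000) >> 16;

	for (;;) {
		unsigned int ahead = (me - (*lock & 0xffff)) & 0xffff;

		if (!ahead)
			return;		/* now serving == our ticket */
		/* wait longer the further back in the queue we are */
		for (unsigned int i = 0; i < ahead * BACKOFF_UNIT; i++)
			__asm__ __volatile__("rep ; nop");
	}
}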


* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-20  7:19     ` Christoph Lameter
@ 2007-12-20 15:33       ` Rik van Riel
  2007-12-21 17:13         ` Lee Schermerhorn
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-20 15:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, linux-mm, linux-kernel, Lee Schermerhorn

On Wed, 19 Dec 2007 23:19:00 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 19 Dec 2007, Nick Piggin wrote:
> 
> > These mlocked pages don't need to be on a non-reclaimable list,
> > because we can find them again via the ptes when they become
> > unlocked, and there is no point background scanning them, because
> > they're always going to be locked while they're mlocked.
> 
> But there is something to be said for having a consistent scheme. 

The code as called from .c files should indeed be consistent.

However, since we never need to scan the non-reclaimable list,
we could use the inline functions in the .h files to have an
mlock count instead of a .lru list head in the non-reclaimable
pages.

At least, I think so.  I'm going to have to think about the
details a lot more.  I have no idea yet if there will be any
impact from batching the pages on pagevecs, vs. an atomic
mlock count...
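
Purely as a stand-in picture of that idea (illustrative types, not the
actual patches): pages that are mlocked and never scanned could reuse
the space of their LRU linkage for a count.

struct list_head { struct list_head *next, *prev; };	/* stand-in */

struct page_like {					/* stand-in for struct page */
	unsigned long flags;
	union {
		struct list_head lru;	/* while the page sits on an LRU list */
		long mlock_count;	/* while it is mlocked and off the LRU */
	} u;
};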


* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-19 16:04       ` Lee Schermerhorn
@ 2007-12-20 20:56         ` Rik van Riel
  2007-12-21 10:52           ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-20 20:56 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: Nick Piggin, linux-mm, linux-kernel

On Wed, 19 Dec 2007 11:04:26 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> On Wed, 2007-12-19 at 08:45 -0500, Rik van Riel wrote:
> > On Wed, 19 Dec 2007 11:56:48 +1100
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > > Hmm, I still don't know (or forgot) why you don't just use the
> > > old scheme of having an mlock count in the LRU bit, and removing
> > > the mlocked page from the LRU completely.
> > 
> > How do we detect those pages reliably in the lumpy reclaim code?
> 
> I wanted to try to treat nonreclaimable pages, whatever the reason,
> uniformly.  Lumpy reclaim wasn't there when I started on this, but we've
> been able to handle them.  I was more interested in page migration.  The
> act of isolating the page from the LRU [under zone lru_lock] arbitrates
> between racing tasks attempting to migrate the same page.  That and we
> keep the isolated pages on a list using the LRU links.  We can't migrate
> pages that we can't successfully isolate from the LRU list.

Good point.

Lets keep the nonreclaimable pages on a list, so we can keep
the migration code (and other code) consistent.

We can deal with lazily moving pages back to the nonreclaim
list if we find that, after one munlock, there are other
mlocking users of that page.

> I also agree they don't need to be scanned.  And, altho' having them on
> an LRU list has other uses, I suppose that having mlocked pages on the
> noreclaim list could be considered "clutter" if we did want to scan the
> noreclaim list for other types of non-reclaimable pages that might have
> become reclaimable.  

If we ever want to do that, we can always introduce separate
lists for those pages.


* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-20 20:56         ` Rik van Riel
@ 2007-12-21 10:52           ` Nick Piggin
  2007-12-21 14:17             ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2007-12-21 10:52 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Lee Schermerhorn, linux-mm, linux-kernel

On Friday 21 December 2007 07:56, Rik van Riel wrote:
> On Wed, 19 Dec 2007 11:04:26 -0500
>
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> > On Wed, 2007-12-19 at 08:45 -0500, Rik van Riel wrote:
> > > On Wed, 19 Dec 2007 11:56:48 +1100
> > >
> > > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > Hmm, I still don't know (or forgot) why you don't just use the
> > > > old scheme of having an mlock count in the LRU bit, and removing
> > > > the mlocked page from the LRU completely.
> > >
> > > How do we detect those pages reliably in the lumpy reclaim code?
> >
> > I wanted to try to treat nonreclaimable pages, whatever the reason,
> > uniformly.  Lumpy reclaim wasn't there when I started on this, but we've
> > been able to handle them.  I was more interested in page migration.  The
> > act of isolating the page from the LRU [under zone lru_lock] arbitrates
> > between racing tasks attempting to migrate the same page.  That and we
> > keep the isolated pages on a list using the LRU links.  We can't migrate
> > pages that we can't successfully isolate from the LRU list.
>
> Good point.

Ah: that's what it was. The migration code got harder with my mlock
code (although I did have something that should have worked in theory,
it involved a few steps).

I don't have a particular problem with putting mlock pages on the slow
scan lists, although if you have huge mlocked data sets, you could
effectively speed up slow scan list scanning by orders of magnitude by
avoiding the mlocked pages.

I won't push it now, but I might see if I can rewrite it one day :)

BTW. if you have any workloads that are limited by page reclaim,
especially unmapped file backed pagecache reclaim, then I have some
straight-line speedup patches which you might find interesting (I can
send them if you'd like to test).


* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-21 10:52           ` Nick Piggin
@ 2007-12-21 14:17             ` Rik van Riel
  2007-12-23 12:22               ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-21 14:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Lee Schermerhorn, linux-mm, linux-kernel

On Fri, 21 Dec 2007 21:52:19 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> BTW. if you have any workloads that are limited by page reclaim,
> especially unmapped file backed pagecache reclaim, then I have some
> stright-line-speedup patches which you might find interesting (I can
> send them if you'd like to test).

I am definitely interested in those.

The current upstream VM seems to be fairly good in steady state,
with the largest problems happening when the system switches states
(eg. from reclaiming page cache to swapping), but any speedups in
the steady state are good too.


* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-20 15:33       ` Rik van Riel
@ 2007-12-21 17:13         ` Lee Schermerhorn
  0 siblings, 0 replies; 59+ messages in thread
From: Lee Schermerhorn @ 2007-12-21 17:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Christoph Lameter, Nick Piggin, linux-mm, linux-kernel

On Thu, 2007-12-20 at 10:33 -0500, Rik van Riel wrote:
> On Wed, 19 Dec 2007 23:19:00 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Wed, 19 Dec 2007, Nick Piggin wrote:
> > 
> > > These mlocked pages don't need to be on a non-reclaimable list,
> > > because we can find them again via the ptes when they become
> > > unlocked, and there is no point background scanning them, because
> > > they're always going to be locked while they're mlocked.
> > 
> > But there is something to be said for having a consistent scheme. 
> 
> The code as called from .c files should indeed be consistent.
> 
> However, since we never need to scan the non-reclaimable list,
> we could use the inline functions in the .h files to have an
> mlock count instead of a .lru list head in the non-reclaimable
> pages.
> 
> At least, I think so.  I'm going to have to think about the
> details a lot more.  I have no idea yet if there will be any
> impact from batching the pages on pagevecs, vs. an atomic
> mlock count...

Just remember that page migration can migrate mlocked pages today, and
we'll want to continue to support this.  Migration uses the lru locks to
manage the pages selected for migration and uses isolation from lru list
under zone lru_lock to arbitrate any racing migrations.  Unless mlocked
pages are kept on an lru-like list, we'll need to put them back during
migration--possibly losing the mlock count or needing to save it
somewhere over [possibly lazy] migration.  
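
Roughly, that arbitration pattern looks like the following sketch
(userspace stand-ins, not the kernel's isolate_lru_page(); refcounting
is omitted): whoever clears the "on LRU" state under the lru lock owns
the page for migration, everyone else backs off.

#include <pthread.h>
#include <stdbool.h>

struct fake_page { bool on_lru; };

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true if the caller won the race and may migrate the page. */
static bool isolate_for_migration(struct fake_page *page)
{
	bool won = false;

	pthread_mutex_lock(&lru_lock);		/* plays the role of zone->lru_lock */
	if (page->on_lru) {
		page->on_lru = false;		/* like ClearPageLRU + list_del */
		won = true;
	}
	pthread_mutex_unlock(&lru_lock);
	return won;				/* losers must not touch the page */
}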

Note that having mlocked pages on the noreclaim lru list doesn't mean we
have to scan them to reclaim them.  It does however mean that we'd have
to skip over them to scan other types of non-reclaimable pages.  What
other types?  In my current implementation, SHM_LOCKed pages and
ramdisk/fs pages aren't mlocked.  Rather, they are culled to the
noreclaim list in vmscan when we detect a page on the [in]active list
that is in a nonreclaimable mapping.  This required minimal changes to
recognize and cull these nonreclaimable pages.  Currently PG_mlocked is
used only for pages mapped into a VM_LOCKED vma.  Since the noreclaim
list exists for the other types [and there could be more], it's little
additional work to keep mlocked pages there as well.  This makes them
play nice with migration.

Lee



* Re: [patch 00/20] VM pageout scalability improvements
  2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
                   ` (19 preceding siblings ...)
  2007-12-18 21:15 ` [patch 20/20] account mlocked pages Rik van Riel
@ 2007-12-22 20:27 ` Balbir Singh
  2007-12-23  0:21   ` Rik van Riel
  20 siblings, 1 reply; 59+ messages in thread
From: Balbir Singh @ 2007-12-22 20:27 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, lee.shermerhorn

Rik van Riel wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory presure in a catatonic state.
> 

Hi, Rik,

I remember you mentioning that by large memory systems you mean systems
with at-least 128GB, does this definition still hold?

> This patch series improves VM scalability by:
> 
> 1) making the locking a little more scalable
> 
> 2) putting filesystem backed, swap backed and non-reclaimable pages
>    onto their own LRUs, so the system only scans the pages that it
>    can/should evict from memory
> 
> 3) switching to SEQ replacement for the anonymous LRUs, so the
>    number of pages that need to be scanned when the system
>    starts swapping is bound to a reasonable number
> 
> The noreclaim patches come verbatim from Lee Schermerhorn and
> Nick Piggin.  I have not taken a detailed look at them yet and
> all I have done is fix the rejects against the latest -mm kernel.
> 

Is there a consolidated patch available? That would make it easier to test.

> I am posting this series now because I would like to get more
> feedback, while I am studying and improving the noreclaim patches
> myself.
> 

What kind of tests show the problem? I'll try and review and test the code.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [patch 00/20] VM pageout scalability improvements
  2007-12-22 20:27 ` [patch 00/20] VM pageout scalability improvements Balbir Singh
@ 2007-12-23  0:21   ` Rik van Riel
  2007-12-23 22:59     ` Balbir Singh
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-23  0:21 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, linux-kernel, lee.schermerhorn

On Sun, 23 Dec 2007 01:57:32 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Rik van Riel wrote:
> > On large memory systems, the VM can spend way too much time scanning
> > through pages that it cannot (or should not) evict from memory. Not
> > only does it use up CPU time, but it also provokes lock contention
> > and can leave large systems under memory presure in a catatonic state.
> 
> I remember you mentioning that by large memory systems you mean systems
> with at-least 128GB, does this definition still hold?

It depends on the workload.  Certain test cases can wedge the
VM with as little as 16GB of RAM.  Other workloads cause trouble
at 32 or 64GB, with the system sometimes hanging for several
minutes, all the CPUs in the pageout code and no actual swap IO.

On systems of 128GB and more, we have seen systems hang in the
pageout code overnight, without deciding what to swap out.
 
> > This patch series improves VM scalability by:
> > 
> > 1) making the locking a little more scalable
> > 
> > 2) putting filesystem backed, swap backed and non-reclaimable pages
> >    onto their own LRUs, so the system only scans the pages that it
> >    can/should evict from memory
> > 
> > 3) switching to SEQ replacement for the anonymous LRUs, so the
> >    number of pages that need to be scanned when the system
> >    starts swapping is bound to a reasonable number
> > 
> > The noreclaim patches come verbatim from Lee Schermerhorn and
> > Nick Piggin.  I have not taken a detailed look at them yet and
> > all I have done is fix the rejects against the latest -mm kernel.
> 
> Is there a consolidate patch available, it makes it easier to test.

I will make a big patch available with the next version.  I have
to upgrade my patch set to newer noreclaim patches from Lee and
add a few small cleanups elsewhere.

> > I am posting this series now because I would like to get more
> > feedback, while I am studying and improving the noreclaim patches
> > myself.
> 
> What kind of tests show the problem? I'll try and review and test the code.

The easiest test possible simply allocates a ton of memory and
then touches it all.  Enough memory that the system needs to go
into swap.
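
A trivial version of such a test might look like this (the default size
is an arbitrary placeholder; pass something comfortably larger than RAM):

#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t size = argc > 1 ? strtoull(argv[1], NULL, 0) : (1ULL << 34);
	long page = sysconf(_SC_PAGESIZE);
	char *mem = malloc(size);

	if (!mem)
		return 1;
	for (;;)	/* touch every page over and over, so they all stay referenced */
		for (size_t off = 0; off < size; off += (size_t)page)
			mem[off] = 1;
}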

Once memory is full, you will see the VM scan like mad, with a
big CPU spike (clearing the referenced bits off all pages) before
it starts swapping out anything.  That big CPU spike should be
gone or greatly reduced with my patches.

On really huge systems, that big CPU spike can be enough for one
CPU to spend so much time in the VM that all the other CPUs join
it, and the system goes under in a big lock contention fest.

Besides, even single threadedly clearing the referenced bits on
1TB worth of memory can't result in acceptable latencies :)

In the real world, users with large JVMs on their servers, which
sometimes go a little into swap, can trigger this problem.  All of
the CPUs end up scanning the active list, and all pages have the
referenced bit set.  Even if the system eventually recovers, it
might as well have been dead.

Going into swap a little should only take a little bit of time.


* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-21 14:17             ` Rik van Riel
@ 2007-12-23 12:22               ` Nick Piggin
  2007-12-24  1:00                 ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2007-12-23 12:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Lee Schermerhorn, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]

On Saturday 22 December 2007 01:17, Rik van Riel wrote:
> On Fri, 21 Dec 2007 21:52:19 +1100
>
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > BTW. if you have any workloads that are limited by page reclaim,
> > especially unmapped file backed pagecache reclaim, then I have some
> > stright-line-speedup patches which you might find interesting (I can
> > send them if you'd like to test).
>
> I am definately interested in those.

Sorry it took a few days...

So I'm testing throughput for the full pagecache lifecycle, insertion
to reclaim. Reading from a sparse file much bigger than memory.
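
A rough sketch of that kind of reader (file name and size are
placeholders, and a 64-bit off_t is assumed; running several copies at
once approximates the multi-dd case):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];		/* 1MB reads, like dd bs=1M */
	const off_t size = 1LL << 40;		/* 1TB of holes, far bigger than RAM */
	int fd = open("sparse-test-file", O_RDWR | O_CREAT, 0644);

	if (fd < 0 || ftruncate(fd, size) < 0)
		return 1;
	while (read(fd, buf, sizeof(buf)) > 0)
		;				/* data is discarded; we only exercise reclaim */
	return 0;
}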

On my 2 socket, 8 core Opteron setup:

Single dd from a single file:
vanilla:  687.01MB/s
patched: 1037.17MB/s (151%)

8 dds from a single file:
vanilla: 1458.04MB/s
patched: 5898.65MB/s (405%)

Not sure how well that translates to real world workloads, but it
might help somewhere. Admittedly some of the patches are pretty
complex...

Anyway, have a good christmas and new year :)

Thanks,
Nick

[-- Attachment #2: reclaim-speedups.tar.gz --]
[-- Type: application/x-tgz, Size: 19239 bytes --]


* Re: [patch 00/20] VM pageout scalability improvements
  2007-12-23  0:21   ` Rik van Riel
@ 2007-12-23 22:59     ` Balbir Singh
  2007-12-24  1:11       ` Rik van Riel
  0 siblings, 1 reply; 59+ messages in thread
From: Balbir Singh @ 2007-12-23 22:59 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, linux-kernel, lee.schermerhorn

Rik van Riel wrote:
> On Sun, 23 Dec 2007 01:57:32 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> Rik van Riel wrote:
>>> On large memory systems, the VM can spend way too much time scanning
>>> through pages that it cannot (or should not) evict from memory. Not
>>> only does it use up CPU time, but it also provokes lock contention
>>> and can leave large systems under memory presure in a catatonic state.
>> I remember you mentioning that by large memory systems you mean systems
>> with at-least 128GB, does this definition still hold?
> 
> It depends on the workload.  Certain test cases can wedge the
> VM with as little as 16GB of RAM.  Other workloads cause trouble
> at 32 or 64GB, with the system sometimes hanging for several
> minutes, all the CPUs in the pageout code and no actual swap IO.
> 

Interesting, I have not run into it so far. But I have smaller machines,
typically 4-8GB.

> On systems of 128GB and more, we have seen systems hang in the
> pageout code overnight, without deciding what to swap out.
> 
>>> This patch series improves VM scalability by:
>>>
>>> 1) making the locking a little more scalable
>>>
>>> 2) putting filesystem backed, swap backed and non-reclaimable pages
>>>    onto their own LRUs, so the system only scans the pages that it
>>>    can/should evict from memory
>>>
>>> 3) switching to SEQ replacement for the anonymous LRUs, so the
>>>    number of pages that need to be scanned when the system
>>>    starts swapping is bound to a reasonable number
>>>
>>> The noreclaim patches come verbatim from Lee Schermerhorn and
>>> Nick Piggin.  I have not taken a detailed look at them yet and
>>> all I have done is fix the rejects against the latest -mm kernel.
>> Is there a consolidate patch available, it makes it easier to test.
> 
> I will make a big patch available with the next version.  I have
> to upgrade my patch set to newer noreclaim patches from Lee and
> add a few small cleanups elsewhere.
> 

That would be nice. I'll try and help out by testing the patches and
running them.

>>> I am posting this series now because I would like to get more
>>> feedback, while I am studying and improving the noreclaim patches
>>> myself.
>> What kind of tests show the problem? I'll try and review and test the code.
> 
> The easiest test possible simply allocates a ton of memory and
> then touches it all.  Enough memory that the system needs to go
> into swap.
> 
> Once memory is full, you will see the VM scan like mad, with a
> big CPU spike (clearing the referenced bits off all pages) before
> it starts swapping out anything.  That big CPU spike should be
> gone or greatly reduced with my patches.
> 
> On really huge systems, that big CPU spike can be enough for one
> CPU to spend so much time in the VM that all the other CPUs join
> it, and the system goes under in a big lock contention fest.
> 
> Besides, even single threadedly clearing the referenced bits on
> 1TB worth of memory can't result in acceptable latencies :)
> 
> In the real world, users with large JVMs on their servers, which
> sometimes go a little into swap, can trigger this system.  All of
> the CPUs end up scanning the active list, and all pages have the
> referenced bit set.  Even if the system eventually recovers, it
> might as well have been dead.
> 
> Going into swap a little should only take a little bit of time.
> 

Very fascinating, so we need to scale better with larger memory.
I suspect part of the answer will lie with using large/huge pages.



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [patch 17/20] non-reclaimable mlocked pages
  2007-12-23 12:22               ` Nick Piggin
@ 2007-12-24  1:00                 ` Rik van Riel
  0 siblings, 0 replies; 59+ messages in thread
From: Rik van Riel @ 2007-12-24  1:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Lee Schermerhorn, linux-mm, linux-kernel

On Sun, 23 Dec 2007 23:22:08 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Not sure how well that translates to real world workloads, but it
> might help somewhere. Admittedly some of the patches are pretty
> complex...

I like your patch series.

They are completely orthogonal to my patches though, so I
won't tie them together by merging your series into mine :)

It looks like the majority of your patches could go into
-mm right away.


* Re: [patch 00/20] VM pageout scalability improvements
  2007-12-23 22:59     ` Balbir Singh
@ 2007-12-24  1:11       ` Rik van Riel
  2007-12-28  3:20         ` Matt Mackall
  0 siblings, 1 reply; 59+ messages in thread
From: Rik van Riel @ 2007-12-24  1:11 UTC (permalink / raw)
  To: balbir; +Cc: linux-mm, linux-kernel, lee.schermerhorn

On Mon, 24 Dec 2007 04:29:36 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Rik van Riel wrote:

> > In the real world, users with large JVMs on their servers, which
> > sometimes go a little into swap, can trigger this system.  All of
> > the CPUs end up scanning the active list, and all pages have the
> > referenced bit set.  Even if the system eventually recovers, it
> > might as well have been dead.
> > 
> > Going into swap a little should only take a little bit of time.
> 
> Very fascinating, so we need to scale better with larger memory.
> I suspect part of the answer will lie with using large/huge pages.

Linus vetoed going to a larger soft page size, with good reason.

Just look at how much the 64kB page size on PPC64 sucks for most
workloads - it works for PPC64 because people buy PPC64 monster
systems for the kinds of monster workloads that work well with a
large page size, but it definitely isn't general purpose.


* Re: [patch 00/20] VM pageout scalability improvements
  2007-12-24  1:11       ` Rik van Riel
@ 2007-12-28  3:20         ` Matt Mackall
  0 siblings, 0 replies; 59+ messages in thread
From: Matt Mackall @ 2007-12-28  3:20 UTC (permalink / raw)
  To: Rik van Riel; +Cc: balbir, linux-mm, linux-kernel, lee.schermerhorn


On Sun, 2007-12-23 at 20:11 -0500, Rik van Riel wrote:
> On Mon, 24 Dec 2007 04:29:36 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > Rik van Riel wrote:
> 
> > > In the real world, users with large JVMs on their servers, which
> > > sometimes go a little into swap, can trigger this system.  All of
> > > the CPUs end up scanning the active list, and all pages have the
> > > referenced bit set.  Even if the system eventually recovers, it
> > > might as well have been dead.
> > > 
> > > Going into swap a little should only take a little bit of time.
> > 
> > Very fascinating, so we need to scale better with larger memory.
> > I suspect part of the answer will lie with using large/huge pages.
> 
> Linus vetoed going to a larger soft page size, with good reason.
> 
> Just look at how much the 64kB page size on PPC64 sucks for most
> workloads - it works for PPC64 because people buy PPC64 monster
> systems for the kinds of monster workloads that work well with a
> large page size, but it definately isn't general purpose.

Indeed, machines already exist with >> 1TB of RAM, so even going to 1MB
pages leaves these machines in trouble. Going to big pages a few years
ago would have pushed the problem back a few years, but now we need real
fixes.

-- 
Mathematics is the supreme nostalgia of our time.



* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2007-12-20  7:59                 ` Nick Piggin
@ 2008-01-02 23:35                   ` Mike Travis
  2008-01-03  6:07                     ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Mike Travis @ 2008-01-02 23:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Peter Zijlstra, Lee Schermerhorn,
	Rik van Riel, linux-mm, linux-kernel, Ingo Molnar

Hi Nick,

Have you done anything more with allowing > 256 CPUs in this spinlock
patch?  We've been testing with 1k CPUs, and to verify with the -mm kernel
we need to "unpatch" these spinlock changes.

Thanks,
Mike

Nick Piggin wrote:
> On Thursday 20 December 2007 18:04, Christoph Lameter wrote:
>>> The only reason the x86 ticket locks have the 256 CPu limit is that
>>> if they go any bigger, we can't use the partial registers so would
>>> have to have a few more instructions.
>> x86_64 is going up to 4k or 16k cpus soon for our new hardware.
>>
>>> A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
>>> each would be 16 bits). And it would actually shrink the spinlock in
>>> the case of preempt kernels too (because it would no longer have the
>>> lockbreak field).
>>>
>>> And yes, I'll go out on a limb and say that 64k CPUs ought to be
>>> enough for anyone ;)
>> I think those things need a timeframe applied to it. Thats likely
>> going to be true for the next 3 years (optimistic assessment ;-)).
> 
> Yeah, that was tongue in cheek ;)
> 
> 
>> Could you go to 32bit spinlock by default?
> 
> On x86, the size of the ticket locks is 32 bit, simply because I didn't
> want to risk possible alignment bugs (a subsequent patch cuts it down to
> 16 bits, but this is a much smaller win than 64->32 in general because
> of natural alignment of types).
> 
> Note that the ticket locks still support twice the number as the old
> spinlocks, so I'm not causing a regression here... but yes, increasing
> the size further will require an extra instruction or two.
> 
>> How about NUMA awareness for the spinlocks? Larger backoff periods for
>> off node lock contentions please.
> 
> ticket locks can naturally tell you how many waiters there are, and how
> many waiters are in front of you, so it is really nice for doing backoff
> (eg. you can adapt the backoff *very* nicely depending on how many are in
> front of you, and how quickly you are moving toward the front).
> 
> Also, since I got rid of the ->break_lock field, you could use that space
> perhaps to add a cpu # of the lock holder for even more backoff context
> (if you find that helps).
> 
> Anyway, I didn't do any of that because it obviously needs someone with
> real hardware in order to tune it properly.
> 



* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2008-01-02 23:35                   ` Mike Travis
@ 2008-01-03  6:07                     ` Nick Piggin
  2008-01-03  8:55                       ` Ingo Molnar
  0 siblings, 1 reply; 59+ messages in thread
From: Nick Piggin @ 2008-01-03  6:07 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, Peter Zijlstra, Lee Schermerhorn,
	Rik van Riel, linux-mm, linux-kernel, Ingo Molnar

On Thursday 03 January 2008 10:35, Mike Travis wrote:
> Hi Nick,
>
> Have you done anything more with allowing > 256 CPUS in this spinlock
> patch?  We've been testing with 1k cpus and to verify with -mm kernel,
> we need to "unpatch" these spinlock changes.
>
> Thanks,
> Mike

Hi Mike,

Actually I had it in my mind that 64 bit used single-byte locking like
i386, so I didn't think I'd caused a regression there.

I'll take a look at fixing that up now.

Thanks,
Nick


* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2008-01-03  6:07                     ` Nick Piggin
@ 2008-01-03  8:55                       ` Ingo Molnar
  2008-01-07  9:01                         ` Nick Piggin
  0 siblings, 1 reply; 59+ messages in thread
From: Ingo Molnar @ 2008-01-03  8:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mike Travis, Christoph Lameter, Peter Zijlstra, Lee Schermerhorn,
	Rik van Riel, linux-mm, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > Have you done anything more with allowing > 256 CPUS in this 
> > spinlock patch?  We've been testing with 1k cpus and to verify with 
> > -mm kernel, we need to "unpatch" these spinlock changes.
> 
> Hi Mike,
> 
> Actually I had it in my mind that 64 bit used single-byte locking like 
> i386, so I didn't think I'd caused a regression there.
> 
> I'll take a look at fixing that up now.

thanks - this is a serious showstopper for the ticket spinlock patch. 

( which has otherwise been performing very well in x86.git so far - it 
  has passed a few thousand bootup tests on 64-bit and 32-bit as well, 
  so we are close to it being in a mergeable state. Would be a pity to
  lose it due to the 256-CPU limit. )

	Ingo


* Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
  2008-01-03  8:55                       ` Ingo Molnar
@ 2008-01-07  9:01                         ` Nick Piggin
  0 siblings, 0 replies; 59+ messages in thread
From: Nick Piggin @ 2008-01-07  9:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Travis, Christoph Lameter, Peter Zijlstra, Lee Schermerhorn,
	Rik van Riel, linux-mm, linux-kernel, ak

On Thursday 03 January 2008 19:55, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > Have you done anything more with allowing > 256 CPUS in this
> > > spinlock patch?  We've been testing with 1k cpus and to verify with
> > > -mm kernel, we need to "unpatch" these spinlock changes.
> >
> > Hi Mike,
> >
> > Actually I had it in my mind that 64 bit used single-byte locking like
> > i386, so I didn't think I'd caused a regression there.
> >
> > I'll take a look at fixing that up now.
>
> thanks - this is a serious showstopper for the ticket spinlock patch.
>
> ( which has otherwise been performing very well in x86.git so far - it
>   has passed a few thousand bootup tests on 64-bit and 32-bit as well,
>   so we are close to it being in a mergable state. Would be a pity to
>   lose it due to the 256 cpus limit. )

OK, this is what my test harness code looks like for > 256 CPUs
(basically the same as the in-kernel code, but some names etc. are slightly
different).

It passes my basic tests, and performance doesn't seem to have suffered.
I was going to suggest making the <= 256 vs > 256 cases config options, but
maybe we don't need to unless some CPUs are slow at shifts / rotates? I
don't know...

After I get comments, I will come up with an incremental patch against
the kernel... It will be interesting to know whether ticket locks help
big SGI systems.

/*
 * Lock layout: low 16 bits = "now serving" ticket, high 16 bits = next
 * ticket to hand out.
 */
static inline void xlock(lock_t *lock)
{
        lock_t inc = 0x00010000;        /* adds 1 to the next-ticket half */
        lock_t tmp;

        __asm__ __volatile__ (
                "lock ; xaddl %0, %1\n"         /* take a ticket; old lock value -> inc */
                "movzwl %w0, %2\n\t"            /* tmp = now-serving at the time we queued */
                "shrl $16, %0\n\t"              /* inc = our ticket number */
                "1:\t"
                "cmpl %0, %2\n\t"
                "je 2f\n\t"                     /* now serving our ticket: lock acquired */
                "rep ; nop\n\t"                 /* pause */
                "movzwl %1, %2\n\t"             /* re-read now-serving */
                /* don't need lfence here, because loads are in-order */
                "jmp 1b\n"
                "2:"
                :"+Q" (inc), "+m" (*lock), "=r" (tmp)
                :
                :"memory", "cc");
}

static inline int xtrylock(lock_t *lock)
{
        lock_t tmp;
        lock_t new;

        asm volatile(
                "movl %2,%0\n\t"                /* tmp = current lock value */
                "movl %0,%1\n\t"
                "roll $16, %0\n\t"              /* swap the two 16-bit halves */
                "cmpl %0,%1\n\t"                /* halves equal <=> lock is free */
                "jne 1f\n\t"
                "addl $0x00010000, %1\n\t"      /* new = value with next-ticket bumped */
                "lock ; cmpxchgl %1,%2\n\t"     /* install it if the lock hasn't changed */
                "1:"
                "sete %b1\n\t"
                "movzbl %b1,%0\n\t"             /* return 1 on success, 0 otherwise */
                :"=&a" (tmp), "=r" (new), "+m" (*lock)
                :
                : "memory", "cc");

        return tmp;
}

static inline void xunlock(lock_t *lock)
{
        __asm__ __volatile__(
                "incw %0"                       /* bump now-serving: hand off to next ticket */
                :"+m" (*lock)
                :
                :"memory", "cc");
}

                        



Thread overview: 59+ messages
2007-12-18 21:15 [patch 00/20] VM pageout scalability improvements Rik van Riel
2007-12-18 21:15 ` [patch 01/20] convert anon_vma list lock a read/write lock Rik van Riel
2007-12-20  7:07   ` Christoph Lameter
2007-12-18 21:15 ` [patch 02/20] make the inode i_mmap_lock a reader/writer lock Rik van Riel
2007-12-19  0:48   ` Nick Piggin
2007-12-19  4:09     ` KOSAKI Motohiro
2007-12-19 15:52     ` Lee Schermerhorn
2007-12-19 16:31       ` Rik van Riel
2007-12-19 16:53         ` Lee Schermerhorn
2007-12-19 19:28           ` Peter Zijlstra
2007-12-19 23:40             ` Nick Piggin
2007-12-20  7:04               ` Christoph Lameter
2007-12-20  7:59                 ` Nick Piggin
2008-01-02 23:35                   ` Mike Travis
2008-01-03  6:07                     ` Nick Piggin
2008-01-03  8:55                       ` Ingo Molnar
2008-01-07  9:01                         ` Nick Piggin
2007-12-18 21:15 ` [patch 03/20] move isolate_lru_page() to vmscan.c Rik van Riel
2007-12-20  7:08   ` Christoph Lameter
2007-12-18 21:15 ` [patch 04/20] free swap space on swap-in/activation Rik van Riel
2007-12-18 21:15 ` [patch 05/20] define page_file_cache() function Rik van Riel
2007-12-18 21:15 ` [patch 06/20] debugging checks for page_file_cache() Rik van Riel
2007-12-18 21:15 ` [patch 07/20] Use an indexed array for LRU variables Rik van Riel
2007-12-18 21:15 ` [patch 08/20] split LRU lists into anon & file sets Rik van Riel
2007-12-18 21:15 ` [patch 09/20] split anon & file LRUs for memcontrol code Rik van Riel
2007-12-18 21:15 ` [patch 10/20] SEQ replacement for anonymous pages Rik van Riel
2007-12-19  5:17   ` KOSAKI Motohiro
2007-12-19 13:40     ` Rik van Riel
2007-12-20  2:04       ` KOSAKI Motohiro
2007-12-18 21:15 ` [patch 11/20] add newly swapped in pages to the inactive list Rik van Riel
2007-12-18 21:15 ` [patch 12/20] No Reclaim LRU Infrastructure Rik van Riel
2007-12-18 21:15 ` [patch 13/20] Non-reclaimable page statistics Rik van Riel
2007-12-18 21:15 ` [patch 14/20] Scan noreclaim list for reclaimable pages Rik van Riel
2007-12-18 21:15 ` [patch 15/20] ramfs pages are non-reclaimable Rik van Riel
2007-12-18 21:15 ` [patch 16/20] SHM_LOCKED pages are nonreclaimable Rik van Riel
2007-12-18 21:15 ` [patch 17/20] non-reclaimable mlocked pages Rik van Riel
2007-12-19  0:56   ` Nick Piggin
2007-12-19 13:45     ` Rik van Riel
2007-12-19 14:24       ` Peter Zijlstra
2007-12-19 14:53         ` Rik van Riel
2007-12-19 16:08           ` Lee Schermerhorn
2007-12-19 16:04       ` Lee Schermerhorn
2007-12-20 20:56         ` Rik van Riel
2007-12-21 10:52           ` Nick Piggin
2007-12-21 14:17             ` Rik van Riel
2007-12-23 12:22               ` Nick Piggin
2007-12-24  1:00                 ` Rik van Riel
2007-12-19 23:34       ` Nick Piggin
2007-12-20  7:19     ` Christoph Lameter
2007-12-20 15:33       ` Rik van Riel
2007-12-21 17:13         ` Lee Schermerhorn
2007-12-18 21:15 ` [patch 18/20] mlock vma pages under mmap_sem held for read Rik van Riel
2007-12-18 21:15 ` [patch 19/20] handle mlocked pages during map/unmap and truncate Rik van Riel
2007-12-18 21:15 ` [patch 20/20] account mlocked pages Rik van Riel
2007-12-22 20:27 ` [patch 00/20] VM pageout scalability improvements Balbir Singh
2007-12-23  0:21   ` Rik van Riel
2007-12-23 22:59     ` Balbir Singh
2007-12-24  1:11       ` Rik van Riel
2007-12-28  3:20         ` Matt Mackall
