* [RFC 0/7] Move mlocked pages off the LRU and track them
@ 2007-02-05 20:52 Christoph Lameter
  2007-02-05 20:52 ` [RFC 1/7] Make try_to_unmap return a special exit code Christoph Lameter
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Arjan van de Ven, Nigel Cunningham,
	Martin J. Bligh, Peter Zijlstra, Nick Piggin, Christoph Lameter,
	Matt Mackall, Rik van Riel, KAMEZAWA Hiroyuki

[RFC] Remove mlocked pages from the LRU and track them

The patchset removes mlocked pages from the LRU and maintains a counter
for the number of discovered mlocked pages.

This is a lazy scheme for accounting for mlocked pages: a page may not
be discovered to be mlocked until reclaim encounters it. However, we
also attempt to detect mlocked pages at various other opportune
moments, so in general the mlock counter stays close to the actual
number of mlocked pages in the system.

Patch against 2.6.20-rc6-mm3

Known problems to be resolved:
- The page state bit used to mark a page mlocked is not available on i386
  with NUMA.
- Not tested on SMP or UP. Need to catch a plane in 2 hours.

Tested on:
IA64 NUMA 12p


* [RFC 1/7] Make try_to_unmap return a special exit code
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
@ 2007-02-05 20:52 ` Christoph Lameter
  2007-02-05 20:52 ` [RFC 2/7] Add PageMlocked() page state bit and lru infrastructure Christoph Lameter
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Peter Zijlstra, Martin J. Bligh,
	Arjan van de Ven, Nick Piggin, Matt Mackall, Christoph Lameter,
	Nigel Cunningham, Rik van Riel, KAMEZAWA Hiroyuki

[PATCH] Make try_to_unmap() return SWAP_MLOCK for mlocked pages

Modify try_to_unmap() so that callers can distinguish a failure to
unmap caused by an mlocked page from other failure modes.
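
The vmscan.c hunk below is the caller that makes use of this; the idea,
as a sketch, is simply:

	switch (try_to_unmap(page, 0)) {
	case SWAP_MLOCK:
		/* page is mapped by at least one VM_LOCKED vma */
		goto activate_locked;
	case SWAP_FAIL:
		goto activate_locked;
	case SWAP_AGAIN:
		goto keep_locked;
	case SWAP_SUCCESS:
		;	/* try to free the page below */
	}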

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: current/include/linux/rmap.h
===================================================================
--- current.orig/include/linux/rmap.h	2007-02-03 10:24:47.000000000 -0800
+++ current/include/linux/rmap.h	2007-02-03 10:25:08.000000000 -0800
@@ -134,5 +134,6 @@ static inline int page_mkclean(struct pa
 #define SWAP_SUCCESS	0
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
+#define SWAP_MLOCK	3
 
 #endif	/* _LINUX_RMAP_H */
Index: current/mm/rmap.c
===================================================================
--- current.orig/mm/rmap.c	2007-02-03 10:24:47.000000000 -0800
+++ current/mm/rmap.c	2007-02-03 10:25:08.000000000 -0800
@@ -631,10 +631,16 @@ static int try_to_unmap_one(struct page 
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
+	if (!migration) {
+		if (vma->vm_flags & VM_LOCKED) {
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
+
+		if (ptep_clear_flush_young(vma, address, pte)) {
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
 	}
 
 	/* Nuke the page table entry. */
@@ -799,7 +805,8 @@ static int try_to_unmap_anon(struct page
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
 		ret = try_to_unmap_one(page, vma, migration);
-		if (ret == SWAP_FAIL || !page_mapped(page))
+		if (ret == SWAP_FAIL || ret == SWAP_MLOCK ||
+				!page_mapped(page))
 			break;
 	}
 	spin_unlock(&anon_vma->lock);
@@ -830,7 +837,8 @@ static int try_to_unmap_file(struct page
 	spin_lock(&mapping->i_mmap_lock);
 	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 		ret = try_to_unmap_one(page, vma, migration);
-		if (ret == SWAP_FAIL || !page_mapped(page))
+		if (ret == SWAP_FAIL || ret == SWAP_MLOCK ||
+				!page_mapped(page))
 			goto out;
 	}
 
@@ -913,6 +921,7 @@ out:
  * SWAP_SUCCESS	- we succeeded in removing all mappings
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
+ * SWAP_MLOCK	- the page is under mlock()
  */
 int try_to_unmap(struct page *page, int migration)
 {
Index: current/mm/vmscan.c
===================================================================
--- current.orig/mm/vmscan.c	2007-02-03 10:25:00.000000000 -0800
+++ current/mm/vmscan.c	2007-02-03 10:25:12.000000000 -0800
@@ -516,6 +516,7 @@ static unsigned long shrink_page_list(st
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, 0)) {
 			case SWAP_FAIL:
+			case SWAP_MLOCK:
 				goto activate_locked;
 			case SWAP_AGAIN:
 				goto keep_locked;


* [RFC 2/7] Add PageMlocked() page state bit and lru infrastructure
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
  2007-02-05 20:52 ` [RFC 1/7] Make try_to_unmap return a special exit code Christoph Lameter
@ 2007-02-05 20:52 ` Christoph Lameter
  2007-02-05 20:52 ` [RFC 3/7] Add NR_MLOCK ZVC Christoph Lameter
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Arjan van de Ven, Nigel Cunningham,
	Martin J. Bligh, Peter Zijlstra, Nick Piggin, Christoph Lameter,
	Matt Mackall, Rik van Riel, KAMEZAWA Hiroyuki

Add PageMlocked() infrastructure

This adds a new page flag, PG_mlocked, to mark pages that were taken
off the LRU because they are referenced from a VM_LOCKED vma.

Also add pagevec handling for returning mlocked pages to the LRU.
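
A minimal usage sketch (the actual call sites are added by the later
patches in this series):

	/* discovery: the page has already been isolated from the LRU */
	SetPageMlocked(page);

	/* release: hand the page back; the per-cpu pagevec batches the
	 * LRU work and clears PG_mlocked under zone->lru_lock */
	if (PageMlocked(page))
		lru_cache_add_mlock(page);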

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: current/include/linux/page-flags.h
===================================================================
--- current.orig/include/linux/page-flags.h	2007-02-05 11:30:47.000000000 -0800
+++ current/include/linux/page-flags.h	2007-02-05 11:33:00.000000000 -0800
@@ -93,6 +93,7 @@
 
 #define PG_readahead		20	/* Reminder to do read-ahead */
 
+#define PG_mlocked		21	/* Page is mlocked */
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -235,6 +236,16 @@ static inline void SetPageUptodate(struc
 #define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
 #define ClearPageReadahead(page) clear_bit(PG_readahead, &(page)->flags)
 
+/*
+ * PageMlocked set means that the page was taken off the LRU because
+ * a VM_LOCKED vma does exist. PageMlocked must be cleared before a
+ * page is put back onto the LRU. PageMlocked is only modified
+ * under the zone->lru_lock like PageLRU.
+ */
+#define PageMlocked(page)	test_bit(PG_mlocked, &(page)->flags)
+#define SetPageMlocked(page)	set_bit(PG_mlocked, &(page)->flags)
+#define ClearPageMlocked(page)	clear_bit(PG_mlocked, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: current/include/linux/pagevec.h
===================================================================
--- current.orig/include/linux/pagevec.h	2007-02-05 11:30:47.000000000 -0800
+++ current/include/linux/pagevec.h	2007-02-05 11:33:00.000000000 -0800
@@ -25,6 +25,7 @@ void __pagevec_release_nonlru(struct pag
 void __pagevec_free(struct pagevec *pvec);
 void __pagevec_lru_add(struct pagevec *pvec);
 void __pagevec_lru_add_active(struct pagevec *pvec);
+void __pagevec_lru_add_mlock(struct pagevec *pvec);
 void pagevec_strip(struct pagevec *pvec);
 unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
 		pgoff_t start, unsigned nr_pages);
Index: current/include/linux/swap.h
===================================================================
--- current.orig/include/linux/swap.h	2007-02-05 11:30:47.000000000 -0800
+++ current/include/linux/swap.h	2007-02-05 11:33:00.000000000 -0800
@@ -181,6 +181,7 @@ extern unsigned int nr_free_pagecache_pa
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
 extern void FASTCALL(lru_cache_add_tail(struct page *));
+extern void FASTCALL(lru_cache_add_mlock(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
Index: current/mm/swap.c
===================================================================
--- current.orig/mm/swap.c	2007-02-05 11:30:47.000000000 -0800
+++ current/mm/swap.c	2007-02-05 11:33:00.000000000 -0800
@@ -178,6 +178,7 @@ EXPORT_SYMBOL(mark_page_accessed);
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
 static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
 static DEFINE_PER_CPU(struct pagevec, lru_add_tail_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_mlock_pvecs) = { 0, };
 
 void fastcall lru_cache_add(struct page *page)
 {
@@ -199,6 +200,16 @@ void fastcall lru_cache_add_active(struc
 	put_cpu_var(lru_add_active_pvecs);
 }
 
+void fastcall lru_cache_add_mlock(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(lru_add_mlock_pvecs);
+
+	page_cache_get(page);
+	if (!pagevec_add(pvec, page))
+		__pagevec_lru_add_mlock(pvec);
+	put_cpu_var(lru_add_mlock_pvecs);
+}
+
 static void __pagevec_lru_add_tail(struct pagevec *pvec)
 {
 	int i;
@@ -237,6 +248,9 @@ static void __lru_add_drain(int cpu)
 	pvec = &per_cpu(lru_add_tail_pvecs, cpu);
 	if (pagevec_count(pvec))
 		__pagevec_lru_add_tail(pvec);
+	pvec = &per_cpu(lru_add_mlock_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_add_mlock(pvec);
 }
 
 void lru_add_drain(void)
@@ -394,6 +408,7 @@ void __pagevec_lru_add(struct pagevec *p
 			spin_lock_irq(&zone->lru_lock);
 		}
 		VM_BUG_ON(PageLRU(page));
+		VM_BUG_ON(PageMlocked(page));
 		SetPageLRU(page);
 		add_page_to_inactive_list(zone, page);
 	}
@@ -423,6 +438,7 @@ void __pagevec_lru_add_active(struct pag
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(PageMlocked(page));
 		SetPageActive(page);
 		add_page_to_active_list(zone, page);
 	}
@@ -432,6 +448,36 @@ void __pagevec_lru_add_active(struct pag
 	pagevec_reinit(pvec);
 }
 
+void __pagevec_lru_add_mlock(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		BUG_ON(PageLRU(page));
+		if (!PageMlocked(page))
+			continue;
+		ClearPageMlocked(page);
+		smp_wmb();
+		__dec_zone_state(zone, NR_MLOCK);
+		SetPageLRU(page);
+		add_page_to_active_list(zone, page);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
 /*
  * Function used uniquely to put pages back to the lru at the end of the
  * inactive list to preserve the lru order. Currently only used by swap


* [RFC 3/7] Add NR_MLOCK ZVC
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
  2007-02-05 20:52 ` [RFC 1/7] Make try_to_unmap return a special exit code Christoph Lameter
  2007-02-05 20:52 ` [RFC 2/7] Add PageMlocked() page state bit and lru infrastructure Christoph Lameter
@ 2007-02-05 20:52 ` Christoph Lameter
  2007-02-05 20:52 ` [RFC 4/7] Logic to move mlocked pages Christoph Lameter
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Peter Zijlstra, Martin J. Bligh,
	Arjan van de Ven, Nick Piggin, Matt Mackall, Christoph Lameter,
	Nigel Cunningham, Rik van Riel, KAMEZAWA Hiroyuki

Basic infrastructure to support NR_MLOCK

Add a new zoned VM counter (ZVC), NR_MLOCK, that counts the number of
mlocked pages taken off the LRU. Also remove the now-inaccurate cache
line placement comments in mmzone.h.
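
The counter can then be read like any other zoned VM counter; the
meminfo hunks below export it to userspace as an "Mlock:" line. A
usage sketch (K() converts pages to kB, as in proc_misc.c):

	unsigned long mlocked_kb = K(global_page_state(NR_MLOCK));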

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: current/drivers/base/node.c
===================================================================
--- current.orig/drivers/base/node.c	2007-02-05 11:30:47.000000000 -0800
+++ current/drivers/base/node.c	2007-02-05 11:39:26.000000000 -0800
@@ -60,6 +60,7 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d FilePages:    %8lu kB\n"
 		       "Node %d Mapped:       %8lu kB\n"
 		       "Node %d AnonPages:    %8lu kB\n"
+		       "Node %d Mlock:        %8lu kB\n"
 		       "Node %d PageTables:   %8lu kB\n"
 		       "Node %d NFS_Unstable: %8lu kB\n"
 		       "Node %d Bounce:       %8lu kB\n"
@@ -82,6 +83,7 @@ static ssize_t node_read_meminfo(struct 
 		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
 		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(node_page_state(nid, NR_MLOCK)),
 		       nid, K(node_page_state(nid, NR_PAGETABLE)),
 		       nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
 		       nid, K(node_page_state(nid, NR_BOUNCE)),
Index: current/fs/proc/proc_misc.c
===================================================================
--- current.orig/fs/proc/proc_misc.c	2007-02-05 11:30:47.000000000 -0800
+++ current/fs/proc/proc_misc.c	2007-02-05 11:39:26.000000000 -0800
@@ -166,6 +166,7 @@ static int meminfo_read_proc(char *page,
 		"Writeback:    %8lu kB\n"
 		"AnonPages:    %8lu kB\n"
 		"Mapped:       %8lu kB\n"
+		"Mlock:        %8lu kB\n"
 		"Slab:         %8lu kB\n"
 		"SReclaimable: %8lu kB\n"
 		"SUnreclaim:   %8lu kB\n"
@@ -196,6 +197,7 @@ static int meminfo_read_proc(char *page,
 		K(global_page_state(NR_WRITEBACK)),
 		K(global_page_state(NR_ANON_PAGES)),
 		K(global_page_state(NR_FILE_MAPPED)),
+		K(global_page_state(NR_MLOCK)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
 				global_page_state(NR_SLAB_UNRECLAIMABLE)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE)),
Index: current/include/linux/mmzone.h
===================================================================
--- current.orig/include/linux/mmzone.h	2007-02-05 11:30:47.000000000 -0800
+++ current/include/linux/mmzone.h	2007-02-05 11:45:12.000000000 -0800
@@ -47,17 +47,16 @@ struct zone_padding {
 #endif
 
 enum zone_stat_item {
-	/* First 128 byte cacheline (assuming 64 bit words) */
 	NR_FREE_PAGES,
 	NR_INACTIVE,
 	NR_ACTIVE,
+	NR_MLOCK,	/* Mlocked pages */
 	NR_ANON_PAGES,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_FILE_PAGES,
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
-	/* Second 128 byte cacheline */
 	NR_SLAB_RECLAIMABLE,
 	NR_SLAB_UNRECLAIMABLE,
 	NR_PAGETABLE,		/* used for pagetables */
Index: current/mm/vmstat.c
===================================================================
--- current.orig/mm/vmstat.c	2007-02-05 11:30:47.000000000 -0800
+++ current/mm/vmstat.c	2007-02-05 11:43:38.000000000 -0800
@@ -434,6 +434,7 @@ static const char * const vmstat_text[] 
 	"nr_free_pages",
 	"nr_active",
 	"nr_inactive",
+	"nr_mlock",
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",


* [RFC 4/7] Logic to move mlocked pages
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
                   ` (2 preceding siblings ...)
  2007-02-05 20:52 ` [RFC 3/7] Add NR_MLOCK ZVC Christoph Lameter
@ 2007-02-05 20:52 ` Christoph Lameter
  2007-02-05 20:53 ` [RFC 5/7] Consolidate new anonymous page code paths Christoph Lameter
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Arjan van de Ven, Nigel Cunningham,
	Martin J. Bligh, Peter Zijlstra, Nick Piggin, Christoph Lameter,
	Matt Mackall, Rik van Riel, KAMEZAWA Hiroyuki

Add logic to lazily remove mlocked pages from the LRU and add them back

This is the core of the patchset. It adds the necessary logic to
remove mlocked pages from the LRU and put them back later. Basic idea
by Andrew Morton and others.

During reclaim we attempt to unmap pages. In order to do so we have
to scan all vmas that map the page and check whether VM_LOCKED is set.

If we find a VM_LOCKED vma for a page, we remove the page from the
LRU and mark it with SetPageMlocked() so that we know to put the page
back on the LRU later, once the mlocked state is cleared.

We put the pages back in two places:

zap_pte_range: 	Pages are removed from a vma. If a page is mlocked then we
	add it back to the LRU. If other vmas with VM_LOCKED set have mapped
	the page then we will discover that later during reclaim and move
	the page off the LRU again.

munlock/munlockall: We scan all pages in the vma and do the
	same as in zap_pte_range.

We also have to modify the page migration logic to handle PageMlocked
pages. We simply clear the PageMlocked bit and then we can treat
the page as a regular page from the LRU.

Note that this is lazy accounting of mlocked pages. NR_MLOCK may
increase as the system discovers more mlocked pages. Some of the later
patches opportunistically move pages off the LRU earlier, avoiding some
of the delayed accounting. However, the scheme is fundamentally lazy,
so one cannot count on NR_MLOCK to reflect the actual number of mlocked
pages. It is the number of mlocked pages *discovered so far*, which may
be less than the actual number of mlocked pages.
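
In terms of page state, the intended lifecycle is (annotated sketch):

	/*
	 * on the LRU: PG_lru set, PG_mlocked clear
	 *    | reclaim: try_to_unmap() returns SWAP_MLOCK
	 *    v
	 * off the LRU: PG_mlocked set, NR_MLOCK incremented
	 *    | zap_pte_range()/munlock finds PageMlocked()
	 *    v
	 * lru_cache_add_mlock(): PG_mlocked cleared, NR_MLOCK decremented,
	 * page returns to the active list
	 */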

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: current/mm/memory.c
===================================================================
--- current.orig/mm/memory.c	2007-02-05 11:38:35.000000000 -0800
+++ current/mm/memory.c	2007-02-05 11:57:28.000000000 -0800
@@ -682,6 +682,8 @@ static unsigned long zap_pte_range(struc
 				file_rss--;
 			}
 			page_remove_rmap(page, vma);
+			if (PageMlocked(page) && vma->vm_flags & VM_LOCKED)
+				lru_cache_add_mlock(page);
 			tlb_remove_page(tlb, page);
 			continue;
 		}
Index: current/mm/migrate.c
===================================================================
--- current.orig/mm/migrate.c	2007-02-05 11:30:47.000000000 -0800
+++ current/mm/migrate.c	2007-02-05 11:47:23.000000000 -0800
@@ -58,6 +58,11 @@ int isolate_lru_page(struct page *page, 
 			else
 				del_page_from_inactive_list(zone, page);
 			list_add_tail(&page->lru, pagelist);
+		} else
+		if (PageMlocked(page)) {
+			get_page(page);
+			ClearPageMlocked(page);
+			list_add_tail(&page->lru, pagelist);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
Index: current/mm/mlock.c
===================================================================
--- current.orig/mm/mlock.c	2007-02-05 11:30:47.000000000 -0800
+++ current/mm/mlock.c	2007-02-05 11:47:23.000000000 -0800
@@ -10,7 +10,7 @@
 #include <linux/mm.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
-
+#include <linux/swap.h>
 
 static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	unsigned long start, unsigned long end, unsigned int newflags)
@@ -63,6 +63,24 @@ success:
 		pages = -pages;
 		if (!(newflags & VM_IO))
 			ret = make_pages_present(start, end);
+	} else {
+		unsigned long addr;
+
+		/*
+		 * We are clearing VM_LOCKED. Feed all pages back
+		 * to the LRU via lru_cache_add_mlock().
+		 */
+		for (addr = start; addr < end; addr += PAGE_SIZE) {
+			/*
+			 * No need to get a page reference. mmap_sem
+			 * writelock is held.
+			 */
+			struct page *page = follow_page(vma, addr, 0);
+
+			if (page && PageMlocked(page))
+				lru_cache_add_mlock(page);
+			cond_resched();
+		}
 	}
 
 	mm->locked_vm -= pages;
Index: current/mm/vmscan.c
===================================================================
--- current.orig/mm/vmscan.c	2007-02-05 11:30:47.000000000 -0800
+++ current/mm/vmscan.c	2007-02-05 11:57:40.000000000 -0800
@@ -516,10 +516,11 @@ static unsigned long shrink_page_list(st
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, 0)) {
 			case SWAP_FAIL:
-			case SWAP_MLOCK:
 				goto activate_locked;
 			case SWAP_AGAIN:
 				goto keep_locked;
+			case SWAP_MLOCK:
+				goto mlocked;
 			case SWAP_SUCCESS:
 				; /* try to free the page below */
 			}
@@ -594,6 +595,13 @@ free_it:
 			__pagevec_release_nonlru(&freed_pvec);
 		continue;
 
+mlocked:
+		ClearPageActive(page);
+		unlock_page(page);
+		__inc_zone_page_state(page, NR_MLOCK);
+		SetPageMlocked(page);
+		continue;
+
 activate_locked:
 		SetPageActive(page);
 		pgactivate++;


* [RFC 5/7] Consolidate new anonymous page code paths
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
                   ` (3 preceding siblings ...)
  2007-02-05 20:52 ` [RFC 4/7] Logic to move mlocked pages Christoph Lameter
@ 2007-02-05 20:53 ` Christoph Lameter
  2007-02-05 20:53 ` [RFC 6/7] Avoid putting new mlocked anonymous pages on LRU Christoph Lameter
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:53 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Peter Zijlstra, Martin J. Bligh,
	Arjan van de Ven, Nick Piggin, Matt Mackall, Christoph Lameter,
	Nigel Cunningham, Rik van Riel, KAMEZAWA Hiroyuki

Consolidate code to add an anonymous page in memory.c

There are two locations in which we add anonymous pages. Both
implement the same logic. Create a new function, add_anon_page(),
to provide a common code path.
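
With the helper in place, both call sites collapse to a single call
(sketch of the result; the hunks below do exactly this):

	add_anon_page(vma, page, address);	/* anon_rss, LRU add, rmap */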

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: current/mm/memory.c
===================================================================
--- current.orig/mm/memory.c	2007-02-05 12:31:49.000000000 -0800
+++ current/mm/memory.c	2007-02-05 12:32:34.000000000 -0800
@@ -900,6 +900,17 @@ unsigned long zap_page_range(struct vm_a
 }
 
 /*
+ * Add a new anonymous page
+ */
+static void add_anon_page(struct vm_area_struct *vma, struct page *page,
+				unsigned long address)
+{
+	inc_mm_counter(vma->vm_mm, anon_rss);
+	lru_cache_add_active(page);
+	page_add_new_anon_rmap(page, vma, address);
+}
+
+/*
  * Do a quick page-table lookup for a single page.
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
@@ -2103,9 +2114,7 @@ static int do_anonymous_page(struct mm_s
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
-		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
-		page_add_new_anon_rmap(page, vma, address);
+		add_anon_page(vma, page, address);
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
@@ -2249,11 +2258,9 @@ retry:
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		set_pte_at(mm, address, page_table, entry);
-		if (anon) {
-			inc_mm_counter(mm, anon_rss);
-			lru_cache_add_active(new_page);
-			page_add_new_anon_rmap(new_page, vma, address);
-		} else {
+		if (anon)
+			add_anon_page(vma, new_page, address);
+		else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
 			if (write_access) {


* [RFC 6/7] Avoid putting new mlocked anonymous pages on LRU
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
                   ` (4 preceding siblings ...)
  2007-02-05 20:53 ` [RFC 5/7] Consolidate new anonymous page code paths Christoph Lameter
@ 2007-02-05 20:53 ` Christoph Lameter
  2007-02-05 20:53 ` [RFC 7/7] Opportunistically move mlocked pages off the LRU Christoph Lameter
  2007-02-06 16:04 ` [RFC 0/7] Move mlocked pages off the LRU and track them Lee Schermerhorn
  7 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:53 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Arjan van de Ven, Nigel Cunningham,
	Martin J. Bligh, Peter Zijlstra, Nick Piggin, Christoph Lameter,
	Matt Mackall, Rik van Riel, KAMEZAWA Hiroyuki

Mark new anonymous pages mlocked if they are in an mlocked VMA.

Avoid putting pages onto the LRU when they are allocated in a VMA
with VM_LOCKED set. This makes NR_MLOCK more accurate.
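
The key observation is that a freshly allocated anonymous page is not
yet on the LRU, so it can be marked mlocked directly without taking
zone->lru_lock (sketch of the hunk below):

	if (vma->vm_flags & VM_LOCKED) {
		/* new page: not on the LRU yet */
		SetPageMlocked(page);
		inc_zone_page_state(page, NR_MLOCK);
	} else
		lru_cache_add_active(page);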

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: current/mm/memory.c
===================================================================
--- current.orig/mm/memory.c	2007-02-05 12:32:34.000000000 -0800
+++ current/mm/memory.c	2007-02-05 12:31:57.000000000 -0800
@@ -906,7 +906,15 @@ static void add_anon_page(struct vm_area
 				unsigned long address)
 {
 	inc_mm_counter(vma->vm_mm, anon_rss);
-	lru_cache_add_active(page);
+	if (vma->vm_flags & VM_LOCKED) {
+		/*
+		 * Page is new and therefore not on the LRU
+		 * so we can directly mark it as mlocked
+		 */
+		SetPageMlocked(page);
+		inc_zone_page_state(page, NR_MLOCK);
+	} else
+		lru_cache_add_active(page);
 	page_add_new_anon_rmap(page, vma, address);
 }
 


* [RFC 7/7] Opportunistically move mlocked pages off the LRU
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
                   ` (5 preceding siblings ...)
  2007-02-05 20:53 ` [RFC 6/7] Avoid putting new mlocked anonymous pages on LRU Christoph Lameter
@ 2007-02-05 20:53 ` Christoph Lameter
  2007-02-06 16:04 ` [RFC 0/7] Move mlocked pages off the LRU and track them Lee Schermerhorn
  7 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2007-02-05 20:53 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, Christoph Hellwig, Peter Zijlstra, Martin J. Bligh,
	Arjan van de Ven, Nick Piggin, Matt Mackall, Christoph Lameter,
	Nigel Cunningham, Rik van Riel, KAMEZAWA Hiroyuki

Opportunistically move mlocked pages off the LRU

Add a new function, try_to_set_mlocked(), that attempts to
move a page off the LRU and mark it mlocked.

This function can then be used in various code paths to move
pages off the LRU immediately. Early discovery will make NR_MLOCK
track the actual number of mlocked pages in the system more closely.
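
try_to_set_mlocked() is then called from places that already know the
vma, e.g. when a page is looked up or faulted in through a VM_LOCKED
vma (sketch; the memory.c hunks below are the real call sites):

	if (vma->vm_flags & VM_LOCKED)
		try_to_set_mlocked(page);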

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: current/mm/memory.c
===================================================================
--- current.orig/mm/memory.c	2007-02-05 12:00:30.000000000 -0800
+++ current/mm/memory.c	2007-02-05 12:01:52.000000000 -0800
@@ -919,6 +919,30 @@ static void add_anon_page(struct vm_area
 }
 
 /*
+ * Opportunistically move the page off the LRU
+ * if possible. If we do not succeed then the LRU
+ * scans will take the page off.
+ */
+void try_to_set_mlocked(struct page *page)
+{
+	struct zone *zone;
+	unsigned long flags;
+
+	if (!PageLRU(page) || PageMlocked(page))
+		return;
+
+	zone = page_zone(page);
+	if (spin_trylock_irqsave(&zone->lru_lock, flags)) {
+		if (PageLRU(page) && !PageMlocked(page)) {
+			ClearPageLRU(page);
+			list_del(&page->lru);
+			SetPageMlocked(page);
+			__inc_zone_page_state(page, NR_MLOCK);
+		}
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	}
+}
+/*
  * Do a quick page-table lookup for a single page.
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
@@ -978,6 +1002,8 @@ struct page *follow_page(struct vm_area_
 			set_page_dirty(page);
 		mark_page_accessed(page);
 	}
+	if (vma->vm_flags & VM_LOCKED)
+		try_to_set_mlocked(page);
 unlock:
 	pte_unmap_unlock(ptep, ptl);
 out:
@@ -2271,6 +2297,8 @@ retry:
 		else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(new_page);
+			if (vma->vm_flags & VM_LOCKED)
+				try_to_set_mlocked(new_page);
 			if (write_access) {
 				dirty_page = new_page;
 				get_page(dirty_page);


* Re: [RFC 0/7] Move mlocked pages off the LRU and track them
  2007-02-05 20:52 [RFC 0/7] Move mlocked pages off the LRU and track them Christoph Lameter
                   ` (6 preceding siblings ...)
  2007-02-05 20:53 ` [RFC 7/7] Opportunistically move mlocked pages off the LRU Christoph Lameter
@ 2007-02-06 16:04 ` Lee Schermerhorn
  2007-02-06 16:50   ` Larry Woodman
  2007-02-06 19:51   ` Andrew Morton
  7 siblings, 2 replies; 12+ messages in thread
From: Lee Schermerhorn @ 2007-02-06 16:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, Christoph Hellwig, Arjan van de Ven,
	Nigel Cunningham, Martin J. Bligh, Peter Zijlstra, Nick Piggin,
	Matt Mackall, Rik van Riel, KAMEZAWA Hiroyuki, Larry Woodman

On Mon, 2007-02-05 at 12:52 -0800, Christoph Lameter wrote:
> [RFC] Remove mlocked pages from the LRU and track them
> 
> The patchset removes mlocked pages from the LRU and maintains a counter
> for the number of discovered mlocked pages.
> 
> This is a lazy scheme for accounting for mlocked pages. The pages
> may only be discovered to be mlocked during reclaim. However, we attempt
> to detect mlocked pages at various other opportune moments. So in general
> the mlock counter is not far off the number of actual mlocked pages in
> the system.
> 
> Patch against 2.6.20-rc6-mm3
> 
> Known problems to be resolved:
> - Page state bit used to mark a page mlocked is not available on i386 with
>   NUMA.
> - Note tested on SMP, UP. Need to catch a plane in 2 hours.
> 
> Tested on:
> IA64 NUMA 12p

Note that anon [and shmem] pages in excess of available swap are
effectively mlocked.  In the field, we have seen non-NUMA x86_64
systems with 64-128GB [16-32 million 4k pages] with little to no
swap--big database servers.  The majority of the memory is dedicated to
large database shared memory areas.  The remainder is divided between
program anon and page cache [executable, libs] pages and any other page
cache pages used by database utilities, system daemons, ...

The system runs fine until someone runs a backup [or multiple, as there
are multiple database instances running].  This overcommits memory and
we end up with all CPUs in reclaim, contending for the zone lru lock,
and walking an active list of tens of millions of pages looking for
pages to reclaim.  The reclaim logic spends a lot of time walking the
lru lists, nominating shmem pages [the majority of pages on the list]
for reclaim, only to find in shrink_page_list() that it can't move the
page to swap.  So it puts the page back on the list to be retried by
the other CPUs once they obtain the zone lru lock.  The system appears
to be hung for long periods of time.

There are a lot of behaviors in the reclaim code that exacerbate the
problems when we get into this mode, but the long lists of unswappable
anon/shmem pages are the major culprit.  One of the guys at Red Hat has
tried a "proof of concept" patch to move all anon/shmem pages in excess
of swap space to a "wired list" [currently global, per node/zone in
progress] and it seems to alleviate the problem.

So, Christoph's patch addresses a real problem that we've seen.
Unfortunately, not all database applications lock their shmem areas
into memory.  Excluding from consideration for reclaim those pages that
can't possibly be swapped out due to lack of swap space seems a natural
extension of this concept.  I expect that many of Christoph's customers
run with swap space that is much smaller than system memory and would
benefit from this extension.

Lee


* Re: [RFC 0/7] Move mlocked pages off the LRU and track them
  2007-02-06 16:04 ` [RFC 0/7] Move mlocked pages off the LRU and track them Lee Schermerhorn
@ 2007-02-06 16:50   ` Larry Woodman
  2007-02-06 19:51   ` Andrew Morton
  1 sibling, 0 replies; 12+ messages in thread
From: Larry Woodman @ 2007-02-06 16:50 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, linux-mm, akpm, Christoph Hellwig,
	Arjan van de Ven, Nigel Cunningham, Martin J. Bligh,
	Peter Zijlstra, Nick Piggin, Matt Mackall, Rik van Riel,
	KAMEZAWA Hiroyuki

Lee Schermerhorn wrote:

>On Mon, 2007-02-05 at 12:52 -0800, Christoph Lameter wrote:
>  
>
>>[RFC] Remove mlocked pages from the LRU and track them
>>
>>The patchset removes mlocked pages from the LRU and maintains a counter
>>for the number of discovered mlocked pages.
>>
>>This is a lazy scheme for accounting for mlocked pages. The pages
>>may only be discovered to be mlocked during reclaim. However, we attempt
>>to detect mlocked pages at various other opportune moments. So in general
>>the mlock counter is not far off the number of actual mlocked pages in
>>the system.
>>
>>Patch against 2.6.20-rc6-mm3
>>
>>Known problems to be resolved:
>>- Page state bit used to mark a page mlocked is not available on i386 with
>>  NUMA.
>>- Note tested on SMP, UP. Need to catch a plane in 2 hours.
>>
>>Tested on:
>>IA64 NUMA 12p
>>    
>>
>
>Note that anon [and shmem] pages in excess of available swap are
>effectively mlocked().  In the field, we have seen non-NUMA x86_64
>systems with 64-128GB [16-32million 4k pages] with little to no
>swap--big data base servers.  The majority of the memory is dedicated to
>large data base shared memory areas.  The remaining is divided between
>program anon and page cache [executable, libs] pages and any other page
>cache pages used by data base utilities, system daemons, ...
>
>The system runs fine until someone runs a backup [or multiple, as there
>are multiple data base instances running].  This over commits memory and
>we end up with all cpus in reclaim, contending for the zone lru lock,
>and walking an active list of 10s of millions of pages looking for pages
>to reclaim.  The reclaim logic spends a lot of time walking the lru
>lists, nominating shmem pages [the majority of pages on the list] for
>reclaim, only to find in shrink_pages() that it can't move the page to
>swap.  So, it puts it back on the list to be retried by the other cpus
>once they obtain the zone lru lock.  System appears to be hung for long
>periods of time.
>
It's actually more complicated than this.  The pagecache pages from
the backup are placed at the head of the inactive list when they are
initialized via add_to_page_cache().  If they get referenced again,
mark_page_accessed() moves them to the tail of the active list.  At
that point they won't be considered for deactivation and reclamation
until every other page in the system is.  They probably should go on
the head of the inactive list, at least one more time, if they are not
mapped.  Once we start adding clean, unmapped pagecache pages behind
anonymous and mapped System V shared memory pages on the active list,
we start swapping while there are better pages to select for
reclamation.  This prevents kswapd from keeping the free pages above
min, and then every caller of __alloc_pages() starts reclaiming memory
on every zone in the zonelist.  Once the system is in that state, the
whole distress algorithm in shrink_active_list() no longer works right
because it uses prev_priority as the distress shift.  This means that
if one caller of shrink_active_list() is deactivating mapped anonymous
memory because the priority got that high (actually low), every caller
is deactivating mapped anonymous memory.  The system degrades to a
point where both the active and inactive lists contain a mixture of
dirty and clean, mapped and unmapped, pagecache and anonymous memory,
and every CPU is down in try_to_free_pages() mixing it even further.

>
>There are a lot of behaviors in the reclaim code that exacerbate the
>problems when we get into this mode, but the long lists of unswappable
>anon/shmem pages is the major culprit.  One of the guys at Red Hat has
>tried a "proof of concept" patch to move all anon/shmem pages in excess
>of swap space to "wired list" [currently global, per node/zone in
>progress] and it seems to alleviate the problem.  
>
>So, Christoph's patch addresses a real problem that we've seen.
>Unfortunately, not all data base applications lock their shmem areas
>into memory.  Excluding pages from consideration for reclaim that can't
>possibly be swapped out due to lack of swap space seems a natural
>extension of this concept.  I expect that many Christoph's customers run
>with swap space that is much smaller than system memory and would
>benefit from this extension.
>
>Lee
>
>  
>



* Re: [RFC 0/7] Move mlocked pages off the LRU and track them
  2007-02-06 16:04 ` [RFC 0/7] Move mlocked pages off the LRU and track them Lee Schermerhorn
  2007-02-06 16:50   ` Larry Woodman
@ 2007-02-06 19:51   ` Andrew Morton
  2007-02-07 10:51     ` Larry Woodman
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2007-02-06 19:51 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Christoph Lameter, linux-mm, Christoph Hellwig, Arjan van de Ven,
	Nigel Cunningham, Martin J. Bligh, Peter Zijlstra, Nick Piggin,
	Matt Mackall, Rik van Riel, KAMEZAWA Hiroyuki, Larry Woodman

On Tue, 06 Feb 2007 11:04:42 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Note that anon [and shmem] pages in excess of available swap are
> effectively mlocked().  In the field, we have seen non-NUMA x86_64
> systems with 64-128GB [16-32million 4k pages] with little to no
> swap--big data base servers.  The majority of the memory is dedicated to
> large data base shared memory areas.  The remaining is divided between
> program anon and page cache [executable, libs] pages and any other page
> cache pages used by data base utilities, system daemons, ...
> 
> The system runs fine until someone runs a backup [or multiple, as there
> are multiple data base instances running].  This over commits memory and
> we end up with all cpus in reclaim, contending for the zone lru lock,
> and walking an active list of 10s of millions of pages looking for pages
> to reclaim.  The reclaim logic spends a lot of time walking the lru
> lists, nominating shmem pages [the majority of pages on the list] for
> reclaim, only to find in shrink_pages() that it can't move the page to
> swap.  So, it puts it back on the list to be retried by the other cpus
> once they obtain the zone lru lock.  System appears to be hung for long
> periods of time.
> 
> There are a lot of behaviors in the reclaim code that exacerbate the
> problems when we get into this mode, but the long lists of unswappable
> anon/shmem pages is the major culprit.  One of the guys at Red Hat has
> tried a "proof of concept" patch to move all anon/shmem pages in excess
> of swap space to "wired list" [currently global, per node/zone in
> progress] and it seems to alleviate the problem.  
> 
> So, Christoph's patch addresses a real problem that we've seen.
> Unfortunately, not all data base applications lock their shmem areas
> into memory.  Excluding pages from consideration for reclaim that can't
> possibly be swapped out due to lack of swap space seems a natural
> extension of this concept.  I expect that many Christoph's customers run
> with swap space that is much smaller than system memory and would
> benefit from this extension.

Yeah.

The scanner at present tries to handle out-of-swap by moving these pages
onto the active list (shrink_page_list) and then keeping them there
(shrink_active_list), so it _should_ be the case that the performance
problems you're observing are due to active list scanning.  Is that
correct?

If not, something's busted.


* Re: [RFC 0/7] Move mlocked pages off the LRU and track them
  2007-02-06 19:51   ` Andrew Morton
@ 2007-02-07 10:51     ` Larry Woodman
  0 siblings, 0 replies; 12+ messages in thread
From: Larry Woodman @ 2007-02-07 10:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, Christoph Hellwig,
	Arjan van de Ven, Nigel Cunningham, Martin J. Bligh,
	Peter Zijlstra, Nick Piggin, Matt Mackall, Rik van Riel,
	KAMEZAWA Hiroyuki

Andrew Morton wrote:

>On Tue, 06 Feb 2007 11:04:42 -0500
>Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
>
>  
>
>>Note that anon [and shmem] pages in excess of available swap are
>>effectively mlocked().  In the field, we have seen non-NUMA x86_64
>>systems with 64-128GB [16-32million 4k pages] with little to no
>>swap--big data base servers.  The majority of the memory is dedicated to
>>large data base shared memory areas.  The remaining is divided between
>>program anon and page cache [executable, libs] pages and any other page
>>cache pages used by data base utilities, system daemons, ...
>>
>>The system runs fine until someone runs a backup [or multiple, as there
>>are multiple data base instances running].  This over commits memory and
>>we end up with all cpus in reclaim, contending for the zone lru lock,
>>and walking an active list of 10s of millions of pages looking for pages
>>to reclaim.  The reclaim logic spends a lot of time walking the lru
>>lists, nominating shmem pages [the majority of pages on the list] for
>>reclaim, only to find in shrink_pages() that it can't move the page to
>>swap.  So, it puts it back on the list to be retried by the other cpus
>>once they obtain the zone lru lock.  System appears to be hung for long
>>periods of time.
>>
>>There are a lot of behaviors in the reclaim code that exacerbate the
>>problems when we get into this mode, but the long lists of unswappable
>>anon/shmem pages is the major culprit.  One of the guys at Red Hat has
>>tried a "proof of concept" patch to move all anon/shmem pages in excess
>>of swap space to "wired list" [currently global, per node/zone in
>>progress] and it seems to alleviate the problem.  
>>
>>So, Christoph's patch addresses a real problem that we've seen.
>>Unfortunately, not all data base applications lock their shmem areas
>>into memory.  Excluding pages from consideration for reclaim that can't
>>possibly be swapped out due to lack of swap space seems a natural
>>extension of this concept.  I expect that many Christoph's customers run
>>with swap space that is much smaller than system memory and would
>>benefit from this extension.
>>    
>>
>
>Yeah.
>
>The scanner at present tries to handle out-of-swap by moving these pages
>onto the active list (shrink_page_list) then keeping them there
>(shrink_active_list) so it _should_ be the case that the performance
>problems which you're observing are due to active list scanning.  Is that
>correct?
>
>If not, something's busted.
>

This is true, but when mark_page_accessed() activates referenced
pagecache pages it mixes them with the non-swappable anonymous and
System V shared memory pages on the active list.  This, combined with
lots of heavy filesystem writing, prevents kswapd from keeping up with
the memory demand, so the free list(s) fall below zone->pages_min and
every call to __alloc_pages() results in calling try_to_free_pages().
Once all CPUs are scanning and trying to reclaim, the system chokes,
especially on systems with lots of CPUs and lots of RAM.

Larry Woodman


