* [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
@ 2007-01-22  7:09 yunfeng zhang
  2007-01-22 10:21 ` Pavel Machek
  2007-01-22 20:00 ` Al Boldi
  0 siblings, 2 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-01-22  7:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rik van Riel, Pavel Machek, Al Boldi

My patch is based on a new idea for the Linux swap subsystem; you can find
more in Documentation/vm_pps.txt, which is not only an illustration of the
patch but also its changelog. In brief, SwapDaemon should scan and reclaim
pages along UserSpace::vmalist rather than the current zone::active/inactive
lists. The change will noticeably improve swap subsystem performance because

1) SwapDaemon can collect statistics on how a process accesses its pages and
   unmap ptes based on them. SMP benefits in particular, because we can use
   flush_tlb_range to unmap a batch of ptes at once instead of sending a TLB
   IPI per page, as the current legacy swap subsystem does.
2) The page-fault path can issue better readahead requests, since history
   shows that related pages cluster together. In contrast, Linux currently
   reads ahead the pages adjacent to the faulting page's position in
   SwapSpace.
3) It fits naturally with the POSIX madvise API family.
4) It simplifies the Linux memory model dramatically. Keep in mind that the
   new swap strategy works top-down; the legacy swap subsystem is perhaps the
   only one that works bottom-up.

Other questions that have been asked about my pps:
1) pps introduces no new lock ordering; it complies with the Linux lock order
   defined in mm/rmap.c.
2) When a memory node is low, you can set scan_control::reclaim_node to let
   kppsd reclaim pages from that node.

       Signed-off-by: Yunfeng Zhang <zyf.zeroos@gmail.com>

Index: linux-2.6.19/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-01-22 13:52:04.973820224 +0800
@@ -0,0 +1,237 @@
+                         Pure Private Page System (pps)
+                     Copyright by Yunfeng Zhang on GFDL 1.2
+                              zyf.zeroos@gmail.com
+                              December 24-26, 2006
+
+// Purpose <([{
+This file documents the idea first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch described here enhances the performance of the Linux swap subsystem. An
+overview of the idea is in section <How to Reclaim Pages more Efficiently>, and
+section <Pure Private Page System -- pps> shows how I patch it into Linux
+2.6.19.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability: when you
+look down from a manager's view, you free yourself from the disorder of the
+code and spot problems immediately.
+
+In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA; memory architecture-independent layer).
+3) Page table, zone and memory node layer (architecture-dependent).
+You might expect Page/PTE to be placed on the 3rd layer, but here it is placed
+on the 2nd layer since it is the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it is natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, this approach has several virtues
+1) SwapDaemon can collect statistics on how a process accesses its pages and
+   unmap ptes based on them. SMP benefits in particular, because we can use
+   flush_tlb_range to unmap a batch of ptes at once instead of sending a TLB
+   IPI per page, as the current legacy swap subsystem does (see the sketch
+   after this list).
+2) The page-fault path can issue better readahead requests, since history
+   shows that related pages cluster together. In contrast, Linux currently
+   reads ahead the pages adjacent to the faulting page's position in
+   SwapSpace.
+3) It fits naturally with the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy works top-down; the legacy swap subsystem is perhaps the
+   only one that works bottom-up.
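+
+A minimal sketch of the batching idea in point 1; this is illustrative only,
+not the patch's exact code path (the real work is done in
+vmscan.c:shrink_pvma_scan_ptes and the dftlb helpers):
+
+    /* Unmap a run of cold ptes found by the scanner, then flush once. */
+    for (i = 0; i < nr; i++)
+        ptep_get_and_clear(mm, addr + i * PAGE_SIZE, pte + i);
+    flush_tlb_range(vma, addr, addr + nr * PAGE_SIZE);  /* one batched flush */
+
+    /* versus the legacy per-page path, roughly one IPI per page: */
+    ptep_clear_flush(vma, addr, pte);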
+
+Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system built on zone::active_list/inactive_list.
+
+I've finished a patch; see section <Pure Private Page System -- pps>. Note
+that it ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I mentioned in the previous section, applying my idea perfectly would mean
+uprooting the page-centered swap subsystem and migrating it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_cache_add_active calls almost anywhere... It is IMPOSSIBLE for me to
+complete that alone. It is also the difference between my design and Linux: in
+my OS a page is entirely the charge of its new owner, while the Linux page
+management system still traces it through the PG_active flag.
+
+So I conceived another solution:) That is, set up an independent page-recycle
+system rooted in the Linux legacy page system -- pps: intercept all private
+pages belonging to PrivateVMAs into pps, then let pps recycle them. The whole
+job consists of two parts; here is the first, PrivateVMA-oriented, part; the
+other, SharedVMA-oriented part (which should be called SPS) is scheduled for
+the future. Of course, once both are done, the Linux legacy page system will
+be emptied.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages; the whole process is divided into six stages -- <Stage Definition>. pps
+uses init_mm::mmlist to enumerate all swappable UserSpaces (shrink_private_vma
+in mm/vmscan.c). The other sections cover the remaining aspects of pps
+1) <Data Definition>: the basic data definitions.
+2) <Concurrent Racers of Shrinking pps>: synchronization.
+3) <Private Page Lifecycle of pps>: how private pages enter and leave pps.
+4) <VMA Lifecycle of pps>: which VMAs belong to pps.
+5) <Others about pps>: the new daemon thread kppsd, pps statistics, etc.
+
+I am also glad to highlight another new idea of mine -- dftlb, which is
+described in section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced to improve TLB-flushing efficiency. In brief,
+when we want to unmap a page from the page table of a process, why send a TLB
+IPI to the other CPUs immediately? Since every CPU has a timer interrupt, we
+can insert flushing tasks into the timer interrupt path and get the TLB
+flushing almost for free.
+
+The trick is implemented as follows
+1) TLB flushing tasks are added in fill_in_tlb_tasks of mm/vmscan.c.
+2) timer_flush_tlb_tasks, called from the timer tick in kernel/timer.c, is
+   used by the other CPUs to execute the flushing tasks.
+3) All data are defined in include/linux/mm.h.
+4) dftlb is done in stages 1 and 2 of vmscan.c:shrink_pvma_scan_ptes.
+
+Restrictions of dftlb -- the following conditions must be met
+1) The architecture has an atomic cmpxchg instruction.
+2) The CPU atomically sets the access bit when it first touches a pte.
+3) On some architectures the vma parameter of flush_tlb_range may matter;
+   since the vma of a TLB flushing task may already be gone when a CPU
+   executes the task in its timer interrupt, don't use dftlb there.
+If these conditions cannot be met, combine stage 1 with stage 2 and send the
+IPI immediately in fill_in_tlb_tasks.
+
+dftlb increases mm_struct::mm_users to prevent the mm from being freed while
+other CPUs are still working on it.
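+
+A minimal usage sketch of the dftlb helpers this patch adds (the real caller
+is shrink_private_vma/shrink_pvma_scan_ptes in mm/vmscan.c):
+
+    start_tlb_tasks(mm);            /* claim a delay_tlb_task slot for mm */
+    /* stage 1: clear the access bits of a run of ptes, then queue the
+       range so other CPUs flush it from their timer interrupts. */
+    fill_in_tlb_tasks(vma, addr, addr + nr * PAGE_SIZE);
+    end_tlb_tasks();                /* publish cpu_vm_mask; ticks do the rest */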
+// }])>
+
+// Stage Definition <([{
+The whole process of paging out private pages is divided into six stages in
+shrink_pvma_scan_ptes of mm/vmscan.c; the code groups similar ptes/pages into
+a series.
+1) PTE to untouched PTE (clear the access bit); append flushing tasks to dftlb.
+---) Other CPUs execute the flushing tasks in their timer interrupts.
+2) Resume from 1; convert untouched PTEs to UnmappedPTEs (cmpxchg).
+3) Link a SwapEntry to the PrivatePage of every UnmappedPTE.
+4) Flush each PrivatePage to its on-disk SwapPage.
+5) Reclaim the page and shift the UnmappedPTE to a SwappedPTE.
+6) SwappedPTE stage (null operation).
+// }])>
+
+// Data Definition <([{
+A new VMA flag (VM_PURE_PRIVATE) is added in include/linux/mm.h.
+
+A new PTE type (UnmappedPTE) is added to the PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+    int present : 1; // must be 0.
+    ...
+    int pageNum : 20;
+};
+The new PTE type has a useful property: it keeps a link to its PrivatePage
+while preventing the CPU from accessing the page, so it serves as the
+intermediate state in <Stage Definition>.
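+
+A sketch of the i386 transition from a clean, untouched PTE to an UnmappedPTE;
+this is essentially what stage 2 of shrink_pvma_scan_ptes does:
+
+    pte_t unmapped = orig_pte;
+    unmapped.pte_low &= ~_PAGE_PRESENT;   /* CPU can no longer reach the page */
+    unmapped.pte_low |= _PAGE_UNMAPPED;   /* but the pfn link is kept */
+    if (cmpxchg(&ptep->pte_low, orig_pte.pte_low, unmapped.pte_low)
+            == orig_pte.pte_low)
+        ;   /* success: the page is now referenced only via the UnmappedPTE */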
+// }])>
+
+// Concurrent Racers of Shrinking pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances; during scanning and reclamation it read-locks
+mm_struct::mmap_sem, which brings some potential concurrent racers
+1) mm/swapfile.c    pps_swapoff (swapoff API)
+2) mm/memory.c  do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
+   do_swap_page (page fault)
+3) mm/memory.c  get_user_pages  (sometimes the core needs to share a
+   PrivatePage with us)
+
+pps defines no new lock ordering; that is, it complies with the Linux lock
+order.
+// }])>
+
+// Others about pps <([{
+A new kernel thread -- kppsd -- is introduced in mm/vmscan.c; its task is to
+execute the stages of pps periodically. Note that an appropriate timeout is
+necessary so applications get a chance to map their PrivatePages back from
+UnmappedPTE to PTE, that is, to show their conglomeration affinity.
+
+kppsd is controlled by new fields -- scan_control::may_reclaim/reclaim_node.
+may_reclaim = 1 enables reclamation (stage 5). reclaim_node = (node number) is
+used when a memory node is low. A caller should set them in wakeup_sc and then
+wake up kppsd (vmscan.c:balance_pgdat); see the sketch below. Note that if
+kppsd wakes up due to its timeout, it does not do stage 5 at all
+(vmscan.c:kppsd). The other legacy fields still honored are gfp_mask,
+may_writepage and may_swap.
+
+pps statistics are appended to the /proc/meminfo output; their prototype is in
+include/linux/mm.h.
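+
+A sketch of how a low-memory path asks kppsd to reclaim from one node; this is
+what vmscan.c:balance_pgdat does in the patch:
+
+    wakeup_sc = sc;                      /* inherit gfp_mask, may_swap, ... */
+    wakeup_sc.may_reclaim = 1;           /* allow stage 5 (real reclaim) */
+    wakeup_sc.reclaim_node = pgdat->node_id;
+    wake_up_interruptible(&kppsd_wait);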
+// }])>
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is
+PTE or UnmappedPTE. Note that the Linux fork API can make a PrivatePage shared
+by multiple processes, so such pages are excluded from pps.
+
+IN (NOTE: when a pure private page enters pps, it is also trimmed from the
+Linux legacy page system by skipping the lru_cache_add_active call)
+1) fs/exec.c	install_arg_page	(argument pages)
+2) mm/memory.c	do_anonymous_page, do_wp_page, do_swap_page	(page fault)
+3) mm/swap_state.c	read_swap_cache_async	(swap pages)
+
+OUT
+1) mm/vmscan.c  shrink_pvma_scan_ptes   (stage 5, reclaim a private page)
+2) mm/memory.c  zap_pte_range   (free a page)
+3) kernel/fork.c	dup_mmap	(if someone calls fork, migrate all pps pages
+   back so the Linux legacy page system manages them)
+
+While a pure private page is in pps, it can be accessed simultaneously by the
+page-fault path and the SwapDaemon.
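+
+A sketch of the IN path for an anonymous fault, mirroring the
+do_anonymous_page hunk of this patch (illustrative; error paths omitted):
+
+    if (!(vma->vm_flags & VM_PURE_PRIVATE))
+        lru_cache_add_active(page);     /* legacy path */
+    else {
+        atomic_inc(&pps_info.total);    /* page is now tracked only by pps */
+        atomic_inc(&pps_info.pte_count);
+    }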
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, the new flag VM_PURE_PRIVATE is or-ed into it in
+memory.c:enter_pps, where you can also see which VMAs qualify for pps. The
+flag is used mainly in shrink_private_vma of mm/vmscan.c. Other fields are
+left untouched.
+
+IN.
+1) fs/exec.c	setup_arg_pages	(StackVMA)
+2) mm/mmap.c	do_mmap_pgoff, do_brk	(DataVMA)
+3) mm/mmap.c	split_vma, copy_vma	(in some cases we need to copy a VMA from
+   an existing VMA)
+
+OUT.
+1) kernel/fork.c	dup_mmap	(if someone calls fork, the VMA is returned to
+   the Linux legacy system)
+2) mm/mmap.c	remove_vma, vma_adjust	(destroy a VMA)
+3) mm/mmap.c	do_mmap_pgoff	(delete the VMA when an error occurs)
+
+pps VMAs can coexist with madvise, mlock, mprotect, mmap and munmap, which is
+why a new VMA created by mmap.c:split_vma can re-enter pps.
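+
+A sketch of the qualification test done by memory.c:enter_pps -- a VMA joins
+pps only if it is anonymous and carries no flags outside this set:
+
+    int allowed = VM_READ | VM_WRITE | VM_EXEC |
+                  VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC |
+                  VM_GROWSDOWN | VM_GROWSUP |
+                  VM_LOCKED | VM_SEQ_READ | VM_RAND_READ |
+                  VM_DONTCOPY | VM_ACCOUNT | VM_PURE_PRIVATE;
+    if (!(vma->vm_flags & ~allowed) && vma->vm_file == NULL)
+        vma->vm_flags |= VM_PURE_PRIVATE;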
+// }])>
+
+// Postscript <([{
+Note that some circumstances are untested due to hardware restrictions, e.g.
+dftlb on SMP, so there is no guarantee for my dftlb code or EVEN the idea
+itself.
+
+Here are some possible improvements to pps
+1) In fact, I recommend a one-to-one private model -- PrivateVMA, (PTE,
+   UnmappedPTE) and (PrivatePage, DiskSwapPage) -- which is described in my OS
+   and in the Linux kernel mailing list hyperlink above. The current Linux
+   core supports a trick -- COW on PrivatePages -- used by the fork API. That
+   API should be used rarely; the POSIX thread library and vfork/execve are
+   enough for applications, yet fork potentially makes PrivatePages shared. So
+   I think COW is unnecessary for Linux; do copy-on-calling if someone really
+   needs it. If you agree, you will find that UnmappedPTE + PrivatePage IS the
+   Linux swap cache, and that swap_info_struct::swap_map should be a bitmap
+   rather than a (short int) map. Using the Linux legacy SwapCache in my pps
+   is therefore a compromise. That is why my patch is called pps -- pure
+   private (page) system.
+2) SwapSpace should provide more flexible interfaces. shrink_pvma_scan_ptes
+   needs to allocate swap entries in batches -- more precisely, a batch of
+   fake contiguous swap entries; see memory.c:pps_swapin_readahead. In fact,
+   the interface should be overloaded, that is, a swap file should use a
+   different strategy from a swap partition.
+
+If the Linux kernel group cannot schedule a rewrite of its memory code,
+however, pps may be the best solution so far.
+// }])>
+// vim: foldmarker=<([{,}])> foldmethod=marker et
Index: linux-2.6.19/fs/exec.c
===================================================================
--- linux-2.6.19.orig/fs/exec.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/fs/exec.c	2007-01-22 11:40:07.000000000 +0800
@@ -321,8 +321,9 @@
 		pte_unmap_unlock(pte, ptl);
 		goto out;
 	}
+	atomic_inc(&pps_info.total);
+	atomic_inc(&pps_info.pte_count);
 	inc_mm_counter(mm, anon_rss);
-	lru_cache_add_active(page);
 	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
 					page, vma->vm_page_prot))));
 	page_add_new_anon_rmap(page, vma, address);
@@ -437,6 +438,7 @@
 			kmem_cache_free(vm_area_cachep, mpnt);
 			return ret;
 		}
+		enter_pps(mm, mpnt);
 		mm->stack_vm = mm->total_vm = vma_pages(mpnt);
 	}

Index: linux-2.6.19/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.19.orig/fs/proc/proc_misc.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/fs/proc/proc_misc.c	2007-01-22 11:40:07.000000000 +0800
@@ -182,7 +182,11 @@
 		"Committed_AS: %8lu kB\n"
 		"VmallocTotal: %8lu kB\n"
 		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"VmallocChunk: %8lu kB\n"
+		"PPS Total:    %8d kB\n"
+		"PPS PTE:      %8d kB\n"
+		"PPS Unmapped: %8d kB\n"
+		"PPS Swapped:  %8d kB\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -213,7 +217,11 @@
 		K(committed),
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
-		vmi.largest_chunk >> 10
+		vmi.largest_chunk >> 10,
+		K(pps_info.total.counter),
+		K(pps_info.pte_count.counter),
+		K(pps_info.unmapped_count.counter),
+		K(pps_info.swapped_count.counter)
 		);

 		len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.19/include/asm-i386/mmu_context.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/mmu_context.h	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/include/asm-i386/mmu_context.h	2007-01-22 11:40:07.000000000 +0800
@@ -32,6 +32,9 @@
 		/* stop flush ipis for the previous mm */
 		cpu_clear(cpu, prev->cpu_vm_mask);
 #ifdef CONFIG_SMP
+		// vmscan.c::end_tlb_tasks may have copied cpu_vm_mask before we left
+		// prev, so flush any trace of prev left in delay_tlb_tasks.
+		timer_flush_tlb_tasks(NULL);
 		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
 		per_cpu(cpu_tlbstate, cpu).active_mm = next;
 #endif
Index: linux-2.6.19/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/pgtable-2level.h	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/include/asm-i386/pgtable-2level.h	2007-01-22 11:40:07.000000000 +0800
@@ -48,21 +48,21 @@
 }

 /*
- * Bits 0, 6 and 7 are taken, split up the 29 bits of offset
+ * Bits 0, 5, 6 and 7 are taken, split up the 28 bits of offset
  * into this range:
  */
-#define PTE_FILE_MAX_BITS	29
+#define PTE_FILE_MAX_BITS	28

 #define pte_to_pgoff(pte) \
-	((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))
+	((((pte).pte_low >> 1) & 0xf ) + (((pte).pte_low >> 8) << 4 ))

 #define pgoff_to_pte(off) \
-	((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
+	((pte_t) { (((off) & 0xf) << 1) + (((off) >> 4) << 8) + _PAGE_FILE })

 /* Encode and de-code a swap entry */
-#define __swp_type(x)			(((x).val >> 1) & 0x1f)
+#define __swp_type(x)			(((x).val >> 1) & 0xf)
 #define __swp_offset(x)			((x).val >> 8)
-#define __swp_entry(type, offset)	((swp_entry_t) { ((type) << 1) | ((offset) << 8) })
+#define __swp_entry(type, offset)	((swp_entry_t) { ((type & 0xf) << 1) | ((offset) << 8) | _PAGE_SWAPPED })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

Index: linux-2.6.19/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/pgtable.h	2007-01-22 11:39:51.000000000 +0800
+++ linux-2.6.19/include/asm-i386/pgtable.h	2007-01-22 11:40:07.000000000 +0800
@@ -121,7 +121,11 @@
 #define _PAGE_UNUSED3	0x800

 /* If _PAGE_PRESENT is clear, we use these: */
-#define _PAGE_FILE	0x040	/* nonlinear file mapping, saved PTE; unset:swap */
+#define _PAGE_UNMAPPED	0x020	/* a special PTE type which holds its page
+								   reference even though it is unmapped; see
+								   Documentation/vm_pps.txt. */
+#define _PAGE_SWAPPED 0x040 /* swapped PTE. */
+#define _PAGE_FILE	0x060	/* nonlinear file mapping, saved PTE; */
 #define _PAGE_PROTNONE	0x080	/* if the user mapped it with PROT_NONE;
 				   pte_present gives true */
 #ifdef CONFIG_X86_PAE
@@ -227,7 +231,9 @@
 /*
  * The following only works if pte_present() is not true.
  */
-static inline int pte_file(pte_t pte)		{ return (pte).pte_low & _PAGE_FILE; }
+static inline int pte_unmapped(pte_t pte)	{ return ((pte).pte_low & 0x60) == _PAGE_UNMAPPED; }
+static inline int pte_swapped(pte_t pte)	{ return ((pte).pte_low & 0x60) == _PAGE_SWAPPED; }
+static inline int pte_file(pte_t pte)		{ return ((pte).pte_low & 0x60) == _PAGE_FILE; }

 static inline pte_t pte_rdprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
 static inline pte_t pte_exprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
Index: linux-2.6.19/include/linux/mm.h
===================================================================
--- linux-2.6.19.orig/include/linux/mm.h	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/include/linux/mm.h	2007-01-22 11:40:07.000000000 +0800
@@ -168,6 +168,8 @@
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
+#define VM_PURE_PRIVATE	0x04000000	/* The vma belongs to only one mm; see
+									   Documentation/vm_pps.txt */

 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -1165,5 +1167,33 @@

 __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);

+struct pps_info {
+	atomic_t total;
+	atomic_t pte_count; // stage 1 and 2.
+	atomic_t unmapped_count; // stage 3 and 4.
+	atomic_t swapped_count; // stage 6.
+};
+extern struct pps_info pps_info;
+
+/* vmscan.c::delay flush TLB */
+struct delay_tlb_task
+{
+	struct mm_struct* mm;
+	cpumask_t cpu_mask;
+	struct vm_area_struct* vma[32];
+	unsigned long start[32];
+	unsigned long end[32];
+};
+extern struct delay_tlb_task delay_tlb_tasks[32];
+
+// The prototype of the function is fit with the "func" of "int
+// smp_call_function (void (*func) (void *info), void *info, int retry, int
+// wait);" of include/linux/smp.h of 2.6.16.29. Call it with NULL.
+void timer_flush_tlb_tasks(void* data /* = NULL */);
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma);
+void leave_pps(struct vm_area_struct* vma, int migrate_flag);
+
+#define MAX_SERIES_LENGTH 8
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux-2.6.19/include/linux/swapops.h
===================================================================
--- linux-2.6.19.orig/include/linux/swapops.h	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/include/linux/swapops.h	2007-01-22 11:40:07.000000000 +0800
@@ -50,7 +50,7 @@
 {
 	swp_entry_t arch_entry;

-	BUG_ON(pte_file(pte));
+	BUG_ON(!pte_swapped(pte));
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
@@ -64,7 +64,7 @@
 	swp_entry_t arch_entry;

 	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
-	BUG_ON(pte_file(__swp_entry_to_pte(arch_entry)));
+	BUG_ON(!pte_swapped(__swp_entry_to_pte(arch_entry)));
 	return __swp_entry_to_pte(arch_entry);
 }

Index: linux-2.6.19/kernel/fork.c
===================================================================
--- linux-2.6.19.orig/kernel/fork.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/kernel/fork.c	2007-01-22 11:40:07.000000000 +0800
@@ -241,6 +241,7 @@
 		tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (!tmp)
 			goto fail_nomem;
+		leave_pps(mpnt, 1);
 		*tmp = *mpnt;
 		pol = mpol_copy(vma_policy(mpnt));
 		retval = PTR_ERR(pol);
Index: linux-2.6.19/kernel/timer.c
===================================================================
--- linux-2.6.19.orig/kernel/timer.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/kernel/timer.c	2007-01-22 11:40:07.000000000 +0800
@@ -1115,6 +1115,10 @@
 		rcu_check_callbacks(cpu, user_tick);
 	scheduler_tick();
  	run_posix_cpu_timers(p);
+
+#ifdef CONFIG_SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
 }

 /*
Index: linux-2.6.19/mm/fremap.c
===================================================================
--- linux-2.6.19.orig/mm/fremap.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/mm/fremap.c	2007-01-22 11:40:07.000000000 +0800
@@ -37,7 +37,7 @@
 			page_cache_release(page);
 		}
 	} else {
-		if (!pte_file(pte))
+		if (pte_swapped(pte))
 			free_swap_and_cache(pte_to_swp_entry(pte));
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
Index: linux-2.6.19/mm/memory.c
===================================================================
--- linux-2.6.19.orig/mm/memory.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/mm/memory.c	2007-01-22 11:40:07.000000000 +0800
@@ -435,7 +435,7 @@

 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
-		if (!pte_file(pte)) {
+		if (pte_swapped(pte)) {
 			swp_entry_t entry = pte_to_swp_entry(pte);

 			swap_duplicate(entry);
@@ -628,6 +628,9 @@
 	spinlock_t *ptl;
 	int file_rss = 0;
 	int anon_rss = 0;
+	int pps_pte = 0;
+	int pps_unmapped = 0;
+	int pps_swapped = 0;

 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -672,6 +675,13 @@
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				if (page != ZERO_PAGE(addr)) {
+					if (PageWriteback(page))
+						lru_cache_add_active(page);
+					pps_pte++;
+				}
+			}
 			if (PageAnon(page))
 				anon_rss--;
 			else {
@@ -691,12 +701,31 @@
 		 */
 		if (unlikely(details))
 			continue;
-		if (!pte_file(ptent))
+		if (pte_unmapped(ptent)) {
+			struct page *page;
+			page = pfn_to_page(pte_pfn(ptent));
+			BUG_ON(page == ZERO_PAGE(addr));
+			if (PageWriteback(page))
+				lru_cache_add_active(page);
+			pps_unmapped++;
+			ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+			tlb_remove_page(tlb, page);
+			anon_rss--;
+			continue;
+		}
+		if (pte_swapped(ptent)) {
+			if (vma->vm_flags & VM_PURE_PRIVATE)
+				pps_swapped++;
 			free_swap_and_cache(pte_to_swp_entry(ptent));
+		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

 	add_mm_rss(mm, file_rss, anon_rss);
+	atomic_sub(pps_pte + pps_unmapped, &pps_info.total);
+	atomic_sub(pps_pte, &pps_info.pte_count);
+	atomic_sub(pps_unmapped, &pps_info.unmapped_count);
+	atomic_sub(pps_swapped, &pps_info.swapped_count);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);

@@ -955,7 +984,8 @@
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
-		mark_page_accessed(page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			mark_page_accessed(page);
 	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
@@ -1606,7 +1636,12 @@
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
-		lru_cache_add_active(new_page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			lru_cache_add_active(new_page);
+		else {
+			atomic_inc(&pps_info.total);
+			atomic_inc(&pps_info.pte_count);
+		}
 		page_add_new_anon_rmap(new_page, vma, address);

 		/* Free the old page.. */
@@ -1975,6 +2010,84 @@
 }

 /*
+ * New read ahead code, mainly for VM_PURE_PRIVATE only.
+ */
+static void pps_swapin_readahead(swp_entry_t entry, unsigned long addr, struct vm_area_struct *vma, pte_t* pte, pmd_t* pmd)
+{
+	struct page* page;
+	pte_t *prev, *next;
+	swp_entry_t temp;
+	spinlock_t* ptl = pte_lockptr(vma->vm_mm, pmd);
+	int swapType = swp_type(entry);
+	int swapOffset = swp_offset(entry);
+	int readahead = 1, abs;
+
+	if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+		swapin_readahead(entry, addr, vma);
+		return;
+	}
+
+	page = read_swap_cache_async(entry, vma, addr);
+	if (!page)
+		return;
+	page_cache_release(page);
+
+	// read ahead the whole series, first forward then backward.
+	while (readahead < MAX_SERIES_LENGTH) {
+		next = pte++;
+		if (next - (pte_t*) pmd >= PTRS_PER_PTE)
+			break;
+		spin_lock(ptl);
+        if (!(!pte_present(*next) && pte_swapped(*next))) {
+			spin_unlock(ptl);
+			break;
+		}
+		temp = pte_to_swp_entry(*next);
+		spin_unlock(ptl);
+		if (swp_type(temp) != swapType)
+			break;
+		abs = swp_offset(temp) - swapOffset;
+		abs = abs < 0 ? -abs : abs;
+		swapOffset = swp_offset(temp);
+		if (abs > 8)
+			// the two swap entries are too far, give up!
+			break;
+		page = read_swap_cache_async(temp, vma, addr);
+		if (!page)
+			return;
+		page_cache_release(page);
+		readahead++;
+	}
+
+	swapOffset = swp_offset(entry);
+	while (readahead < MAX_SERIES_LENGTH) {
+		prev = pte--;
+		if (prev - (pte_t*) pmd < 0)
+			break;
+		spin_lock(ptl);
+        if (!(!pte_present(*prev) && pte_swapped(*prev))) {
+			spin_unlock(ptl);
+			break;
+		}
+		temp = pte_to_swp_entry(*prev);
+		spin_unlock(ptl);
+		if (swp_type(temp) != swapType)
+			break;
+		abs = swp_offset(temp) - swapOffset;
+		abs = abs < 0 ? -abs : abs;
+		swapOffset = swp_offset(temp);
+		if (abs > 8)
+			// the two swap entries are too far, give up!
+			break;
+		page = read_swap_cache_async(temp, vma, addr);
+		if (!page)
+			return;
+		page_cache_release(page);
+		readahead++;
+	}
+}
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -2001,7 +2114,7 @@
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		grab_swap_token(); /* Contend for token _before_ read-in */
- 		swapin_readahead(entry, address, vma);
+		pps_swapin_readahead(entry, address, vma, page_table, pmd);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
@@ -2021,7 +2134,8 @@
 	}

 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
-	mark_page_accessed(page);
+	if (!(vma->vm_flags & VM_PURE_PRIVATE))
+		mark_page_accessed(page);
 	lock_page(page);

 	/*
@@ -2033,6 +2147,10 @@

 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
+		if (vma->vm_flags & VM_PURE_PRIVATE) {
+			lru_cache_add_active(page);
+			mark_page_accessed(page);
+		}
 		goto out_nomap;
 	}

@@ -2053,6 +2171,11 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);
 	unlock_page(page);
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		atomic_dec(&pps_info.swapped_count);
+		atomic_inc(&pps_info.total);
+		atomic_inc(&pps_info.pte_count);
+	}

 	if (write_access) {
 		if (do_wp_page(mm, vma, address,
@@ -2104,8 +2227,13 @@
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			lru_cache_add_active(page);
+		else {
+			atomic_inc(&pps_info.total);
+			atomic_inc(&pps_info.pte_count);
+		}
 		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
@@ -2392,6 +2520,22 @@

 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
+		if (pte_unmapped(entry)) {
+			BUG_ON(!(vma->vm_flags & VM_PURE_PRIVATE));
+			atomic_dec(&pps_info.unmapped_count);
+			atomic_inc(&pps_info.pte_count);
+			struct page* page = pte_page(entry);
+			pte_t temp_pte = mk_pte(page, vma->vm_page_prot);
+			pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+			if (unlikely(pte_same(*pte, entry))) {
+				page_add_new_anon_rmap(page, vma, address);
+				set_pte_at(mm, address, pte, temp_pte);
+				update_mmu_cache(vma, address, temp_pte);
+				lazy_mmu_prot_update(temp_pte);
+			}
+			pte_unmap_unlock(pte, ptl);
+			return VM_FAULT_MINOR;
+		}
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (vma->vm_ops->nopage)
@@ -2685,3 +2829,118 @@

 	return buf - old_buf;
 }
+
+static void migrate_back_pte_range(struct mm_struct* mm, pmd_t *pmd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	struct page* page;
+	pte_t entry;
+	pte_t *pte;
+	spinlock_t* ptl;
+	int pps_pte = 0;
+	int pps_unmapped = 0;
+	int pps_swapped = 0;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	do {
+		if (!pte_present(*pte) && pte_unmapped(*pte)) {
+			page = pte_page(*pte);
+			entry = mk_pte(page, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			set_pte_at(mm, addr, pte, entry);
+			BUG_ON(page == ZERO_PAGE(addr));
+			page_add_new_anon_rmap(page, vma, addr);
+			lru_cache_add_active(page);
+			pps_unmapped++;
+		} else if (pte_present(*pte)) {
+			page = pte_page(*pte);
+			if (page == ZERO_PAGE(addr))
+				continue;
+			lru_cache_add_active(page);
+			pps_pte++;
+		} else if (!pte_present(*pte) && pte_swapped(*pte))
+			pps_swapped++;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap_unlock(pte - 1, ptl);
+	lru_add_drain();
+	atomic_sub(pps_pte + pps_unmapped, &pps_info.total);
+	atomic_sub(pps_pte, &pps_info.pte_count);
+	atomic_sub(pps_unmapped, &pps_info.unmapped_count);
+	atomic_sub(pps_swapped, &pps_info.swapped_count);
+}
+
+static void migrate_back_pmd_range(struct mm_struct* mm, pud_t *pud, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		migrate_back_pte_range(mm, pmd, vma, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void migrate_back_pud_range(struct mm_struct* mm, pgd_t *pgd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		migrate_back_pmd_range(mm, pud, vma, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+// migrate all pages of pure private vma back to Linux legacy memory management.
+static void migrate_back_legacy_linux(struct mm_struct* mm, struct vm_area_struct* vma)
+{
+	pgd_t* pgd;
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		migrate_back_pud_range(mm, pgd, vma, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma)
+{
+	int condition = VM_READ | VM_WRITE | VM_EXEC | \
+		 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \
+		 VM_GROWSDOWN | VM_GROWSUP | \
+		 VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | VM_ACCOUNT | \
+		 VM_PURE_PRIVATE;
+	if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) {
+		vma->vm_flags |= VM_PURE_PRIVATE;
+		if (list_empty(&mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&mm->mmlist))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+	}
+}
+
+void leave_pps(struct vm_area_struct* vma, int migrate_flag)
+{
+	struct mm_struct* mm = vma->vm_mm;
+
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		vma->vm_flags &= ~VM_PURE_PRIVATE;
+		if (migrate_flag)
+			migrate_back_legacy_linux(mm, vma);
+	}
+}
Index: linux-2.6.19/mm/mmap.c
===================================================================
--- linux-2.6.19.orig/mm/mmap.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/mm/mmap.c	2007-01-22 11:40:07.000000000 +0800
@@ -229,6 +229,7 @@
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_free(vma_policy(vma));
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
 }
@@ -620,6 +621,7 @@
 			fput(file);
 		mm->map_count--;
 		mpol_free(vma_policy(next));
+		leave_pps(next, 0);
 		kmem_cache_free(vm_area_cachep, next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
@@ -1112,6 +1114,8 @@
 	if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
 		vma->vm_flags &= ~VM_ACCOUNT;

+	enter_pps(mm, vma);
+
 	/* Can addr have changed??
 	 *
 	 * Answer: Yes, several device drivers can do it in their
@@ -1138,6 +1142,7 @@
 			fput(file);
 		}
 		mpol_free(vma_policy(vma));
+		leave_pps(vma, 0);
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1165,6 +1170,7 @@
 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 free_vma:
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
@@ -1742,6 +1748,10 @@

 	/* most fields are the same, copy all, and then fixup */
 	*new = *vma;
+	if (new->vm_flags & VM_PURE_PRIVATE) {
+		new->vm_flags &= ~VM_PURE_PRIVATE;
+		enter_pps(mm, new);
+	}

 	if (new_below)
 		new->vm_end = addr;
@@ -1950,6 +1960,7 @@
 	vma->vm_flags = flags;
 	vma->vm_page_prot = protection_map[flags &
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	enter_pps(mm, vma);
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -2073,6 +2084,10 @@
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
+			if (new_vma->vm_flags & VM_PURE_PRIVATE) {
+				new_vma->vm_flags &= ~VM_PURE_PRIVATE;
+				enter_pps(mm, new_vma);
+			}
 			vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		}
 	}
Index: linux-2.6.19/mm/rmap.c
===================================================================
--- linux-2.6.19.orig/mm/rmap.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/mm/rmap.c	2007-01-22 11:40:07.000000000 +0800
@@ -618,6 +618,7 @@
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;

+	BUG_ON(vma->vm_flags & VM_PURE_PRIVATE);
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
 		goto out;
@@ -676,7 +677,7 @@
 #endif
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
-		BUG_ON(pte_file(*pte));
+		BUG_ON(!pte_swapped(*pte));
 	} else
 #ifdef CONFIG_MIGRATION
 	if (migration) {
Index: linux-2.6.19/mm/swap_state.c
===================================================================
--- linux-2.6.19.orig/mm/swap_state.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/mm/swap_state.c	2007-01-22 11:40:07.000000000 +0800
@@ -354,7 +354,8 @@
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active(new_page);
+			if (vma == NULL || !(vma->vm_flags & VM_PURE_PRIVATE))
+				lru_cache_add_active(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
Index: linux-2.6.19/mm/swapfile.c
===================================================================
--- linux-2.6.19.orig/mm/swapfile.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/mm/swapfile.c	2007-01-22 11:40:07.000000000 +0800
@@ -505,6 +505,166 @@
 }
 #endif

+static int pps_test_swap_type(struct mm_struct* mm, pmd_t* pmd, pte_t* pte, int
+		type, struct page** ret_page)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	swp_entry_t entry;
+	struct page* page;
+
+	spin_lock(ptl);
+	if (!pte_present(*pte) && pte_swapped(*pte)) {
+		entry = pte_to_swp_entry(*pte);
+		if (swp_type(entry) == type) {
+			*ret_page = NULL;
+			spin_unlock(ptl);
+			return 1;
+		}
+	} else {
+		page = pfn_to_page(pte_pfn(*pte));
+		if (PageSwapCache(page)) {
+			entry.val = page_private(page);
+			if (swp_type(entry) == type) {
+				page_cache_get(page);
+				*ret_page = page;
+				spin_unlock(ptl);
+				return 1;
+			}
+		}
+	}
+	spin_unlock(ptl);
+	return 0;
+}
+
+static int pps_swapoff_scan_ptes(struct mm_struct* mm, struct vm_area_struct*
+		vma, pmd_t* pmd, unsigned long addr, unsigned long end, int type)
+{
+	pte_t *pte;
+	struct page* page;
+
+	pte = pte_offset_map(pmd, addr);
+	do {
+		while (pps_test_swap_type(mm, pmd, pte, type, &page)) {
+			if (page == NULL) {
+				switch (__handle_mm_fault(mm, vma, addr, 0)) {
+				case VM_FAULT_SIGBUS:
+				case VM_FAULT_OOM:
+					return -ENOMEM;
+				case VM_FAULT_MINOR:
+				case VM_FAULT_MAJOR:
+					break;
+				default:
+					BUG();
+				}
+			} else {
+				wait_on_page_locked(page);
+				wait_on_page_writeback(page);
+				lock_page(page);
+				if (!PageSwapCache(page)) {
+					unlock_page(page);
+					page_cache_release(page);
+					break;
+				}
+				wait_on_page_writeback(page);
+				delete_from_swap_cache(page);
+				unlock_page(page);
+				page_cache_release(page);
+				break;
+			}
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pmd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pud_t* pud, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		ret = pps_swapoff_scan_ptes(mm, vma, pmd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pmd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pud_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pgd_t* pgd, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		ret = pps_swapoff_pmd_range(mm, vma, pud, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pud++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pgd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, int type)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	int ret;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		ret = pps_swapoff_pud_range(mm, vma, pgd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pgd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff(int type)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+	int ret = 0;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+			if (!(vma->vm_flags & VM_PURE_PRIVATE))
+				continue;
+			if (vma->vm_flags & VM_LOCKED)
+				continue;
+			ret = pps_swapoff_pgd_range(mm, vma, type);
+			if (ret == -ENOMEM)
+				break;
+		}
+		up_read(&mm->mmap_sem);
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+	return ret;
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -698,6 +858,12 @@
 	int reset_overflow = 0;
 	int shmem;

+	// Let's first read all pps pages back! Note, it's one-to-one mapping.
+	retval = pps_swapoff(type);
+	if (retval == -ENOMEM) // something was wrong.
+		return -ENOMEM;
+	// Now, the remain pages are shared pages, go ahead!
+
 	/*
 	 * When searching mms for an entry, a good strategy is to
 	 * start at the first mm we freed the previous entry from
@@ -918,16 +1084,20 @@
  */
 static void drain_mmlist(void)
 {
-	struct list_head *p, *next;
+	// struct list_head *p, *next;
 	unsigned int i;

 	for (i = 0; i < nr_swapfiles; i++)
 		if (swap_info[i].inuse_pages)
 			return;
+	/*
+	 * Now, init_mm.mmlist list not only is used by SwapDevice but also is used
+	 * by PPS, see Documentation/vm_pps.txt.
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
 		list_del_init(p);
 	spin_unlock(&mmlist_lock);
+	*/
 }

 /*
Index: linux-2.6.19/mm/vmscan.c
===================================================================
--- linux-2.6.19.orig/mm/vmscan.c	2007-01-22 11:39:50.000000000 +0800
+++ linux-2.6.19/mm/vmscan.c	2007-01-22 13:45:58.501581280 +0800
@@ -66,6 +66,10 @@
 	int swappiness;

 	int all_unreclaimable;
+
+	/* pps control command. See Documentation/vm_pps.txt. */
+	int may_reclaim;
+	int reclaim_node;
 };

 /*
@@ -1097,6 +1101,434 @@
 	return ret;
 }

+// pps fields.
+static wait_queue_head_t kppsd_wait;
+static struct scan_control wakeup_sc;
+struct pps_info pps_info = {
+	.total = ATOMIC_INIT(0),
+	.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
+	.unmapped_count = ATOMIC_INIT(0), // stage 3 and 4.
+	.swapped_count = ATOMIC_INIT(0) // stage 6.
+};
+// pps end.
+
+struct series_t {
+	pte_t orig_ptes[MAX_SERIES_LENGTH];
+	pte_t* ptes[MAX_SERIES_LENGTH];
+	struct page* pages[MAX_SERIES_LENGTH];
+	int series_length;
+	int series_stage;
+} series;
+
+static int get_series_stage(pte_t* pte, int index)
+{
+	series.orig_ptes[index] = *pte;
+	series.ptes[index] = pte;
+	if (pte_present(series.orig_ptes[index])) {
+		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+		series.pages[index] = page;
+		if (page == ZERO_PAGE(addr)) // reserved page is exclusive from us.
+			return 7;
+		if (pte_young(series.orig_ptes[index])) {
+			return 1;
+		} else
+			return 2;
+	} else if (pte_unmapped(series.orig_ptes[index])) {
+		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+		series.pages[index] = page;
+		if (!PageSwapCache(page))
+			return 3;
+		else {
+			if (PageWriteback(page) || PageDirty(page))
+				return 4;
+			else
+				return 5;
+		}
+	} else // pte_swapped -- SwappedPTE
+		return 6;
+}
+
+static void find_series(pte_t** start, unsigned long* addr, unsigned long end)
+{
+	int i;
+	int series_stage = get_series_stage((*start)++, 0);
+	*addr += PAGE_SIZE;
+
+	for (i = 1; i < MAX_SERIES_LENGTH && *addr < end; i++, (*start)++, *addr += PAGE_SIZE) {
+		if (series_stage != get_series_stage(*start, i))
+			break;
+	}
+	series.series_stage = series_stage;
+	series.series_length = i;
+}
+
+struct delay_tlb_task delay_tlb_tasks[32] = { [0 ... 31] = {0} };
+
+void timer_flush_tlb_tasks(void* data)
+{
+	int i;
+#ifdef CONFIG_X86
+	int flag = 0;
+#endif
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL &&
+				cpu_isset(smp_processor_id(), delay_tlb_tasks[i].mm->cpu_vm_mask) &&
+				cpu_isset(smp_processor_id(), delay_tlb_tasks[i].cpu_mask)) {
+#ifdef CONFIG_X86
+			flag = 1;
+#else
+			// smp::local_flush_tlb_range(delay_tlb_tasks[i]);
+#endif
+			cpu_clear(smp_processor_id(), delay_tlb_tasks[i].cpu_mask);
+		}
+	}
+#ifdef CONFIG_X86
+	if (flag)
+		local_flush_tlb();
+#endif
+}
+
+static struct delay_tlb_task* delay_task = NULL;
+static int vma_index = 0;
+
+static struct delay_tlb_task* search_free_tlb_tasks_slot(void)
+{
+	struct delay_tlb_task* ret = NULL;
+	int i;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+				ret = &delay_tlb_tasks[i];
+			}
+		} else
+			ret = &delay_tlb_tasks[i];
+	}
+	if (!ret) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	return ret;
+}
+
+static void init_delay_task(struct mm_struct* mm)
+{
+	cpus_clear(delay_task->cpu_mask);
+	vma_index = 0;
+	delay_task->mm = mm;
+}
+
+/*
+ * We will be working on the mm, so let's force to flush it if necessary.
+ */
+static void start_tlb_tasks(struct mm_struct* mm)
+{
+	int i, flag = 0;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm == mm) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+			} else
+				flag = 1;
+		}
+	}
+	if (flag) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	BUG_ON(delay_task != NULL);
+	delay_task = search_free_tlb_tasks_slot();
+	init_delay_task(mm);
+}
+
+static void end_tlb_tasks(void)
+{
+	atomic_inc(&delay_task->mm->mm_users);
+	delay_task->cpu_mask = delay_task->mm->cpu_vm_mask;
+	delay_task = NULL;
+#ifndef CONFIG_SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
+}
+
+static void fill_in_tlb_tasks(struct vm_area_struct* vma, unsigned long addr,
+		unsigned long end)
+{
+	struct mm_struct* mm;
+	// First, try to combine the task with the previous.
+	if (vma_index != 0 && delay_task->vma[vma_index - 1] == vma &&
+			delay_task->end[vma_index - 1] == addr) {
+		delay_task->end[vma_index - 1] = end;
+		return;
+	}
+fill_it:
+	if (vma_index != 32) {
+		delay_task->vma[vma_index] = vma;
+		delay_task->start[vma_index] = addr;
+		delay_task->end[vma_index] = end;
+		vma_index++;
+		return;
+	}
+	mm = delay_task->mm;
+	end_tlb_tasks();
+
+	delay_task = search_free_tlb_tasks_slot();
+	init_delay_task(mm);
+	goto fill_it;
+}
+
+static void shrink_pvma_scan_ptes(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned long addr,
+		unsigned long end)
+{
+	int i, statistic;
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	pte_t* pte = pte_offset_map(pmd, addr);
+	int anon_rss = 0;
+	struct pagevec freed_pvec;
+	int may_enter_fs = (sc->gfp_mask & (__GFP_FS | __GFP_IO));
+	struct address_space* mapping = &swapper_space;
+
+	pagevec_init(&freed_pvec, 1);
+	do {
+		memset(&series, 0, sizeof(struct series_t));
+		find_series(&pte, &addr, end);
+		if (sc->may_reclaim == 0 && series.series_stage == 5)
+			continue;
+		switch (series.series_stage) {
+			case 1: // PTE -- untouched PTE.
+				for (i = 0; i < series.series_length; i++) {
+					struct page* page = series.pages[i];
+					lock_page(page);
+					spin_lock(ptl);
+					if (unlikely(pte_same(*series.ptes[i], series.orig_ptes[i]))) {
+						if (pte_dirty(*series.ptes[i]))
+							set_page_dirty(page);
+						set_pte_at(mm, addr + i * PAGE_SIZE, series.ptes[i],
+								pte_mkold(pte_mkclean(*series.ptes[i])));
+					}
+					spin_unlock(ptl);
+					unlock_page(page);
+				}
+				fill_in_tlb_tasks(vma, addr, addr + (PAGE_SIZE * series.series_length));
+				break;
+			case 2: // untouched PTE -- UnmappedPTE.
+				/*
+				 * Note in stage 1, we've flushed TLB in fill_in_tlb_tasks, so
+				 * if it's still clear here, we can shift it to Unmapped type.
+				 *
+				 * If some architecture doesn't support atomic cmpxchg
+				 * instruction or can't atomically set the access bit after
+				 * they touch a pte at first, combine stage 1 with stage 2, and
+				 * send IPI immediately in fill_in_tlb_tasks.
+				 */
+				spin_lock(ptl);
+				statistic = 0;
+				for (i = 0; i < series.series_length; i++) {
+					if (unlikely(pte_same(*series.ptes[i], series.orig_ptes[i]))) {
+						pte_t pte_unmapped = series.orig_ptes[i];
+						pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+						pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+						if (cmpxchg(&series.ptes[i]->pte_low,
+									series.orig_ptes[i].pte_low,
+									pte_unmapped.pte_low) !=
+								series.orig_ptes[i].pte_low)
+							continue;
+						page_remove_rmap(series.pages[i], vma);
+						anon_rss--;
+						statistic++;
+					}
+				}
+				atomic_add(statistic, &pps_info.unmapped_count);
+				atomic_sub(statistic, &pps_info.pte_count);
+				spin_unlock(ptl);
+				break;
+			case 3: // Attach SwapPage to PrivatePage.
+				/*
+				 * A better arithmetic should be applied to Linux SwapDevice to
+				 * allocate fake continual SwapPages which are close to each
+				 * other, the offset between two close SwapPages is less than 8.
+				 */
+				if (sc->may_swap) {
+					for (i = 0; i < series.series_length; i++) {
+						lock_page(series.pages[i]);
+						if (!PageSwapCache(series.pages[i])) {
+							if (!add_to_swap(series.pages[i], GFP_ATOMIC)) {
+								unlock_page(series.pages[i]);
+								break;
+							}
+						}
+						unlock_page(series.pages[i]);
+					}
+				}
+				break;
+			case 4: // SwapPage isn't consistent with PrivatePage.
+				/*
+				 * A mini version pageout().
+				 *
+				 * Current swap space can't commit multiple pages together:(
+				 */
+				if (sc->may_writepage && may_enter_fs) {
+					for (i = 0; i < series.series_length; i++) {
+						struct page* page = series.pages[i];
+						int res;
+
+						if (!may_write_to_queue(mapping->backing_dev_info))
+							break;
+						lock_page(page);
+						if (!PageDirty(page) || PageWriteback(page)) {
+							unlock_page(page);
+							continue;
+						}
+						clear_page_dirty_for_io(page);
+						struct writeback_control wbc = {
+							.sync_mode = WB_SYNC_NONE,
+							.nr_to_write = SWAP_CLUSTER_MAX,
+							.nonblocking = 1,
+							.for_reclaim = 1,
+						};
+						page_cache_get(page);
+						SetPageReclaim(page);
+						res = swap_writepage(page, &wbc);
+						if (res < 0) {
+							handle_write_error(mapping, page, res);
+							ClearPageReclaim(page);
+							page_cache_release(page);
+							break;
+						}
+						if (!PageWriteback(page))
+							ClearPageReclaim(page);
+						page_cache_release(page);
+					}
+				}
+				break;
+			case 5: // UnmappedPTE -- SwappedPTE, reclaim PrivatePage.
+				statistic = 0;
+				for (i = 0; i < series.series_length; i++) {
+					struct page* page = series.pages[i];
+					if (!(page_to_nid(page) == sc->reclaim_node ||
+							sc->reclaim_node == -1))
+						continue;
+
+					lock_page(page);
+					spin_lock(ptl);
+					if (!pte_same(*series.ptes[i], series.orig_ptes[i]) ||
+							/* We're racing with get_user_pages. */
+							(PageSwapCache(page) ? page_count(page) > 2 :
+							page_count(page) > 1)) {
+						spin_unlock(ptl);
+						unlock_page(page);
+						continue;
+					}
+					statistic++;
+					swp_entry_t entry = { .val = page_private(page) };
+					swap_duplicate(entry);
+					pte_t pte_swp = swp_entry_to_pte(entry);
+					set_pte_at(mm, addr + i * PAGE_SIZE, series.ptes[i], pte_swp);
+					spin_unlock(ptl);
+					if (PageSwapCache(page) && !PageWriteback(page))
+						delete_from_swap_cache(page);
+					unlock_page(page);
+
+					if (!pagevec_add(&freed_pvec, page))
+						__pagevec_release_nonlru(&freed_pvec);
+				}
+				atomic_add(statistic, &pps_info.swapped_count);
+				atomic_sub(statistic, &pps_info.unmapped_count);
+				atomic_sub(statistic, &pps_info.total);
+				break;
+			case 6:
+				// NULL operation!
+				break;
+		}
+	} while (addr < end);
+	add_mm_counter(mm, anon_rss, anon_rss);
+	if (pagevec_count(&freed_pvec))
+		__pagevec_release_nonlru(&freed_pvec);
+}
+
+static void shrink_pvma_pmd_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pud_t* pud, unsigned long addr,
+		unsigned long end)
+{
+	unsigned long next;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pud_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned long addr,
+		unsigned long end)
+{
+	unsigned long next;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pgd_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+static void shrink_private_vma(struct scan_control* sc)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		start_tlb_tasks(mm);
+		if (down_read_trylock(&mm->mmap_sem)) {
+			for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+				if (!(vma->vm_flags & VM_PURE_PRIVATE))
+					continue;
+				if (vma->vm_flags & VM_LOCKED)
+					continue;
+				shrink_pvma_pgd_range(sc, mm, vma);
+			}
+			up_read(&mm->mmap_sem);
+		}
+		end_tlb_tasks();
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
@@ -1144,6 +1576,11 @@
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);

+	wakeup_sc = sc;
+	wakeup_sc.may_reclaim = 1;
+	wakeup_sc.reclaim_node = pgdat->node_id;
+	wake_up_interruptible(&kppsd_wait);
+
 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;

@@ -1723,3 +2160,39 @@
 	return __zone_reclaim(zone, gfp_mask, order);
 }
 #endif
+
+static int kppsd(void* p)
+{
+	struct task_struct *tsk = current;
+	int timeout;
+	DEFINE_WAIT(wait);
+	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	struct scan_control default_sc;
+	default_sc.gfp_mask = GFP_KERNEL;
+	default_sc.may_writepage = 1;
+	default_sc.may_swap = 1;
+	default_sc.may_reclaim = 0;
+	default_sc.reclaim_node = -1;
+
+	while (1) {
+		try_to_freeze();
+		prepare_to_wait(&kppsd_wait, &wait, TASK_INTERRUPTIBLE);
+		timeout = schedule_timeout(2000);
+		finish_wait(&kppsd_wait, &wait);
+
+		if (timeout)
+			shrink_private_vma(&wakeup_sc);
+		else
+			shrink_private_vma(&default_sc);
+	}
+	return 0;
+}
+
+static int __init kppsd_init(void)
+{
+	init_waitqueue_head(&kppsd_wait);
+	kthread_run(kppsd, NULL, "kppsd");
+	return 0;
+}
+
+module_init(kppsd_init)


* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-01-22  7:09 [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem yunfeng zhang
@ 2007-01-22 10:21 ` Pavel Machek
  2007-01-22 20:00 ` Al Boldi
  1 sibling, 0 replies; 27+ messages in thread
From: Pavel Machek @ 2007-01-22 10:21 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel, Rik van Riel, Al Boldi

Hi!

> My patch is based on a new idea for the Linux swap subsystem; you can find
> more in Documentation/vm_pps.txt, which is not only an illustration of the
> patch but also its changelog. In brief, SwapDaemon should scan and reclaim
> pages along UserSpace::vmalist rather than the current zone::active/inactive
> lists. The change will noticeably improve swap subsystem performance because

No, this is not the way to submit a major rewrite of the swap subsystem.

You need to (at minimum, making fundamental changes _is_ hard):

1) Fix your mailer not to wordwrap.

2) Get some testing. Identify workloads it improves.

3) Get some _external_ testing. You are retransmitting a wordwrapped
patch. That means no one other than you is actually using it.

4) Don't cc me; I'm not an mm expert, and I tend to read l-k, anyway.

								Pavel

> +                         Pure Private Page System (pps)
> +                     Copyright by Yunfeng Zhang on GFDL 1.2

I am not sure GFDL is GPL compatible.

> +// Purpose <([{

You certainly have an "interesting" heading style. What is this markup?
> +
> +// The prototype of the function is fit with the "func" of "int
> +// smp_call_function (void (*func) (void *info), void *info, int retry, int
> +// wait);" of include/linux/smp.h of 2.6.16.29. Call it with NULL.
> +void timer_flush_tlb_tasks(void* data /* = NULL */);

I thought I told you to read the CodingStyle in some previous mail?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-01-22  7:09 [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem yunfeng zhang
  2007-01-22 10:21 ` Pavel Machek
@ 2007-01-22 20:00 ` Al Boldi
  2007-01-23  4:21   ` yunfeng zhang
  1 sibling, 1 reply; 27+ messages in thread
From: Al Boldi @ 2007-01-22 20:00 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: Rik van Riel, Pavel Machek, linux-kernel

yunfeng zhang wrote:
> My patch is based on my new idea to Linux swap subsystem, you can find
> more in Documentation/vm_pps.txt which isn't only patch illustration but
> also file changelog. In brief, SwapDaemon should scan and reclaim pages on
> UserSpace::vmalist other than current zone::active/inactive. The change
> will conspicuously enhance swap subsystem performance by
>
> 1) SwapDaemon can collect the statistic of process acessing pages and by
> it unmaps ptes, SMP specially benefits from it for we can use
> flush_tlb_range to unmap ptes batchly rather than frequently TLB IPI
> interrupt per a page in current Linux legacy swap subsystem.
> 2) Page-fault can issue better readahead requests since history data shows
> all related pages have conglomerating affinity. In contrast, Linux
> page-fault readaheads the pages relative to the SwapSpace position of
> current page-fault page.
> 3) It's conformable to POSIX madvise API family.
> 4) It simplifies Linux memory model dramatically. Keep it in mind that new
> swap strategy is from up to down. In fact, Linux legacy swap subsystem is
> maybe the only one from down to up.
>
> Other problems asked about my pps are
> 1) There isn't new lock order in my pps, it's compliant to Linux lock
> order defined in mm/rmap.c.
> 2) When a memory inode is low, you can set scan_control::reclaim_node to
> let my kppsd to reclaim the memory inode page.

Patched against 2.6.19, it leads to:

mm/vmscan.c: In function `shrink_pvma_scan_ptes':
mm/vmscan.c:1340: too many arguments to function `page_remove_rmap'

So changed
 page_remove_rmap(series.pages[i], vma);
to
 page_remove_rmap(series.pages[i]);

But your patch doesn't offer any swap-performance improvement for either swsusp 
or tmpfs.  Swap-in is still half the speed of swap-out.
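
For reference, a rough way to reproduce this kind of comparison from user space
is to walk a large anonymous mapping twice on a box whose RAM is smaller than
the mapping: the first pass mostly pays for swap-out, the second pass
additionally pays for swap-in. The probe below is only a sketch of mine, not
anything from the patch; the region size and page stride are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 1024;
	size_t len = mb << 20;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timeval t0, t1;
	size_t i;
	int pass;

	if (buf == MAP_FAILED) { perror("mmap"); return 1; }
	for (pass = 0; pass < 2; pass++) {
		gettimeofday(&t0, NULL);
		for (i = 0; i < len; i += 4096)
			buf[i]++;	/* fault in and dirty every page */
		gettimeofday(&t1, NULL);
		printf("pass %d: %.2f s\n", pass,
		       (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_usec - t0.tv_usec) / 1e6);
	}
	munmap(buf, len);
	return 0;
}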


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-01-22 20:00 ` Al Boldi
@ 2007-01-23  4:21   ` yunfeng zhang
  2007-01-23  5:08     ` yunfeng zhang
  0 siblings, 1 reply; 27+ messages in thread
From: yunfeng zhang @ 2007-01-23  4:21 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

>
> Patched against 2.6.19 leads to:
>
> mm/vmscan.c: In function `shrink_pvma_scan_ptes':
> mm/vmscan.c:1340: too many arguments to function `page_remove_rmap'
>
> So changed
>  page_remove_rmap(series.pages[i], vma);
> to
>  page_remove_rmap(series.pages[i]);
>

I've been working on 2.6.19; when updating to 2.6.20-rc5, the function signature changed.
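
For anyone who wants to build the same patch on both trees, a version-conditional
wrapper is one way to paper over the difference. This is only a sketch, not part
of the patch, and it assumes the single-argument 2.6.19 and two-argument
2.6.20-rc prototypes implied by the compile error above:

#include <linux/version.h>
#include <linux/rmap.h>

static inline void pps_page_remove_rmap(struct page *page,
					struct vm_area_struct *vma)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,20)
	page_remove_rmap(page, vma);	/* vma argument added in 2.6.20-rc */
#else
	page_remove_rmap(page);		/* 2.6.19 prototype */
#endif
}

Calls in shrink_pvma_scan_ptes would then go through the wrapper instead of
calling page_remove_rmap directly.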

> But your patch doesn't offer any swap-performance improvement for both swsusp
> or tmpfs.  Swap-in is still half speed of Swap-out.
>
>
Current Linux page allocation provides pages fairly to every process, and the
swap daemon is only started when memory is low, so by the time it starts to
scan the active_list, the private pages of different processes are already
interleaved with each other. vmscan.c:shrink_list() is the only place that
attaches a disk swap page to a page on the active_list, and as a result, all
private pages lose their affinity on the swap partition. I will give a test
later...
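
To make the affinity point concrete, here is a small stand-alone user-space
sketch (my illustration only, not code from the patch; it assumes the simplified
view that swap slots are handed out in the order pages reach the scanner). With
the legacy scan, pages of two processes reach shrink_list() interleaved and get
interleaved swap offsets; with a per-VMA scan such as shrink_private_vma, each
process's pages receive consecutive offsets, which is what makes PTE-neighbour
readahead worthwhile:

#include <stdio.h>

#define PAGES_PER_PROC 8

int main(void)
{
	int legacy[2][PAGES_PER_PROC], per_vma[2][PAGES_PER_PROC];
	int slot, p, i;

	/* Legacy: the global LRU interleaves the two processes' pages,
	 * so their swap offsets interleave as well. */
	slot = 0;
	for (i = 0; i < PAGES_PER_PROC; i++)
		for (p = 0; p < 2; p++)
			legacy[p][i] = slot++;

	/* Per-VMA scan: each VMA is drained as a unit, so its pages
	 * get consecutive swap offsets. */
	slot = 0;
	for (p = 0; p < 2; p++)
		for (i = 0; i < PAGES_PER_PROC; i++)
			per_vma[p][i] = slot++;

	for (p = 0; p < 2; p++) {
		printf("process %c, legacy :", 'A' + p);
		for (i = 0; i < PAGES_PER_PROC; i++)
			printf(" %2d", legacy[p][i]);
		printf("\nprocess %c, per-VMA:", 'A' + p);
		for (i = 0; i < PAGES_PER_PROC; i++)
			printf(" %2d", per_vma[p][i]);
		printf("\n");
	}
	return 0;
}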

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-01-23  4:21   ` yunfeng zhang
@ 2007-01-23  5:08     ` yunfeng zhang
  2007-01-24 21:15       ` Hugh Dickins
  0 siblings, 1 reply; 27+ messages in thread
From: yunfeng zhang @ 2007-01-23  5:08 UTC (permalink / raw)
  To: linux-kernel

Re-coded my patch with tab = 8. Sorry!

       Signed-off-by: Yunfeng Zhang <zyf.zeroos@gmail.com>

Index: linux-2.6.19/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-01-23 11:32:02.000000000 +0800
@@ -0,0 +1,236 @@
+                         Pure Private Page System (pps)
+                              zyf.zeroos@gmail.com
+                              December 24-26, 2006
+
+// Purpose <([{
+This file documents the idea which was first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch described in this document is for enhancing the performance of the Linux
+swap subsystem. You can find the overview of the idea in section <How to
+Reclaim Pages more Efficiently> and how I patch it into Linux 2.6.19 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability; when you
+look down from a manager's view, you relieve yourself from disordered code and
+find some problems immediately.
+
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+You may feel that Page/PTE should be placed on the 3rd layer, but here it's
+placed on the 2nd layer since it's the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it's natural
+that the swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, it has some virtues
+1) The SwapDaemon can collect statistics on how processes access pages and
+   unmap ptes accordingly; SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than sending a TLB IPI
+   interrupt per page as in the current Linux legacy swap subsystem.
+2) Page faults can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readahead fetches the pages near the current page-fault page's position in
+   SwapSpace.
+3) It conforms to the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the new
+   swap strategy works from the top down. In fact, the Linux legacy swap
+   subsystem is maybe the only one that works from the bottom up.
+
+Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system built on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I mentioned in the previous section, perfectly applying my idea requires
+uprooting the page-centered swap subsystem and migrating it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_cache_add_active calls everywhere ... It's IMPOSSIBLE for me to
+complete it by myself alone. It's also the difference between my design and
+Linux: in my OS, a page is totally in the charge of its new owner; in Linux,
+however, the page management system still traces it via the PG_active flag.
+
+So I conceived another solution:) That is, set up an independent page-reclaim
+system rooted in the Linux legacy page system -- pps, intercept all private
+pages belonging to a PrivateVMA into pps, then use pps to recycle them.  By the
+way, the whole job consists of two parts; here is the first --
+PrivateVMA-oriented, the other is SharedVMA-oriented (it should be called SPS)
+and is scheduled for the future. Of course, if both are done, the Linux legacy
+page system will be emptied.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages; the whole process is divided into six stages -- <Stage Definition>. PPS
+uses init_mm::mmlist to enumerate all swappable UserSpaces (shrink_private_vma
+of mm/vmscan.c). Other sections show the remaining aspects of pps
+1) <Data Definition> is basic data definition.
+2) <Concurrent racers of Shrinking pps> is focused on synchronization.
+3) <Private Page Lifecycle of pps> how private pages enter in/go off pps.
+4) <VMA Lifecycle of pps> which VMA is belonging to pps.
+5) <Others about pps> new daemon thread kppsd, pps statistic data etc.
+
+I'm also glad to highlight a new idea of mine -- dftlb, which is described in
+section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB flushing efficiency. In
+brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to the other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt routine to
+implement TLB flushing free of charge.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   flushing tasks.
+3) all data are defined in include/linux/mm.h.
+4) dftlb is done on stage 1 and 2 of vmscan.c:shrink_pvma_scan_ptes.
+
+The restrictions of dftlb. The following conditions must be met
+1) an atomic cmpxchg instruction.
+2) the CPU sets the access bit atomically when it first touches a pte.
+3) On some architectures the vma parameter of flush_tlb_range may be important;
+   if that is the case, don't use dftlb, since the vma of a TLB flushing task
+   may already be gone when a CPU starts to execute the task in its timer
+   interrupt. Instead,
+combine stage 1 with stage 2, and send the IPI immediately in fill_in_tlb_tasks.
+
+dftlb increases mm_struct::mm_users to prevent the mm from being freed while
+another CPU works on it.
+// }])>
+
+// Stage Definition <([{
+The whole process of private-page page-out is divided into six stages in
+shrink_pvma_scan_ptes of mm/vmscan.c; the code groups similar ptes/pages into
+a series.
+1) PTE to untouched PTE (clear access bit), append flushing tasks to dftlb.
+---) Other CPUs do flushing tasks in their timer interrupt.
+2) Resume from 1, convert untouched PTE to UnmappedPTE (cmpxchg).
+3) Link SwapEntry to PrivatePage of every UnmappedPTE.
+4) Flush PrivatePage to its disk SwapPage.
+5) Reclaim the page and shift UnmappedPTE to SwappedPTE.
+6) SwappedPTE stage (Null operation).
+// }])>
+
+// Data Definition <([{
+New VMA flag (VM_PURE_PRIVATE) is appended into VMA in include/linux/mm.h.
+
+New PTE type (UnmappedPTE) is appended into PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+    int present : 1; // must be 0.
+    ...
+    int pageNum : 20;
+};
+The new PTE has a feature: it keeps a link to its PrivatePage while preventing
+the page from being visited by the CPU, so you can use it in <Stage Definition>
+as an intermediate state.
+// }])>
+
+// Concurrent Racers of Shrinking pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances; during the process of scanning and reclamation, it
+read-locks mm_struct::mmap_sem, which brings some potential concurrent racers
+1) mm/swapfile.c pps_swapoff    (swapoff API)
+2) mm/memory.c   do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
+   do_swap_page (page-fault)
+3) mm/memory.c   get_user_pages (sometimes core need share PrivatePage with us)
+
+There isn't a new lock order defined in pps, that is, it's compliant with the
+Linux lock order.
+// }])>
+
+// Others about pps <([{
+A new kernel thread -- kppsd -- is introduced in mm/vmscan.c; its task is to
+execute the stages of pps periodically. Note that an appropriate timeout in
+ticks is necessary so applications get a chance to re-map their PrivatePages
+back from UnmappedPTE to PTE, that is, to show their conglomeration affinity.
+
+kppsd can be controlled by the new fields scan_control::may_reclaim/reclaim_node.
+may_reclaim = 1 means start reclamation (stage 5).  reclaim_node = (node
+number) is used when a memory node is low. The caller should set them in
+wakeup_sc, then wake up kppsd (vmscan.c:balance_pgdat). Note, if kppsd is
+started due to timeout, it doesn't do stage 5 at all (vmscan.c:kppsd). The
+other legacy fields still in use are gfp_mask, may_writepage and may_swap.
+
+PPS statistics are appended to the /proc/meminfo entry; their prototype is in
+include/linux/mm.h.
+// }])>
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is PTE
+or UnmappedPTE. Note, the Linux fork API potentially makes a PrivatePage shared
+by multiple processes, so such a page is excluded from pps.
+
+IN (NOTE, when a pure private page enters pps, it's also trimmed from the
+Linux legacy page system by bypassing the lru_cache_add_active call)
+1) fs/exec.c    install_arg_pages    (argument pages)
+2) mm/memory    do_anonymous_page, do_wp_page, do_swap_page    (page fault)
+3) mm/swap_state.c    read_swap_cache_async    (swap pages)
+
+OUT
+1) mm/vmscan.c  shrink_pvma_scan_ptes   (stage 5, reclaim a private page)
+2) mm/memory    zap_pte_range           (free a page)
+3) kernel/fork.c    dup_mmap            (if someone uses fork, migrate all pps
+   pages back to let Linux legacy page system manage them)
+
+When a pure private page is in pps, it can be visited simultaneously by
+page-fault and SwapDaemon.
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, a new flag -- VM_PURE_PRIVATE -- is OR-ed into it
+in memory.c:enter_pps; there you can also find which VMAs are suitable for pps.
+The flag is used mainly in shrink_private_vma of mm/vmscan.c.  Other fields are
+left untouched.
+
+IN.
+1) fs/exec.c    setup_arg_pages         (StackVMA)
+2) mm/mmap.c    do_mmap_pgoff, do_brk   (DataVMA)
+3) mm/mmap.c    split_vma, copy_vma     (in some cases, we need to copy a VMA
+   from an existing VMA)
+
+OUT.
+1) kernel/fork.c   dup_mmap               (if someone uses fork, return the vma
+   back to Linux legacy system)
+2) mm/mmap.c       remove_vma, vma_adjust (destroy VMA)
+3) mm/mmap.c       do_mmap_pgoff          (delete VMA when some errors occur)
+
+The VMAs of pps can coexist with madvise, mlock, mprotect, mmap and munmap,
+which is why a new VMA created by mmap.c:split_vma can re-enter pps.
+// }])>
+
+// Postscript <([{
+Note, some circumstances aren't tested due to hardware restrictions, e.g. SMP
+dftlb. So there is no guarantee for my dftlb code and EVEN my idea.
+
+Here are some possible improvements to pps
+1) In fact, I recommend a one-to-one private model -- PrivateVMA, (PTE,
+   UnmappedPTE) and (PrivatePage, DiskSwapPage) -- which is described in my OS
+   and in the Linux kernel mailing list hyperlink above. The current Linux core
+   supports a trick -- COW on PrivatePage -- which is used by the fork API; the
+   API should be used rarely, the POSIX thread library and vfork/execve are
+   enough for applications, but as a result it potentially makes PrivatePages
+   shared, so I think it's unnecessary for Linux; do copy-on-calling if someone
+   really needs it. If you agree, you will find that UnmappedPTE + PrivatePage
+   IS the swap cache of Linux, and swap_info_struct::swap_map should be a
+   bitmap rather than a (short int) map. So it's a compromise to use the Linux
+   legacy SwapCache in my pps. That's why my patch is called pps -- pure
+   private (page) system.
+2) SwapSpace should provide more flexible interfaces; shrink_pvma_scan_ptes
+   needs to allocate swap entries in batches, exactly, to allocate a batch of
+   fake contiguous swap entries, see memory.c:pps_swapin_readahead. In fact,
+   the interface should be overloaded, that is, a swap file should have a
+   different strategy than a swap partition.
+
+If the Linux kernel group can't schedule a rewrite of their memory code,
+however, pps is maybe the best solution so far.
+// }])>
+// vim: foldmarker=<([{,}])> foldmethod=marker et
Index: linux-2.6.19/fs/exec.c
===================================================================
--- linux-2.6.19.orig/fs/exec.c	2007-01-22 13:58:30.000000000 +0800
+++ linux-2.6.19/fs/exec.c	2007-01-23 11:32:30.000000000 +0800
@@ -321,10 +321,11 @@
 		pte_unmap_unlock(pte, ptl);
 		goto out;
 	}
+	atomic_inc(&pps_info.total);
+	atomic_inc(&pps_info.pte_count);
 	inc_mm_counter(mm, anon_rss);
-	lru_cache_add_active(page);
-	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
-					page, vma->vm_page_prot))));
+	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(page,
+			    vma->vm_page_prot))));
 	page_add_new_anon_rmap(page, vma, address);
 	pte_unmap_unlock(pte, ptl);

@@ -437,6 +438,7 @@
 			kmem_cache_free(vm_area_cachep, mpnt);
 			return ret;
 		}
+		enter_pps(mm, mpnt);
 		mm->stack_vm = mm->total_vm = vma_pages(mpnt);
 	}

Index: linux-2.6.19/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.19.orig/fs/proc/proc_misc.c	2007-01-22 13:58:31.000000000 +0800
+++ linux-2.6.19/fs/proc/proc_misc.c	2007-01-22 14:00:00.000000000 +0800
@@ -181,7 +181,11 @@
 		"Committed_AS: %8lu kB\n"
 		"VmallocTotal: %8lu kB\n"
 		"VmallocUsed:  %8lu kB\n"
-		"VmallocChunk: %8lu kB\n",
+		"VmallocChunk: %8lu kB\n"
+		"PPS Total:    %8d kB\n"
+		"PPS PTE:      %8d kB\n"
+		"PPS Unmapped: %8d kB\n"
+		"PPS Swapped:  %8d kB\n",
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -212,7 +216,11 @@
 		K(committed),
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
-		vmi.largest_chunk >> 10
+		vmi.largest_chunk >> 10,
+		K(pps_info.total.counter),
+		K(pps_info.pte_count.counter),
+		K(pps_info.unmapped_count.counter),
+		K(pps_info.swapped_count.counter)
 		);

 		len += hugetlb_report_meminfo(page + len);
Index: linux-2.6.19/include/asm-i386/mmu_context.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/mmu_context.h	2007-01-22 13:58:32.000000000 +0800
+++ linux-2.6.19/include/asm-i386/mmu_context.h	2007-01-23 11:43:00.000000000 +0800
@@ -32,6 +32,10 @@
 		/* stop flush ipis for the previous mm */
 		cpu_clear(cpu, prev->cpu_vm_mask);
 #ifdef CONFIG_SMP
+		// vmscan.c::end_tlb_tasks may have copied cpu_vm_mask before
+		// we leave prev, so flush prev's traces from
+		// delay_tlb_tasks.
+		timer_flush_tlb_tasks(NULL);
 		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
 		per_cpu(cpu_tlbstate, cpu).active_mm = next;
 #endif
Index: linux-2.6.19/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/pgtable-2level.h	2007-01-22 13:58:32.000000000 +0800
+++ linux-2.6.19/include/asm-i386/pgtable-2level.h	2007-01-23 12:50:09.905950872 +0800
@@ -48,21 +48,22 @@
 }

 /*
- * Bits 0, 6 and 7 are taken, split up the 29 bits of offset
+ * Bits 0, 5, 6 and 7 are taken, split up the 28 bits of offset
  * into this range:
  */
-#define PTE_FILE_MAX_BITS	29
+#define PTE_FILE_MAX_BITS	28

 #define pte_to_pgoff(pte) \
-	((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))
+	((((pte).pte_low >> 1) & 0xf ) + (((pte).pte_low >> 8) << 4 ))

 #define pgoff_to_pte(off) \
-	((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
+	((pte_t) { (((off) & 0xf) << 1) + (((off) >> 4) << 8) + _PAGE_FILE })

 /* Encode and de-code a swap entry */
-#define __swp_type(x)			(((x).val >> 1) & 0x1f)
+#define __swp_type(x)			(((x).val >> 1) & 0xf)
 #define __swp_offset(x)			((x).val >> 8)
-#define __swp_entry(type, offset)	((swp_entry_t) { ((type) << 1) | ((offset) << 8) })
+#define __swp_entry(type, offset)	((swp_entry_t) { ((type & 0xf) << 1) |\
+	((offset) << 8) | _PAGE_SWAPPED })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

Index: linux-2.6.19/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.19.orig/include/asm-i386/pgtable.h	2007-01-22 13:58:32.000000000 +0800
+++ linux-2.6.19/include/asm-i386/pgtable.h	2007-01-23 11:47:00.775687672 +0800
@@ -121,7 +121,11 @@
 #define _PAGE_UNUSED3	0x800

 /* If _PAGE_PRESENT is clear, we use these: */
-#define _PAGE_FILE	0x040	/* nonlinear file mapping, saved PTE; unset:swap */
+#define _PAGE_UNMAPPED	0x020	/* a special PTE type, holds its page reference
+				   even while it's unmapped, see more in
+				   Documentation/vm_pps.txt. */
+#define _PAGE_SWAPPED 0x040 /* swapped PTE. */
+#define _PAGE_FILE	0x060	/* nonlinear file mapping, saved PTE; */
 #define _PAGE_PROTNONE	0x080	/* if the user mapped it with PROT_NONE;
 				   pte_present gives true */
 #ifdef CONFIG_X86_PAE
@@ -227,7 +231,12 @@
 /*
  * The following only works if pte_present() is not true.
  */
-static inline int pte_file(pte_t pte)		{ return (pte).pte_low & _PAGE_FILE; }
+static inline int pte_unmapped(pte_t pte)	{ return ((pte).pte_low & 0x60)
+    == _PAGE_UNMAPPED; }
+static inline int pte_swapped(pte_t pte)	{ return ((pte).pte_low & 0x60)
+    == _PAGE_SWAPPED; }
+static inline int pte_file(pte_t pte)		{ return ((pte).pte_low & 0x60)
+    == _PAGE_FILE; }

 static inline pte_t pte_rdprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
 static inline pte_t pte_exprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
Index: linux-2.6.19/include/linux/mm.h
===================================================================
--- linux-2.6.19.orig/include/linux/mm.h	2007-01-22 13:58:34.000000000 +0800
+++ linux-2.6.19/include/linux/mm.h	2007-01-23 12:27:56.171419760 +0800
@@ -168,6 +168,9 @@
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
+#define VM_PURE_PRIVATE	0x04000000	/* The vma belongs to only one
+					   mm, see more in
+					   Documentation/vm_pps.txt */

 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -1166,5 +1169,33 @@

 __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);

+struct pps_info {
+	atomic_t total;
+	atomic_t pte_count; // stage 1 and 2.
+	atomic_t unmapped_count; // stage 3 and 4.
+	atomic_t swapped_count; // stage 6.
+};
+extern struct pps_info pps_info;
+
+/* vmscan.c::delay flush TLB */
+struct delay_tlb_task
+{
+	struct mm_struct* mm;
+	cpumask_t cpu_mask;
+	struct vm_area_struct* vma[32];
+	unsigned long start[32];
+	unsigned long end[32];
+};
+extern struct delay_tlb_task delay_tlb_tasks[32];
+
+// The prototype of the function matches the "func" parameter of "int
+// smp_call_function (void (*func) (void *info), void *info, int retry, int
+// wait);" in include/linux/smp.h of 2.6.16.29. Call it with NULL.
+void timer_flush_tlb_tasks(void* data /* = NULL */);
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma);
+void leave_pps(struct vm_area_struct* vma, int migrate_flag);
+
+#define MAX_SERIES_LENGTH 8
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux-2.6.19/include/linux/swapops.h
===================================================================
--- linux-2.6.19.orig/include/linux/swapops.h	2006-11-30 05:57:37.000000000 +0800
+++ linux-2.6.19/include/linux/swapops.h	2007-01-22 14:00:00.000000000 +0800
@@ -50,7 +50,7 @@
 {
 	swp_entry_t arch_entry;

-	BUG_ON(pte_file(pte));
+	BUG_ON(!pte_swapped(pte));
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
@@ -64,7 +64,7 @@
 	swp_entry_t arch_entry;

 	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
-	BUG_ON(pte_file(__swp_entry_to_pte(arch_entry)));
+	BUG_ON(!pte_swapped(__swp_entry_to_pte(arch_entry)));
 	return __swp_entry_to_pte(arch_entry);
 }

Index: linux-2.6.19/kernel/fork.c
===================================================================
--- linux-2.6.19.orig/kernel/fork.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/kernel/fork.c	2007-01-22 14:00:00.000000000 +0800
@@ -241,6 +241,7 @@
 		tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (!tmp)
 			goto fail_nomem;
+		leave_pps(mpnt, 1);
 		*tmp = *mpnt;
 		pol = mpol_copy(vma_policy(mpnt));
 		retval = PTR_ERR(pol);
Index: linux-2.6.19/kernel/timer.c
===================================================================
--- linux-2.6.19.orig/kernel/timer.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/kernel/timer.c	2007-01-22 14:00:00.000000000 +0800
@@ -1115,6 +1115,10 @@
 		rcu_check_callbacks(cpu, user_tick);
 	scheduler_tick();
  	run_posix_cpu_timers(p);
+
+#ifdef SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
 }

 /*
Index: linux-2.6.19/mm/fremap.c
===================================================================
--- linux-2.6.19.orig/mm/fremap.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/fremap.c	2007-01-22 14:00:00.000000000 +0800
@@ -37,7 +37,7 @@
 			page_cache_release(page);
 		}
 	} else {
-		if (!pte_file(pte))
+		if (pte_swapped(pte))
 			free_swap_and_cache(pte_to_swp_entry(pte));
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
Index: linux-2.6.19/mm/memory.c
===================================================================
--- linux-2.6.19.orig/mm/memory.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/memory.c	2007-01-23 12:47:12.000000000 +0800
@@ -435,7 +435,7 @@

 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
-		if (!pte_file(pte)) {
+		if (pte_swapped(pte)) {
 			swp_entry_t entry = pte_to_swp_entry(pte);

 			swap_duplicate(entry);
@@ -628,6 +628,9 @@
 	spinlock_t *ptl;
 	int file_rss = 0;
 	int anon_rss = 0;
+	int pps_pte = 0;
+	int pps_unmapped = 0;
+	int pps_swapped = 0;

 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -672,6 +675,13 @@
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				if (page != ZERO_PAGE(addr)) {
+					if (PageWriteback(page))
+						lru_cache_add_active(page);
+					pps_pte++;
+				}
+			}
 			if (PageAnon(page))
 				anon_rss--;
 			else {
@@ -691,12 +701,31 @@
 		 */
 		if (unlikely(details))
 			continue;
-		if (!pte_file(ptent))
+		if (pte_unmapped(ptent)) {
+			struct page *page;
+			page = pfn_to_page(pte_pfn(ptent));
+			BUG_ON(page == ZERO_PAGE(addr));
+			if (PageWriteback(page))
+				lru_cache_add_active(page);
+			pps_unmapped++;
+			ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+			tlb_remove_page(tlb, page);
+			anon_rss--;
+			continue;
+		}
+		if (pte_swapped(ptent)) {
+			if (vma->vm_flags & VM_PURE_PRIVATE)
+				pps_swapped++;
 			free_swap_and_cache(pte_to_swp_entry(ptent));
+		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

 	add_mm_rss(mm, file_rss, anon_rss);
+	atomic_sub(pps_pte + pps_unmapped, &pps_info.total);
+	atomic_sub(pps_pte, &pps_info.pte_count);
+	atomic_sub(pps_unmapped, &pps_info.unmapped_count);
+	atomic_sub(pps_swapped, &pps_info.swapped_count);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);

@@ -955,7 +984,8 @@
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
-		mark_page_accessed(page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			mark_page_accessed(page);
 	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
@@ -1606,7 +1636,12 @@
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
-		lru_cache_add_active(new_page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			lru_cache_add_active(new_page);
+		else {
+			atomic_inc(&pps_info.total);
+			atomic_inc(&pps_info.pte_count);
+		}
 		page_add_new_anon_rmap(new_page, vma, address);

 		/* Free the old page.. */
@@ -1975,6 +2010,85 @@
 }

 /*
+ * New read ahead code, mainly for VM_PURE_PRIVATE only.
+ */
+static void pps_swapin_readahead(swp_entry_t entry, unsigned long addr, struct
+	vm_area_struct *vma, pte_t* pte, pmd_t* pmd)
+{
+	struct page* page;
+	pte_t *prev, *next;
+	swp_entry_t temp;
+	spinlock_t* ptl = pte_lockptr(vma->vm_mm, pmd);
+	int swapType = swp_type(entry);
+	int swapOffset = swp_offset(entry);
+	int readahead = 1, abs;
+
+	if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+		swapin_readahead(entry, addr, vma);
+		return;
+	}
+
+	page = read_swap_cache_async(entry, vma, addr);
+	if (!page)
+		return;
+	page_cache_release(page);
+
+	// read ahead the whole series, first forward then backward.
+	while (readahead < MAX_SERIES_LENGTH) {
+		next = pte++;
+		if (next - (pte_t*) pmd >= PTRS_PER_PTE)
+			break;
+		spin_lock(ptl);
+        if (!(!pte_present(*next) && pte_swapped(*next))) {
+			spin_unlock(ptl);
+			break;
+		}
+		temp = pte_to_swp_entry(*next);
+		spin_unlock(ptl);
+		if (swp_type(temp) != swapType)
+			break;
+		abs = swp_offset(temp) - swapOffset;
+		abs = abs < 0 ? -abs : abs;
+		swapOffset = swp_offset(temp);
+		if (abs > 8)
+			// the two swap entries are too far, give up!
+			break;
+		page = read_swap_cache_async(temp, vma, addr);
+		if (!page)
+			return;
+		page_cache_release(page);
+		readahead++;
+	}
+
+	swapOffset = swp_offset(entry);
+	while (readahead < MAX_SERIES_LENGTH) {
+		prev = pte--;
+		if (prev - (pte_t*) pmd < 0)
+			break;
+		spin_lock(ptl);
+        if (!(!pte_present(*prev) && pte_swapped(*prev))) {
+			spin_unlock(ptl);
+			break;
+		}
+		temp = pte_to_swp_entry(*prev);
+		spin_unlock(ptl);
+		if (swp_type(temp) != swapType)
+			break;
+		abs = swp_offset(temp) - swapOffset;
+		abs = abs < 0 ? -abs : abs;
+		swapOffset = swp_offset(temp);
+		if (abs > 8)
+			// the two swap entries are too far, give up!
+			break;
+		page = read_swap_cache_async(temp, vma, addr);
+		if (!page)
+			return;
+		page_cache_release(page);
+		readahead++;
+	}
+}
+
+/*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -2001,7 +2115,7 @@
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		grab_swap_token(); /* Contend for token _before_ read-in */
- 		swapin_readahead(entry, address, vma);
+		pps_swapin_readahead(entry, address, vma, page_table, pmd);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
@@ -2021,7 +2135,8 @@
 	}

 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
-	mark_page_accessed(page);
+	if (!(vma->vm_flags & VM_PURE_PRIVATE))
+		mark_page_accessed(page);
 	lock_page(page);

 	/*
@@ -2033,6 +2148,10 @@

 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
+		if (vma->vm_flags & VM_PURE_PRIVATE) {
+			lru_cache_add_active(page);
+			mark_page_accessed(page);
+		}
 		goto out_nomap;
 	}

@@ -2053,6 +2172,11 @@
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);
 	unlock_page(page);
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		atomic_dec(&pps_info.swapped_count);
+		atomic_inc(&pps_info.total);
+		atomic_inc(&pps_info.pte_count);
+	}

 	if (write_access) {
 		if (do_wp_page(mm, vma, address,
@@ -2104,8 +2228,13 @@
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			lru_cache_add_active(page);
+		else {
+			atomic_inc(&pps_info.total);
+			atomic_inc(&pps_info.pte_count);
+		}
 		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
 		page_add_new_anon_rmap(page, vma, address);
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
@@ -2392,6 +2521,22 @@

 	old_entry = entry = *pte;
 	if (!pte_present(entry)) {
+		if (pte_unmapped(entry)) {
+			BUG_ON(!(vma->vm_flags & VM_PURE_PRIVATE));
+			atomic_dec(&pps_info.unmapped_count);
+			atomic_inc(&pps_info.pte_count);
+			struct page* page = pte_page(entry);
+			pte_t temp_pte = mk_pte(page, vma->vm_page_prot);
+			pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+			if (unlikely(pte_same(*pte, entry))) {
+				page_add_new_anon_rmap(page, vma, address);
+				set_pte_at(mm, address, pte, temp_pte);
+				update_mmu_cache(vma, address, temp_pte);
+				lazy_mmu_prot_update(temp_pte);
+			}
+			pte_unmap_unlock(pte, ptl);
+			return VM_FAULT_MINOR;
+		}
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (vma->vm_ops->nopage)
@@ -2685,3 +2830,118 @@

 	return buf - old_buf;
 }
+
+static void migrate_back_pte_range(struct mm_struct* mm, pmd_t *pmd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	struct page* page;
+	pte_t entry;
+	pte_t *pte;
+	spinlock_t* ptl;
+	int pps_pte = 0;
+	int pps_unmapped = 0;
+	int pps_swapped = 0;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	do {
+		if (!pte_present(*pte) && pte_unmapped(*pte)) {
+			page = pte_page(*pte);
+			entry = mk_pte(page, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			set_pte_at(mm, addr, pte, entry);
+			BUG_ON(page == ZERO_PAGE(addr));
+			page_add_new_anon_rmap(page, vma, addr);
+			lru_cache_add_active(page);
+			pps_unmapped++;
+		} else if (pte_present(*pte)) {
+			page = pte_page(*pte);
+			if (page == ZERO_PAGE(addr))
+				continue;
+			lru_cache_add_active(page);
+			pps_pte++;
+		} else if (!pte_present(*pte) && pte_swapped(*pte))
+			pps_swapped++;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap_unlock(pte - 1, ptl);
+	lru_add_drain();
+	atomic_sub(pps_pte + pps_unmapped, &pps_info.total);
+	atomic_sub(pps_pte, &pps_info.pte_count);
+	atomic_sub(pps_unmapped, &pps_info.unmapped_count);
+	atomic_sub(pps_swapped, &pps_info.swapped_count);
+}
+
+static void migrate_back_pmd_range(struct mm_struct* mm, pud_t *pud, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		migrate_back_pte_range(mm, pmd, vma, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void migrate_back_pud_range(struct mm_struct* mm, pgd_t *pgd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		migrate_back_pmd_range(mm, pud, vma, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+// migrate all pages of pure private vma back to Linux legacy memory management.
+static void migrate_back_legacy_linux(struct mm_struct* mm, struct vm_area_struct* vma)
+{
+	pgd_t* pgd;
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		migrate_back_pud_range(mm, pgd, vma, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma)
+{
+	int condition = VM_READ | VM_WRITE | VM_EXEC | \
+		 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \
+		 VM_GROWSDOWN | VM_GROWSUP | \
+		 VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | \
+		 VM_ACCOUNT | VM_PURE_PRIVATE;
+	if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) {
+		vma->vm_flags |= VM_PURE_PRIVATE;
+		if (list_empty(&mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&mm->mmlist))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+	}
+}
+
+void leave_pps(struct vm_area_struct* vma, int migrate_flag)
+{
+	struct mm_struct* mm = vma->vm_mm;
+
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		vma->vm_flags &= ~VM_PURE_PRIVATE;
+		if (migrate_flag)
+			migrate_back_legacy_linux(mm, vma);
+	}
+}
Index: linux-2.6.19/mm/mmap.c
===================================================================
--- linux-2.6.19.orig/mm/mmap.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/mmap.c	2007-01-22 14:00:00.000000000 +0800
@@ -229,6 +229,7 @@
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_free(vma_policy(vma));
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
 }
@@ -620,6 +621,7 @@
 			fput(file);
 		mm->map_count--;
 		mpol_free(vma_policy(next));
+		leave_pps(next, 0);
 		kmem_cache_free(vm_area_cachep, next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
@@ -1112,6 +1114,8 @@
 	if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
 		vma->vm_flags &= ~VM_ACCOUNT;

+	enter_pps(mm, vma);
+
 	/* Can addr have changed??
 	 *
 	 * Answer: Yes, several device drivers can do it in their
@@ -1138,6 +1142,7 @@
 			fput(file);
 		}
 		mpol_free(vma_policy(vma));
+		leave_pps(vma, 0);
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1165,6 +1170,7 @@
 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 free_vma:
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
@@ -1742,6 +1748,10 @@

 	/* most fields are the same, copy all, and then fixup */
 	*new = *vma;
+	if (new->vm_flags & VM_PURE_PRIVATE) {
+		new->vm_flags &= ~VM_PURE_PRIVATE;
+		enter_pps(mm, new);
+	}

 	if (new_below)
 		new->vm_end = addr;
@@ -1950,6 +1960,7 @@
 	vma->vm_flags = flags;
 	vma->vm_page_prot = protection_map[flags &
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	enter_pps(mm, vma);
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -2073,6 +2084,10 @@
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
+			if (new_vma->vm_flags & VM_PURE_PRIVATE) {
+				new_vma->vm_flags &= ~VM_PURE_PRIVATE;
+				enter_pps(mm, new_vma);
+			}
 			vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		}
 	}
Index: linux-2.6.19/mm/rmap.c
===================================================================
--- linux-2.6.19.orig/mm/rmap.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/rmap.c	2007-01-22 14:00:00.000000000 +0800
@@ -618,6 +618,7 @@
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;

+	BUG_ON(vma->vm_flags & VM_PURE_PRIVATE);
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
 		goto out;
@@ -676,7 +677,7 @@
 #endif
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
-		BUG_ON(pte_file(*pte));
+		BUG_ON(!pte_swapped(*pte));
 	} else
 #ifdef CONFIG_MIGRATION
 	if (migration) {
Index: linux-2.6.19/mm/swap_state.c
===================================================================
--- linux-2.6.19.orig/mm/swap_state.c	2006-11-30 05:57:37.000000000 +0800
+++ linux-2.6.19/mm/swap_state.c	2007-01-22 14:00:00.000000000 +0800
@@ -354,7 +354,8 @@
 			/*
 			 * Initiate read into locked page and return.
 			 */
-			lru_cache_add_active(new_page);
+			if (vma == NULL || !(vma->vm_flags & VM_PURE_PRIVATE))
+				lru_cache_add_active(new_page);
 			swap_readpage(NULL, new_page);
 			return new_page;
 		}
Index: linux-2.6.19/mm/swapfile.c
===================================================================
--- linux-2.6.19.orig/mm/swapfile.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/swapfile.c	2007-01-23 12:31:38.000000000 +0800
@@ -501,6 +501,166 @@
 }
 #endif

+static int pps_test_swap_type(struct mm_struct* mm, pmd_t* pmd, pte_t* pte, int
+		type, struct page** ret_page)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	swp_entry_t entry;
+	struct page* page;
+
+	spin_lock(ptl);
+	if (!pte_present(*pte) && pte_swapped(*pte)) {
+		entry = pte_to_swp_entry(*pte);
+		if (swp_type(entry) == type) {
+			*ret_page = NULL;
+			spin_unlock(ptl);
+			return 1;
+		}
+	} else {
+		page = pfn_to_page(pte_pfn(*pte));
+		if (PageSwapCache(page)) {
+			entry.val = page_private(page);
+			if (swp_type(entry) == type) {
+				page_cache_get(page);
+				*ret_page = page;
+				spin_unlock(ptl);
+				return 1;
+			}
+		}
+	}
+	spin_unlock(ptl);
+	return 0;
+}
+
+static int pps_swapoff_scan_ptes(struct mm_struct* mm, struct vm_area_struct*
+		vma, pmd_t* pmd, unsigned long addr, unsigned long end, int type)
+{
+	pte_t *pte;
+	struct page* page;
+
+	pte = pte_offset_map(pmd, addr);
+	do {
+		while (pps_test_swap_type(mm, pmd, pte, type, &page)) {
+			if (page == NULL) {
+				switch (__handle_mm_fault(mm, vma, addr, 0)) {
+				case VM_FAULT_SIGBUS:
+				case VM_FAULT_OOM:
+					return -ENOMEM;
+				case VM_FAULT_MINOR:
+				case VM_FAULT_MAJOR:
+					break;
+				default:
+					BUG();
+				}
+			} else {
+				wait_on_page_locked(page);
+				wait_on_page_writeback(page);
+				lock_page(page);
+				if (!PageSwapCache(page)) {
+					unlock_page(page);
+					page_cache_release(page);
+					break;
+				}
+				wait_on_page_writeback(page);
+				delete_from_swap_cache(page);
+				unlock_page(page);
+				page_cache_release(page);
+				break;
+			}
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pmd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pud_t* pud, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		ret = pps_swapoff_scan_ptes(mm, vma, pmd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pmd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pud_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pgd_t* pgd, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		ret = pps_swapoff_pmd_range(mm, vma, pud, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pud++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pgd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, int type)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	int ret;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		ret = pps_swapoff_pud_range(mm, vma, pgd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pgd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff(int type)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+	int ret = 0;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+			if (!(vma->vm_flags & VM_PURE_PRIVATE))
+				continue;
+			if (vma->vm_flags & VM_LOCKED)
+				continue;
+			ret = pps_swapoff_pgd_range(mm, vma, type);
+			if (ret == -ENOMEM)
+				break;
+		}
+		up_read(&mm->mmap_sem);
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+	return ret;
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -694,6 +854,12 @@
 	int reset_overflow = 0;
 	int shmem;

+	// Let's first read all pps pages back! Note, it's one-to-one mapping.
+	retval = pps_swapoff(type);
+	if (retval == -ENOMEM) // something was wrong.
+		return -ENOMEM;
+	// Now, the remain pages are shared pages, go ahead!
+
 	/*
 	 * When searching mms for an entry, a good strategy is to
 	 * start at the first mm we freed the previous entry from
@@ -914,16 +1080,20 @@
  */
 static void drain_mmlist(void)
 {
-	struct list_head *p, *next;
+	// struct list_head *p, *next;
 	unsigned int i;

 	for (i = 0; i < nr_swapfiles; i++)
 		if (swap_info[i].inuse_pages)
 			return;
+	/*
+	 * Now init_mm.mmlist is not only used by the SwapDevice but also
+	 * by PPS, see Documentation/vm_pps.txt.
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
 		list_del_init(p);
 	spin_unlock(&mmlist_lock);
+	*/
 }

 /*
Index: linux-2.6.19/mm/vmscan.c
===================================================================
--- linux-2.6.19.orig/mm/vmscan.c	2007-01-22 13:58:36.000000000 +0800
+++ linux-2.6.19/mm/vmscan.c	2007-01-23 12:39:48.000000000 +0800
@@ -66,6 +66,10 @@
 	int swappiness;

 	int all_unreclaimable;
+
+	/* pps control command. See Documentation/vm_pps.txt. */
+	int may_reclaim;
+	int reclaim_node;
 };

 /*
@@ -1097,6 +1101,443 @@
 	return ret;
 }

+// pps fields.
+static wait_queue_head_t kppsd_wait;
+static struct scan_control wakeup_sc;
+struct pps_info pps_info = {
+	.total = ATOMIC_INIT(0),
+	.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
+	.unmapped_count = ATOMIC_INIT(0), // stage 3 and 4.
+	.swapped_count = ATOMIC_INIT(0) // stage 6.
+};
+// pps end.
+
+struct series_t {
+	pte_t orig_ptes[MAX_SERIES_LENGTH];
+	pte_t* ptes[MAX_SERIES_LENGTH];
+	struct page* pages[MAX_SERIES_LENGTH];
+	int series_length;
+	int series_stage;
+} series;
+
+static int get_series_stage(pte_t* pte, int index)
+{
+	series.orig_ptes[index] = *pte;
+	series.ptes[index] = pte;
+	if (pte_present(series.orig_ptes[index])) {
+		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+		series.pages[index] = page;
+		if (page == ZERO_PAGE(addr)) // reserved page is exclusive from us.
+			return 7;
+		if (pte_young(series.orig_ptes[index])) {
+			return 1;
+		} else
+			return 2;
+	} else if (pte_unmapped(series.orig_ptes[index])) {
+		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
+		series.pages[index] = page;
+		if (!PageSwapCache(page))
+			return 3;
+		else {
+			if (PageWriteback(page) || PageDirty(page))
+				return 4;
+			else
+				return 5;
+		}
+	} else // pte_swapped -- SwappedPTE
+		return 6;
+}
+
+static void find_series(pte_t** start, unsigned long* addr, unsigned long end)
+{
+	int i;
+	int series_stage = get_series_stage((*start)++, 0);
+	*addr += PAGE_SIZE;
+
+	for (i = 1; i < MAX_SERIES_LENGTH && *addr < end; i++, (*start)++,
+		*addr += PAGE_SIZE) {
+		if (series_stage != get_series_stage(*start, i))
+			break;
+	}
+	series.series_stage = series_stage;
+	series.series_length = i;
+}
+
+struct delay_tlb_task delay_tlb_tasks[32] = { [0 ... 31] = {0} };
+
+void timer_flush_tlb_tasks(void* data)
+{
+	int i;
+#ifdef CONFIG_X86
+	int flag = 0;
+#endif
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL &&
+				cpu_isset(smp_processor_id(),
+				    delay_tlb_tasks[i].mm->cpu_vm_mask) &&
+				cpu_isset(smp_processor_id(),
+				    delay_tlb_tasks[i].cpu_mask)) {
+#ifdef CONFIG_X86
+			flag = 1;
+#else
+			// smp::local_flush_tlb_range(delay_tlb_tasks[i]);
+#endif
+			cpu_clear(smp_processor_id(), delay_tlb_tasks[i].cpu_mask);
+		}
+	}
+#ifdef CONFIG_X86
+	if (flag)
+		local_flush_tlb();
+#endif
+}
+
+static struct delay_tlb_task* delay_task = NULL;
+static int vma_index = 0;
+
+static struct delay_tlb_task* search_free_tlb_tasks_slot(void)
+{
+	struct delay_tlb_task* ret = NULL;
+	int i;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm != NULL) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+				ret = &delay_tlb_tasks[i];
+			}
+		} else
+			ret = &delay_tlb_tasks[i];
+	}
+	if (!ret) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	return ret;
+}
+
+static void init_delay_task(struct mm_struct* mm)
+{
+	cpus_clear(delay_task->cpu_mask);
+	vma_index = 0;
+	delay_task->mm = mm;
+}
+
+/*
+ * We will be working on the mm, so let's force to flush it if necessary.
+ */
+static void start_tlb_tasks(struct mm_struct* mm)
+{
+	int i, flag = 0;
+again:
+	for (i = 0; i < 32; i++) {
+		if (delay_tlb_tasks[i].mm == mm) {
+			if (cpus_empty(delay_tlb_tasks[i].cpu_mask)) {
+				mmput(delay_tlb_tasks[i].mm);
+				delay_tlb_tasks[i].mm = NULL;
+			} else
+				flag = 1;
+		}
+	}
+	if (flag) { // Force flush TLBs.
+		on_each_cpu(timer_flush_tlb_tasks, NULL, 0, 1);
+		goto again;
+	}
+	BUG_ON(delay_task != NULL);
+	delay_task = search_free_tlb_tasks_slot();
+	init_delay_task(mm);
+}
+
+static void end_tlb_tasks(void)
+{
+	atomic_inc(&delay_task->mm->mm_users);
+	delay_task->cpu_mask = delay_task->mm->cpu_vm_mask;
+	delay_task = NULL;
+#ifndef CONFIG_SMP
+	timer_flush_tlb_tasks(NULL);
+#endif
+}
+
+static void fill_in_tlb_tasks(struct vm_area_struct* vma, unsigned long addr,
+		unsigned long end)
+{
+	struct mm_struct* mm;
+	// First, try to combine the task with the previous.
+	if (vma_index != 0 && delay_task->vma[vma_index - 1] == vma &&
+			delay_task->end[vma_index - 1] == addr) {
+		delay_task->end[vma_index - 1] = end;
+		return;
+	}
+fill_it:
+	if (vma_index != 32) {
+		delay_task->vma[vma_index] = vma;
+		delay_task->start[vma_index] = addr;
+		delay_task->end[vma_index] = end;
+		vma_index++;
+		return;
+	}
+	mm = delay_task->mm;
+	end_tlb_tasks();
+
+	delay_task = search_free_tlb_tasks_slot();
+	init_delay_task(mm);
+	goto fill_it;
+}
+
+static void shrink_pvma_scan_ptes(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned long addr,
+		unsigned long end)
+{
+	int i, statistic;
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	pte_t* pte = pte_offset_map(pmd, addr);
+	int anon_rss = 0;
+	struct pagevec freed_pvec;
+	int may_enter_fs = (sc->gfp_mask & (__GFP_FS | __GFP_IO));
+	struct address_space* mapping = &swapper_space;
+
+	pagevec_init(&freed_pvec, 1);
+	do {
+		memset(&series, 0, sizeof(struct series_t));
+		find_series(&pte, &addr, end);
+		if (sc->may_reclaim == 0 && series.series_stage == 5)
+			continue;
+		switch (series.series_stage) {
+		case 1: // PTE -- untouched PTE.
+		for (i = 0; i < series.series_length; i++) {
+			struct page* page = series.pages[i];
+			lock_page(page);
+			spin_lock(ptl);
+			if (unlikely(pte_same(*series.ptes[i],
+					series.orig_ptes[i]))) {
+				if (pte_dirty(*series.ptes[i]))
+				    set_page_dirty(page);
+				set_pte_at(mm, addr + i * PAGE_SIZE,
+					series.ptes[i],
+					pte_mkold(pte_mkclean(*series.ptes[i])));
+			}
+			spin_unlock(ptl);
+			unlock_page(page);
+		}
+		fill_in_tlb_tasks(vma, addr, addr + (PAGE_SIZE *
+			    series.series_length));
+		break;
+		case 2: // untouched PTE -- UnmappedPTE.
+		/*
+		 * Note in stage 1, we've flushed TLB in fill_in_tlb_tasks, so
+		 * if it's still clear here, we can shift it to Unmapped type.
+		 *
+		 * If some architecture doesn't support atomic cmpxchg
+		 * instruction or can't atomically set the access bit after
+		 * they touch a pte at first, combine stage 1 with stage 2, and
+		 * send IPI immediately in fill_in_tlb_tasks.
+		 */
+		spin_lock(ptl);
+		statistic = 0;
+		for (i = 0; i < series.series_length; i++) {
+			if (unlikely(pte_same(*series.ptes[i],
+					series.orig_ptes[i]))) {
+				pte_t pte_unmapped = series.orig_ptes[i];
+				pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+				pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+				if (cmpxchg(&series.ptes[i]->pte_low,
+					    series.orig_ptes[i].pte_low,
+					    pte_unmapped.pte_low) !=
+					series.orig_ptes[i].pte_low)
+					continue;
+				page_remove_rmap(series.pages[i], vma);
+				anon_rss--;
+				statistic++;
+			}
+		}
+		atomic_add(statistic, &pps_info.unmapped_count);
+		atomic_sub(statistic, &pps_info.pte_count);
+		spin_unlock(ptl);
+		break;
+		case 3: // Attach SwapPage to PrivatePage.
+		/*
+		 * A better algorithm should be applied to the Linux SwapDevice
+		 * to allocate fake contiguous SwapPages which are close to each
+		 * other; the offset between two close SwapPages is less than 8.
+		 */
+		if (sc->may_swap) {
+			for (i = 0; i < series.series_length; i++) {
+				lock_page(series.pages[i]);
+				if (!PageSwapCache(series.pages[i])) {
+					if (!add_to_swap(series.pages[i],
+						    GFP_ATOMIC)) {
+						unlock_page(series.pages[i]);
+						break;
+					}
+				}
+				unlock_page(series.pages[i]);
+			}
+		}
+		break;
+		case 4: // SwapPage isn't consistent with PrivatePage.
+		/*
+		 * A mini version pageout().
+		 *
+		 * Current swap space can't commit multiple pages together:(
+		 */
+		if (sc->may_writepage && may_enter_fs) {
+			for (i = 0; i < series.series_length; i++) {
+				struct page* page = series.pages[i];
+				int res;
+
+				if (!may_write_to_queue(mapping->backing_dev_info))
+					break;
+				lock_page(page);
+				if (!PageDirty(page) || PageWriteback(page)) {
+					unlock_page(page);
+					continue;
+				}
+				clear_page_dirty_for_io(page);
+				struct writeback_control wbc = {
+					.sync_mode = WB_SYNC_NONE,
+					.nr_to_write = SWAP_CLUSTER_MAX,
+					.nonblocking = 1,
+					.for_reclaim = 1,
+				};
+				page_cache_get(page);
+				SetPageReclaim(page);
+				res = swap_writepage(page, &wbc);
+				if (res < 0) {
+					handle_write_error(mapping, page, res);
+					ClearPageReclaim(page);
+					page_cache_release(page);
+					break;
+				}
+				if (!PageWriteback(page))
+					ClearPageReclaim(page);
+				page_cache_release(page);
+			}
+		}
+		break;
+		case 5: // UnmappedPTE -- SwappedPTE, reclaim PrivatePage.
+		statistic = 0;
+		for (i = 0; i < series.series_length; i++) {
+			struct page* page = series.pages[i];
+			if (!(page_to_nid(page) == sc->reclaim_node ||
+				    sc->reclaim_node == -1))
+				continue;
+
+			lock_page(page);
+			spin_lock(ptl);
+			if (!pte_same(*series.ptes[i], series.orig_ptes[i]) ||
+					/* We're racing with get_user_pages. */
+					PageSwapCache(page) ?  page_count(page)
+					> 2 : page_count(page) > 1) {
+				spin_unlock(ptl);
+				unlock_page(page);
+				continue;
+			}
+			statistic++;
+			swp_entry_t entry = { .val = page_private(page) };
+			swap_duplicate(entry);
+			pte_t pte_swp = swp_entry_to_pte(entry);
+			set_pte_at(mm, addr + i * PAGE_SIZE,
+				series.ptes[i], pte_swp);
+			spin_unlock(ptl);
+			if (PageSwapCache(page) && !PageWriteback(page))
+				delete_from_swap_cache(page);
+			unlock_page(page);
+
+			if (!pagevec_add(&freed_pvec, page))
+				__pagevec_release_nonlru(&freed_pvec);
+		}
+		atomic_add(statistic, &pps_info.swapped_count);
+		atomic_sub(statistic, &pps_info.unmapped_count);
+		atomic_sub(statistic, &pps_info.total);
+		break;
+		case 6:
+		// NULL operation!
+		break;
+		}
+	} while (addr < end);
+	add_mm_counter(mm, anon_rss, anon_rss);
+	if (pagevec_count(&freed_pvec))
+		__pagevec_release_nonlru(&freed_pvec);
+}
+
+static void shrink_pvma_pmd_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pud_t* pud, unsigned long addr,
+		unsigned long end)
+{
+	unsigned long next;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pud_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned long addr,
+		unsigned long end)
+{
+	unsigned long next;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+static void shrink_pvma_pgd_range(struct scan_control* sc, struct mm_struct*
+		mm, struct vm_area_struct* vma)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+static void shrink_private_vma(struct scan_control* sc)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		start_tlb_tasks(mm);
+		if (down_read_trylock(&mm->mmap_sem)) {
+			for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+				if (!(vma->vm_flags & VM_PURE_PRIVATE))
+					continue;
+				if (vma->vm_flags & VM_LOCKED)
+					continue;
+				shrink_pvma_pgd_range(sc, mm, vma);
+			}
+			up_read(&mm->mmap_sem);
+		}
+		end_tlb_tasks();
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
@@ -1144,6 +1585,11 @@
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);

+	wakeup_sc = sc;
+	wakeup_sc.may_reclaim = 1;
+	wakeup_sc.reclaim_node = pgdat->node_id;
+	wake_up_interruptible(&kppsd_wait);
+
 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;

@@ -1723,3 +2169,39 @@
 	return __zone_reclaim(zone, gfp_mask, order);
 }
 #endif
+
+static int kppsd(void* p)
+{
+	struct task_struct *tsk = current;
+	int timeout;
+	DEFINE_WAIT(wait);
+	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	struct scan_control default_sc;
+	default_sc.gfp_mask = GFP_KERNEL;
+	default_sc.may_writepage = 1;
+	default_sc.may_swap = 1;
+	default_sc.may_reclaim = 0;
+	default_sc.reclaim_node = -1;
+
+	while (1) {
+		try_to_freeze();
+		prepare_to_wait(&kppsd_wait, &wait, TASK_INTERRUPTIBLE);
+		timeout = schedule_timeout(2000);
+		finish_wait(&kppsd_wait, &wait);
+
+		if (timeout)
+			shrink_private_vma(&wakeup_sc);
+		else
+			shrink_private_vma(&default_sc);
+	}
+	return 0;
+}
+
+static int __init kppsd_init(void)
+{
+	init_waitqueue_head(&kppsd_wait);
+	kthread_run(kppsd, NULL, "kppsd");
+	return 0;
+}
+
+module_init(kppsd_init)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-01-23  5:08     ` yunfeng zhang
@ 2007-01-24 21:15       ` Hugh Dickins
  2007-01-26  4:58         ` yunfeng zhang
                           ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Hugh Dickins @ 2007-01-24 21:15 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: Pavel Machek, linux-kernel

On Tue, 23 Jan 2007, yunfeng zhang wrote:
> re-code my patch, tab = 8. Sorry!

Please stop resending this patch until you can attend to the advice
you've been given: Pavel made several very useful remarks on Monday:

> No, this is not the way to submit major rewrite of swap subsystem.
> 
> You need to (at minimum, making fundamental changes _is_ hard):
> 
> 1) Fix your mailer not to wordwrap.
> 
> 2) Get some testing. Identify workloads it improves.
> 
> 3) Get some _external_ testing. You are retransmitting wordwrapped
> patch. That means no one other than you is actually using it.
> 
> I thought I told you to read the CodingStyle in some previous mail?

Another piece of advice would be to stop mailing it to linux-kernel,
and direct it to linux-mm instead, where specialists might be more
attentive.  But don't bother if you cannot follow Pavel's advice.

The only improvement I notice is that you are now sending a patch
against a recently developed kernel, 2.6.20-rc5, rather than the
2.6.16.29 you were offering earlier in the month: good, thank you.

I haven't done anything at all with Tuesday's recode tab=8 version,
I get too sick of "malformed patch" if ever I try to apply your mail
(it's probably related to the "format=flowed" in your mailer), and
don't usually have time to spend fixing them up.

But I did make the effort to reform it once before,
and again with Monday's version, the one that says:

> 4) It simplifies Linux memory model dramatically.

You have an interesting idea of "simplifies", given
 16 files changed, 997 insertions(+), 25 deletions(-)
(omitting your Documentation), and over 7k more code.
You'll have to be much more persuasive (with good performance
results) to get us to welcome your added layer of complexity.

Please make an effort to support at least i386 3level pagetables:
you don't actually need >4GB of memory to test CONFIG_HIGHMEM64G.
HIGHMEM testing shows you're missing a couple of pte_unmap()s,
in pps_swapoff_scan_ptes() and in shrink_pvma_scan_ptes().

It would be nice if you could support at least x86_64 too
(you have pte_low code peculiar to i386 in vmscan.c, which is
preventing that), but that's harder if you don't have the hardware.

But I have to admit, I've not been trying your patch because I
support it and want to see it in: the reverse, I've been trying
it because I want quickly to check whether it's something we
need to pay attention to and spend time on, hoping to rule
it out and turn to other matters instead.

And so far I've been (from that point of view) very pleased:
the first tests I ran went about 50% slower; but since they
involve tmpfs (and I suspect you've not considered the tmpfs
use of swap at all) that seemed a bit unfair, so I switched
to running the simplest memhog kind of tests (you know, in
a 512MB machine with plenty of swap, try to malloc and touch
600MB in rotation: I imagine should suit your design optimally):
quickly killed Out Of Memory.  Tried running multiple hogs for
smaller amounts (maybe one holds a lock you're needing to free
memory), but the same OOMs.  Ended up just doing builds on disk
with limited memory and 100% swappiness: consistently around
50% slower (elapsed time, also user time, also system time).

I've not reviewed the code at all, that would need a lot more
time.  But I get the strong impression that you're imposing on
Linux 2.6 ideas that seem obvious to you, without finding out
whether they really fit in and get good results.

I expect you'd be able to contribute much more if you spent a
while studying the behaviour of Linux swapping, and made incremental
tweaks to improve that (e.g. changing its swap allocation strategy),
rather than coming in with some preconceived plan.

Hugh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-01-24 21:15       ` Hugh Dickins
@ 2007-01-26  4:58         ` yunfeng zhang
  2007-01-29  5:29           ` yunfeng zhang
       [not found]         ` <4df04b840701301852i41687edfl1462c4ca3344431c@mail.gmail.com>
  2 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-01-26  4:58 UTC (permalink / raw)
  To: linux-kernel; +Cc: Hugh Dickins, Al Boldi

The current test is based on the observation below from my previous mail:

> Current Linux page allocation fairly provides pages to every process; since
> the swap daemon is only started when memory is low, by the time it starts to
> scan active_list the private pages of different processes are already mixed
> up with each other. vmscan.c:shrink_list() is the only place that attaches a
> disk swap page to a page on active_list, and as a result all private pages
> lose their affinity on the swap partition. I will give a test later...
>

Three test cases are simulated here
1) matrix: Some software does a lot of matrix arithmetic in its PrivateVMA; in
   fact, this access pattern suits pps much better than legacy Linux.
2) c: C malloc uses an algorithm much like slab, so when an application resumes
   from the swap partition, suppose it touches three variables whose sizes
   differ; as a result it touches three pages that are close to each other but
   not contiguous. I also simulate the case where the application then starts
   full-speed running (touching more pages).
3) destruction: Typically, if an application resumes because the user clicked
   the close button, it visits all of its private data to run object
   destruction.

Test stepping
1) Run ./entry and say y to it; root privileges may be needed.
2) Wait a moment until it echoes 'primarywait'.
3) swapoff -a && swapon -a.
4) ./hog until count = 10.
5) 'cat primary entry secondary > /dev/null'
6) 'cat /proc/vmstat' several times and record the 'pswpin' field once it is stable.
7) Type `1', `2' or `3' to pick one of the three test cases; answer `2' later to
   start the fullspeed test case.
8) Record the new 'pswpin' field.
9) Which is better? See the 'pswpin' increment.
pswpin is increased in mm/page_io.c:swap_readpage; a small helper for reading it
is sketched below.
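
For steps 6 and 8, a throwaway helper like the sketch below (an illustration,
not part of the patch) can pull the pswpin counter out of /proc/vmstat so that
two readings can be diffed:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char key[64];
	unsigned long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (f == NULL) {
		perror("/proc/vmstat");
		return 1;
	}
	/* /proc/vmstat is a list of "name value" lines; pick out pswpin. */
	while (fscanf(f, "%63s %lu", key, &val) == 2) {
		if (strcmp(key, "pswpin") == 0)
			printf("pswpin %lu\n", val);
	}
	fclose(f);
	return 0;
}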

Test stepping purposes
1) Step 1: 'entry' wakes up 'primary' and 'secondary' simultaneously; every time
   'primary' allocates a page, 'secondary' inserts some pages into active_list
   close to it.
2) Step 3: we should re-allocate swap pages.
3) Step 4: flush 'entry primary secondary' to the swap partition.
4) Step 5: make the file content of 'entry primary secondary' present in memory.

The test cases were run in a VMware 5.5 virtual machine with 32M of memory. If
you question my setup, run your own test cases following these steps
1) Run multiple memory consumers together and make them pause at a point
   (so all their private pages get mixed up in the active list).
2) Flush them to the swap partition.
3) Wake up one of them, let it run full-speed for a while, and record pswpin
   from /proc/vmstat.
4) Invalidate all readahead pages.
5) Wake up another one and repeat the test.
6) It's also telling if you can record the hard-disk LED blinking:)
Maybe your test resumes all memory consumers together, so Linux readaheads some
pages which are close to the page-fault page but belong to other processes, I
think.

By the way, what is the linux-mm mailing list address? It isn't in Documentation/SubmittingPatches.

In fact, you will find that in the Linux column the hard-disk LED blinks every
5 seconds.

pswpin readings before and after the test case; the increment is in parentheses.
-----------------------------
             Linux     pps
matrix        5241    1597
              5322    1620
               (81)    (23)

c             8028    1937
              8095    1954
fullspeed     8313    1964
               (67)    (17)
              (218)    (10)

destruction   9461    4445
              9825    4484
              (364)    (39)

Comment out the memset() call in secondary.c, so 'secondary' won't interrupt
page allocation in 'primary'.
-----------------------------
             Linux     pps
matrix         207      38
               256      59
               (49)    (21)

c             1273     347
              1341     383
fullspeed     1362     386
               (68)    (36)
               (21)     (3)

destruction   2435    1178
              2513    1246
               (78)    (68)

entry.c
-----------------------------
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

pid_t pids[4];
int sem_set;
siginfo_t si;

int main(int argc, char **argv)
{
	int i, data;
	unsigned short init_data[4] = { 1, 1, 1, 1 };
	if ((sem_set = semget(123321, 4, IPC_CREAT)) == -1)
		goto failed;
	if (semctl(sem_set, 0, SETALL, &init_data) == -1)
		goto failed;
	pid_t pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./primary", NULL) == -1)
			goto failed;
	} else {
		pids[0] = pid;
	}
	pid = vfork();
	if (pid == -1) {
		goto failed;
	} else if (pid == 0) {
		if (execlp("./secondary", NULL) == -1)
			goto failed;
	} else {
		pids[1] = pid;
	}
	printf("entry:continue?\n");
	getchar();
	init_data[0] = init_data[1] = 0;
	if (semctl(sem_set, 0, SETALL, &init_data) == -1)
		goto failed;
	sleep(30000);
	exit(EXIT_SUCCESS);
failed:
	perror("entry errno:");
	exit(EXIT_FAILURE);
}

primary.c
-----------------------------
#include <sys/mman.h>
#include <signal.h>
#include <sys/sem.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int cont = 0;
void* addr;
int VMA_SIZE = 32;

void init_mmap()
{
	int i, j, a, b;
	addr = mmap(NULL, 4096 * VMA_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE
		| MAP_ANON, 0, 0);
	printf("primary:addr:%x\n", (unsigned long) addr);
	if (addr == MAP_FAILED)
		goto failed;
	for (i = 0; i < VMA_SIZE; i++) {
		memset((unsigned char*) addr + i * 4096, i, 1);
		sleep(2);
	}
	return;
failed:
	perror("primary errno:");
	exit(EXIT_FAILURE);
}

#define TOUCH_IT(value) \
	if (*((unsigned char*) addr + value * 4096) != value) \
		printf("BUG, memory corrupt!%d\n", value);
#define WEAK_RANDOM(from, to) \
	range = to - from; \
	range_temp = rand() % range; \
	TOUCH_IT(range_temp)
#define STRONG_RANDOM \
	WEAK_RANDOM(0, VMA_SIZE)

void fullspeed_case()
{
	int i, a;
	for (i = 0; i < VMA_SIZE / 2; i++) {
		a = rand() % VMA_SIZE;
		TOUCH_IT(i)
		sleep(5);
	}
}

void c_case()
{
	int from, to, i, a, range, range_temp;
	for (i = 0; i < 3; i++) {
		from = rand() % VMA_SIZE;
		if (from + 8 >= VMA_SIZE)
			from = VMA_SIZE - 8;
		to = from + 8;
		WEAK_RANDOM(from, to)
		sleep(5);
	}
	STRONG_RANDOM
	sleep(5);
	STRONG_RANDOM
}

void destruction_case()
{
	int i, a;
	for (i = 0; i < VMA_SIZE; i++) {
		a = rand() % VMA_SIZE;
		TOUCH_IT(i)
		sleep(5);
	}
}

void matrix_case()
{
	int base, index, i, a, range, range_temp, temp;
	for (i = 0; i < 3; i++) {
		base = rand() % VMA_SIZE;
		if (base + 8 >= VMA_SIZE)
			base = VMA_SIZE - 8;
		index = rand() % 8;
		index = (index == 7 ? index - 1 : index);
		temp = base + index;
		TOUCH_IT(temp)
		sleep(5);
		temp++;
		TOUCH_IT(temp)
		sleep(5);
	}
	STRONG_RANDOM
	sleep(5);
	STRONG_RANDOM
}

int main(int argc, char **argv)
{
	int a;

	int sem_set;
	struct sembuf sem_buf;
	if ((sem_set = semget(123321, 4, 0)) == -1)
		goto failed;
	sem_buf.sem_num = 0;
	sem_buf.sem_op = 0;
	sem_buf.sem_flg = 0;
	if (semop(sem_set, &sem_buf, 1) == -1)
		goto failed;

	init_mmap();
	printf("primarywait\n");
	scanf("%d", &a);
	printf("primaryresume\n");
	switch (a) {
		case 1:
			printf("matrix_case\n");
			matrix_case();
			break;
		case 2:
			printf("c_case\n");
			c_case();
			break;
		case 3:
			printf("destruction_case\n");
			destruction_case();
			break;
	}
	if (a == 2) {
		printf("primaryfullspeed\n");
		scanf("%d", &a);
		fullspeed_case();
	}
	printf("primarydone\n");
	sleep(30000);
	exit(EXIT_SUCCESS);
failed:
	perror("primary errno:");
	exit(EXIT_FAILURE);
}

secondary.c
-----------------------------
#include <sys/mman.h>
#include <sys/sem.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char c; int i, j, sem_set;
	int a, b;
	void* addr;
	struct sembuf sem_buf;
	if ((sem_set = semget(123321, 4, 0)) == -1)
		goto failed;
	sem_buf.sem_num = 1;
	sem_buf.sem_op = 0;
	sem_buf.sem_flg = 0;
	if (semop(sem_set, &sem_buf, 1) == -1)
		goto failed;
	addr = mmap(NULL, 4096 * 32 * 66, PROT_READ | PROT_WRITE, MAP_PRIVATE
		| MAP_ANON, 0, 0);
	printf("secondaryaddr%x\n", addr);
	if (addr == MAP_FAILED)
		goto failed;
	for (i = 0; i < 32 * 66; i++) {
		memset((unsigned char*) addr + i * 4096, i, 1);
		if (i % 32 == 31)
			sleep(1);
	}
	sleep(30000);
	exit(EXIT_SUCCESS);
failed:
	perror("secondary errno:");
	exit(EXIT_FAILURE);
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-01-24 21:15       ` Hugh Dickins
@ 2007-01-29  5:29           ` yunfeng zhang
  2007-01-29  5:29           ` yunfeng zhang
       [not found]         ` <4df04b840701301852i41687edfl1462c4ca3344431c@mail.gmail.com>
  2 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-01-29  5:29 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, linux-kernel

> You have an interesting idea of "simplifies", given
>  16 files changed, 997 insertions(+), 25 deletions(-)
> (omitting your Documentation), and over 7k more code.
> You'll have to be much more persuasive (with good performance
> results) to get us to welcome your added layer of complexity.

If the whole idea were deployed in Linux, the following core objects could be erased
1) anon_vma.
2) pgdata::active/inactive lists and related methods -- mark_page_accessed etc.
3) PrivatePage::count and mapcount. If the core needs to share the page, add a
   PG_kmap flag. In fact, page::lru_list can safely be erased too.
4) All cases then work from up to down, which especially simplifies debugging.

> Please make an effort to support at least i386 3level pagetables:
> you don't actually need >4GB of memory to test CONFIG_HIGHMEM64G.
> HIGHMEM testing shows you're missing a couple of pte_unmap()s,
> in pps_swapoff_scan_ptes() and in shrink_pvma_scan_ptes().

Yes, it's my fault.

> It would be nice if you could support at least x86_64 too
> (you have pte_low code peculiar to i386 in vmscan.c, which is
> preventing that), but that's harder if you don't have the hardware.

Um! The data passed to cmpxchg should include the accessed bit. And I only have
an x86 PC with less than 1G of memory; the 3-level pagetable code was copied
from other Linux functions.
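
To illustrate the point about the accessed bit with a user-space sketch (the
names FAKE_ACCESSED, FAKE_DIRTY and fake_clear_young_and_dirty are made up for
this example; only the GCC __sync builtin is real, and this is not the patch
code itself): the old value handed to cmpxchg must be a full snapshot of the
pte low word, accessed and dirty bits included, so that a concurrent hardware
update makes the compare fail and the loop retry instead of losing the bit.

#include <stdio.h>

/* A fake 32-bit "pte" with accessed/dirty flag bits, for illustration only. */
#define FAKE_ACCESSED	0x20u
#define FAKE_DIRTY	0x40u

static int fake_clear_young_and_dirty(unsigned int *pte)
{
	unsigned int old, new;

	do {
		old = *pte;	/* snapshot includes the A/D bits */
		new = old & ~(FAKE_ACCESSED | FAKE_DIRTY);
	} while (!__sync_bool_compare_and_swap(pte, old, new));
	/* If hardware set A or D after the snapshot, the compare fails and
	   we retry, so no bit update is ever lost. */
	return (old & FAKE_DIRTY) != 0;
}

int main(void)
{
	unsigned int pte = 0x1000u | FAKE_ACCESSED | FAKE_DIRTY;

	printf("was dirty: %d, pte now 0x%x\n",
		fake_clear_young_and_dirty(&pte), pte);
	return 0;
}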

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
       [not found]           ` <Pine.LNX.4.64.0701312022340.26857@blonde.wat.veritas.com>
@ 2007-02-13  5:52               ` yunfeng zhang
  0 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-02-13  5:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: Hugh Dickins, Rik van Riel, linux-mm

You can apply my previous patch on 2.6.20 by changing

-#define VM_PURE_PRIVATE	0x04000000	/* Is the vma is only belonging to a mm,

to

+#define VM_PURE_PRIVATE	0x08000000	/* Is the vma is only belonging to a mm,

The new revision is based on 2.6.20 with my previous patch applied; the major changes are
1) pte_unmap pairs in shrink_pvma_scan_ptes and pps_swapoff_scan_ptes.
2) kppsd can now be woken up by kswapd.
3) A new global variable, accelerate_kppsd, is added to accelerate the
   reclamation process when a memory inode is low.


Signed-off-by: Yunfeng Zhang <zyf.zeroos@gmail.com>

Index: linux-2.6.19/Documentation/vm_pps.txt
===================================================================
--- linux-2.6.19.orig/Documentation/vm_pps.txt	2007-02-12 12:45:07.000000000 +0800
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-02-12 15:30:16.490797672 +0800
@@ -143,23 +143,32 @@
 2) mm/memory.c   do_wp_page, handle_pte_fault::unmapped_pte, do_anonymous_page,
    do_swap_page (page-fault)
 3) mm/memory.c   get_user_pages (sometimes core need share PrivatePage with us)
+4) mm/vmscan.c   balance_pgdat  (kswapd/x can do stage 5 of its node pages,
+   while kppsd can do stage 1-4)
+5) mm/vmscan.c   kppsd          (new core daemon -- kppsd, see below)

 There isn't new lock order defined in pps, that is, it's compliable to Linux
-lock order.
+lock order. Locks in shrink_private_vma copied from shrink_list of 2.6.16.29
+(my initial version).
 // }])>

 // Others about pps <([{
 A new kernel thread -- kppsd is introduced in mm/vmscan.c, its task is to
-execute the stages of pps periodically, note an appropriate timeout ticks is
-necessary so we can give application a chance to re-map back its PrivatePage
-from UnmappedPTE to PTE, that is, show their conglomeration affinity.
-
-kppsd can be controlled by new fields -- scan_control::may_reclaim/reclaim_node
-may_reclaim = 1 means starting reclamation (stage 5).  reclaim_node = (node
-number) is used when a memory node is low. Caller should set them to wakeup_sc,
-then wake up kppsd (vmscan.c:balance_pgdat). Note, if kppsd is started due to
-timeout, it doesn't do stage 5 at all (vmscan.c:kppsd). Other alive legacy
-fields are gfp_mask, may_writepage and may_swap.
+execute the stage 1 - 4 of pps periodically, note an appropriate timeout ticks
+(current 2 seconds) is necessary so we can give application a chance to re-map
+back its PrivatePage from UnmappedPTE to PTE, that is, show their
+conglomeration affinity.
+
+shrink_private_vma can be controlled by new fields -- may_reclaim, reclaim_node
+and is_kppsd of scan_control.  may_reclaim = 1 means starting reclamation
+(stage 5). reclaim_node = (node number, -1 means all memory inode) is used when
+a memory node is low. Caller (kswapd/x), typically, set reclaim_node to start
+shrink_private_vma (vmscan.c:balance_pgdat). Note, only to kppsd is_kppsd = 1.
+Other alive legacy fields to pps are gfp_mask, may_writepage and may_swap.
+
+When a memory inode is low, kswapd/x can wake up kppsd by increasing global
+variable accelerate_kppsd (balance_pgdat), which accelerate stage 1 - 4, and
+call shrink_private_vma to do stage 5.

 PPS statistic data is appended to /proc/meminfo entry, its prototype is in
 include/linux/mm.h.
Index: linux-2.6.19/mm/swapfile.c
===================================================================
--- linux-2.6.19.orig/mm/swapfile.c	2007-02-12 12:45:07.000000000 +0800
+++ linux-2.6.19/mm/swapfile.c	2007-02-12 12:45:21.000000000 +0800
@@ -569,6 +569,7 @@
 			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(pte);
 	return 0;
 }

Index: linux-2.6.19/mm/vmscan.c
===================================================================
--- linux-2.6.19.orig/mm/vmscan.c	2007-02-12 12:45:07.000000000 +0800
+++ linux-2.6.19/mm/vmscan.c	2007-02-12 15:48:59.217292888 +0800
@@ -70,6 +70,7 @@
 	/* pps control command. See Documentation/vm_pps.txt. */
 	int may_reclaim;
 	int reclaim_node;
+	int is_kppsd;
 };

 /*
@@ -1101,9 +1102,9 @@
 	return ret;
 }

-// pps fields.
+// pps fields, see Documentation/vm_pps.txt.
 static wait_queue_head_t kppsd_wait;
-static struct scan_control wakeup_sc;
+static int accelerate_kppsd = 0;
 struct pps_info pps_info = {
 	.total = ATOMIC_INIT(0),
 	.pte_count = ATOMIC_INIT(0), // stage 1 and 2.
@@ -1118,24 +1119,22 @@
 	struct page* pages[MAX_SERIES_LENGTH];
 	int series_length;
 	int series_stage;
-} series;
+};

-static int get_series_stage(pte_t* pte, int index)
+static int get_series_stage(struct series_t* series, pte_t* pte, int index)
 {
-	series.orig_ptes[index] = *pte;
-	series.ptes[index] = pte;
-	if (pte_present(series.orig_ptes[index])) {
-		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
-		series.pages[index] = page;
+	series->orig_ptes[index] = *pte;
+	series->ptes[index] = pte;
+	struct page* page = pfn_to_page(pte_pfn(series->orig_ptes[index]));
+	series->pages[index] = page;
+	if (pte_present(series->orig_ptes[index])) {
 		if (page == ZERO_PAGE(addr)) // reserved page is exclusive from us.
 			return 7;
-		if (pte_young(series.orig_ptes[index])) {
+		if (pte_young(series->orig_ptes[index])) {
 			return 1;
 		} else
 			return 2;
-	} else if (pte_unmapped(series.orig_ptes[index])) {
-		struct page* page = pfn_to_page(pte_pfn(series.orig_ptes[index]));
-		series.pages[index] = page;
+	} else if (pte_unmapped(series->orig_ptes[index])) {
 		if (!PageSwapCache(page))
 			return 3;
 		else {
@@ -1148,19 +1147,20 @@
 		return 6;
 }

-static void find_series(pte_t** start, unsigned long* addr, unsigned long end)
+static void find_series(struct series_t* series, pte_t** start, unsigned long*
+		addr, unsigned long end)
 {
 	int i;
-	int series_stage = get_series_stage((*start)++, 0);
+	int series_stage = get_series_stage(series, (*start)++, 0);
 	*addr += PAGE_SIZE;

 	for (i = 1; i < MAX_SERIES_LENGTH && *addr < end; i++, (*start)++,
 		*addr += PAGE_SIZE) {
-		if (series_stage != get_series_stage(*start, i))
+		if (series_stage != get_series_stage(series, *start, i))
 			break;
 	}
-	series.series_stage = series_stage;
-	series.series_length = i;
+	series->series_stage = series_stage;
+	series->series_length = i;
 }

 struct delay_tlb_task delay_tlb_tasks[32] = { [0 ... 31] = {0} };
@@ -1284,9 +1284,9 @@
 	goto fill_it;
 }

-static void shrink_pvma_scan_ptes(struct scan_control* sc, struct mm_struct*
-		mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned long addr,
-		unsigned long end)
+static unsigned long shrink_pvma_scan_ptes(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned
+		long addr, unsigned long end)
 {
 	int i, statistic;
 	spinlock_t* ptl = pte_lockptr(mm, pmd);
@@ -1295,32 +1295,43 @@
 	struct pagevec freed_pvec;
 	int may_enter_fs = (sc->gfp_mask & (__GFP_FS | __GFP_IO));
 	struct address_space* mapping = &swapper_space;
+	unsigned long nr_reclaimed = 0;
+	struct series_t series;

 	pagevec_init(&freed_pvec, 1);
 	do {
 		memset(&series, 0, sizeof(struct series_t));
-		find_series(&pte, &addr, end);
+		find_series(&series, &pte, &addr, end);
 		if (sc->may_reclaim == 0 && series.series_stage == 5)
 			continue;
+		if (!sc->is_kppsd && series.series_stage != 5)
+			continue;
 		switch (series.series_stage) {
 		case 1: // PTE -- untouched PTE.
 		for (i = 0; i < series.series_length; i++) {
 			struct page* page = series.pages[i];
-			lock_page(page);
+			if (TestSetPageLocked(page))
+				continue;
 			spin_lock(ptl);
-			if (unlikely(pte_same(*series.ptes[i],
-					series.orig_ptes[i]))) {
-				if (pte_dirty(*series.ptes[i]))
-				    set_page_dirty(page);
-				set_pte_at(mm, addr + i * PAGE_SIZE,
-					series.ptes[i],
-					pte_mkold(pte_mkclean(*series.ptes[i])));
+			// To get dirty bit from pte safely, using the idea of
+			// dftlb of stage 2.
+			pte_t pte_new = series.orig_ptes[i];
+			pte_new = pte_mkold(pte_mkclean(series.orig_ptes[i]));
+			if (cmpxchg(&series.ptes[i]->pte_low,
+						series.orig_ptes[i].pte_low,
+						pte_new.pte_low) !=
+				series.orig_ptes[i].pte_low) {
+				spin_unlock(ptl);
+				unlock_page(page);
+				continue;
 			}
+			if (pte_dirty(series.orig_ptes[i]))
+				set_page_dirty(page);
 			spin_unlock(ptl);
 			unlock_page(page);
 		}
 		fill_in_tlb_tasks(vma, addr, addr + (PAGE_SIZE *
-			    series.series_length));
+					series.series_length));
 		break;
 		case 2: // untouched PTE -- UnmappedPTE.
 		/*
@@ -1335,37 +1346,39 @@
 		spin_lock(ptl);
 		statistic = 0;
 		for (i = 0; i < series.series_length; i++) {
-			if (unlikely(pte_same(*series.ptes[i],
-					series.orig_ptes[i]))) {
-				pte_t pte_unmapped = series.orig_ptes[i];
-				pte_unmapped.pte_low &= ~_PAGE_PRESENT;
-				pte_unmapped.pte_low |= _PAGE_UNMAPPED;
-				if (cmpxchg(&series.ptes[i]->pte_low,
-					    series.orig_ptes[i].pte_low,
-					    pte_unmapped.pte_low) !=
-					series.orig_ptes[i].pte_low)
-					continue;
-				page_remove_rmap(series.pages[i], vma);
-				anon_rss--;
-				statistic++;
-			}
+			pte_t pte_unmapped = series.orig_ptes[i];
+			pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+			pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+			if (cmpxchg(&series.ptes[i]->pte_low,
+						series.orig_ptes[i].pte_low,
+						pte_unmapped.pte_low) !=
+				series.orig_ptes[i].pte_low)
+				continue;
+			page_remove_rmap(series.pages[i], vma);
+			anon_rss--;
+			statistic++;
 		}
 		atomic_add(statistic, &pps_info.unmapped_count);
 		atomic_sub(statistic, &pps_info.pte_count);
 		spin_unlock(ptl);
-		break;
+		if (!accelerate_kppsd)
+			break;
 		case 3: // Attach SwapPage to PrivatePage.
 		/*
 		 * A better arithmetic should be applied to Linux SwapDevice to
 		 * allocate fake continual SwapPages which are close to each
 		 * other, the offset between two close SwapPages is less than 8.
+		 *
+		 * We can re-allocate SwapPages here if process private pages
+		 * are pure private.
 		 */
 		if (sc->may_swap) {
 			for (i = 0; i < series.series_length; i++) {
-				lock_page(series.pages[i]);
+				if (TestSetPageLocked(series.pages[i]))
+					continue;
 				if (!PageSwapCache(series.pages[i])) {
 					if (!add_to_swap(series.pages[i],
-						    GFP_ATOMIC)) {
+								GFP_ATOMIC)) {
 						unlock_page(series.pages[i]);
 						break;
 					}
@@ -1373,45 +1386,49 @@
 				unlock_page(series.pages[i]);
 			}
 		}
-		break;
+		if (!accelerate_kppsd)
+			break;
 		case 4: // SwapPage isn't consistent with PrivatePage.
 		/*
 		 * A mini version pageout().
 		 *
 		 * Current swap space can't commit multiple pages together:(
 		 */
-		if (sc->may_writepage && may_enter_fs) {
-			for (i = 0; i < series.series_length; i++) {
-				struct page* page = series.pages[i];
-				int res;
+		if (!(sc->may_writepage && may_enter_fs))
+			break;
+		for (i = 0; i < series.series_length; i++) {
+			struct page* page = series.pages[i];
+			int res;

-				if (!may_write_to_queue(mapping->backing_dev_info))
-					break;
-				lock_page(page);
-				if (!PageDirty(page) || PageWriteback(page)) {
-					unlock_page(page);
-					continue;
-				}
-				clear_page_dirty_for_io(page);
-				struct writeback_control wbc = {
-					.sync_mode = WB_SYNC_NONE,
-					.nr_to_write = SWAP_CLUSTER_MAX,
-					.nonblocking = 1,
-					.for_reclaim = 1,
-				};
-				page_cache_get(page);
-				SetPageReclaim(page);
-				res = swap_writepage(page, &wbc);
-				if (res < 0) {
-					handle_write_error(mapping, page, res);
-					ClearPageReclaim(page);
-					page_cache_release(page);
-					break;
-				}
-				if (!PageWriteback(page))
-					ClearPageReclaim(page);
+			if (!may_write_to_queue(mapping->backing_dev_info))
+				break;
+			if (TestSetPageLocked(page))
+				continue;
+			if (!PageDirty(page) || PageWriteback(page)) {
+				unlock_page(page);
+				continue;
+			}
+			clear_page_dirty_for_io(page);
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+				.nr_to_write = SWAP_CLUSTER_MAX,
+				.range_start = 0,
+				.range_end = LLONG_MAX,
+				.nonblocking = 1,
+				.for_reclaim = 1,
+			};
+			page_cache_get(page);
+			SetPageReclaim(page);
+			res = swap_writepage(page, &wbc);
+			if (res < 0) {
+				handle_write_error(mapping, page, res);
+				ClearPageReclaim(page);
 				page_cache_release(page);
+				break;
 			}
+			if (!PageWriteback(page))
+				ClearPageReclaim(page);
+			page_cache_release(page);
 		}
 		break;
 		case 5: // UnmappedPTE -- SwappedPTE, reclaim PrivatePage.
@@ -1419,10 +1436,11 @@
 		for (i = 0; i < series.series_length; i++) {
 			struct page* page = series.pages[i];
 			if (!(page_to_nid(page) == sc->reclaim_node ||
-				    sc->reclaim_node == -1))
+						sc->reclaim_node == -1))
 				continue;

-			lock_page(page);
+			if (TestSetPageLocked(page))
+				continue;
 			spin_lock(ptl);
 			if (!pte_same(*series.ptes[i], series.orig_ptes[i]) ||
 					/* We're racing with get_user_pages. */
@@ -1449,6 +1467,7 @@
 		atomic_add(statistic, &pps_info.swapped_count);
 		atomic_sub(statistic, &pps_info.unmapped_count);
 		atomic_sub(statistic, &pps_info.total);
+		nr_reclaimed += statistic;
 		break;
 		case 6:
 		// NULL operation!
@@ -1456,58 +1475,67 @@
 		}
 	} while (addr < end);
 	add_mm_counter(mm, anon_rss, anon_rss);
+	pte_unmap(pte);
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
+	return nr_reclaimed;
 }

-static void shrink_pvma_pmd_range(struct scan_control* sc, struct mm_struct*
-		mm, struct vm_area_struct* vma, pud_t* pud, unsigned long addr,
-		unsigned long end)
+static unsigned long shrink_pvma_pmd_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pud_t* pud, unsigned
+		long addr, unsigned long end)
 {
 	unsigned long next;
+	unsigned long nr_reclaimed = 0;
 	pmd_t* pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
+		nr_reclaimed += shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
 	} while (pmd++, addr = next, addr != end);
+	return nr_reclaimed;
 }

-static void shrink_pvma_pud_range(struct scan_control* sc, struct mm_struct*
-		mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned long addr,
-		unsigned long end)
+static unsigned long shrink_pvma_pud_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned
+		long addr, unsigned long end)
 {
 	unsigned long next;
+	unsigned long nr_reclaimed = 0;
 	pud_t* pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
+		nr_reclaimed += shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
 	} while (pud++, addr = next, addr != end);
+	return nr_reclaimed;
 }

-static void shrink_pvma_pgd_range(struct scan_control* sc, struct mm_struct*
-		mm, struct vm_area_struct* vma)
+static unsigned long shrink_pvma_pgd_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma)
 {
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	unsigned long nr_reclaimed = 0;
 	pgd_t* pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next);
+		nr_reclaimed += shrink_pvma_pud_range(sc, mm, vma, pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
+	return nr_reclaimed;
 }

-static void shrink_private_vma(struct scan_control* sc)
+static unsigned long shrink_private_vma(struct scan_control* sc)
 {
 	struct vm_area_struct* vma;
 	struct list_head *pos;
 	struct mm_struct *prev, *mm;
+	unsigned long nr_reclaimed = 0;

 	prev = mm = &init_mm;
 	pos = &init_mm.mmlist;
@@ -1520,22 +1548,25 @@
 		spin_unlock(&mmlist_lock);
 		mmput(prev);
 		prev = mm;
-		start_tlb_tasks(mm);
+		if (sc->is_kppsd)
+			start_tlb_tasks(mm);
 		if (down_read_trylock(&mm->mmap_sem)) {
 			for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_PURE_PRIVATE))
 					continue;
 				if (vma->vm_flags & VM_LOCKED)
 					continue;
-				shrink_pvma_pgd_range(sc, mm, vma);
+				nr_reclaimed += shrink_pvma_pgd_range(sc, mm, vma);
 			}
 			up_read(&mm->mmap_sem);
 		}
-		end_tlb_tasks();
+		if (sc->is_kppsd)
+			end_tlb_tasks();
 		spin_lock(&mmlist_lock);
 	}
 	spin_unlock(&mmlist_lock);
 	mmput(prev);
+	return nr_reclaimed;
 }

 /*
@@ -1585,10 +1616,12 @@
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);

-	wakeup_sc = sc;
-	wakeup_sc.may_reclaim = 1;
-	wakeup_sc.reclaim_node = pgdat->node_id;
-	wake_up_interruptible(&kppsd_wait);
+	accelerate_kppsd++;
+	wake_up(&kppsd_wait);
+	sc.may_reclaim = 1;
+	sc.reclaim_node = pgdat->node_id;
+	sc.is_kppsd = 0;
+	nr_reclaimed += shrink_private_vma(&sc);

 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;
@@ -2173,26 +2206,24 @@
 static int kppsd(void* p)
 {
 	struct task_struct *tsk = current;
-	int timeout;
 	DEFINE_WAIT(wait);
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
 	struct scan_control default_sc;
 	default_sc.gfp_mask = GFP_KERNEL;
-	default_sc.may_writepage = 1;
 	default_sc.may_swap = 1;
 	default_sc.may_reclaim = 0;
 	default_sc.reclaim_node = -1;
+	default_sc.is_kppsd = 1;

 	while (1) {
 		try_to_freeze();
-		prepare_to_wait(&kppsd_wait, &wait, TASK_INTERRUPTIBLE);
-		timeout = schedule_timeout(2000);
-		finish_wait(&kppsd_wait, &wait);
-
-		if (timeout)
-			shrink_private_vma(&wakeup_sc);
-		else
-			shrink_private_vma(&default_sc);
+		accelerate_kppsd >>= 1;
+		wait_event_timeout(kppsd_wait, accelerate_kppsd != 0,
+				msecs_to_jiffies(2000));
+		default_sc.may_writepage = !laptop_mode;
+		if (accelerate_kppsd)
+			default_sc.may_writepage = 1;
+		shrink_private_vma(&default_sc);
 	}
 	return 0;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-02-13  5:52               ` yunfeng zhang
@ 2007-02-20  9:06                 ` yunfeng zhang
  -1 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-02-20  9:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: Hugh Dickins, Rik van Riel, linux-mm

The following algorithm is based on the SwapSpace bitmap management discussed
in the postscript section of my patch. Two operations are implemented: one
allocates a group of fake-continual swap entries, the other re-allocates swap
entries in stage 3, for example when a series turns out to be too short. (A
small worked example follows the code.)


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
// Maps a byte value to its number of zero (free) bits. Two hardware cache
// lines; you could also compact it into one hardware cache line.
char bits_per_short[256] = {
	8, 7, 7, 6, 7, 6, 6, 5,
	7, 6, 6, 5, 6, 5, 5, 4,
	7, 6, 6, 5, 6, 5, 5, 4,
	6, 5, 5, 4, 5, 4, 4, 3,
	7, 6, 6, 5, 6, 5, 5, 4,
	6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3,
	5, 4, 4, 3, 4, 3, 3, 2,
	7, 6, 6, 5, 6, 5, 5, 4,
	6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3,
	5, 4, 4, 3, 4, 3, 3, 2,
	6, 5, 5, 4, 5, 4, 4, 3,
	5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2,
	4, 3, 3, 2, 3, 2, 2, 1,
	7, 6, 6, 5, 6, 5, 5, 4,
	6, 5, 5, 4, 5, 4, 4, 3,
	6, 5, 5, 4, 5, 4, 4, 3,
	5, 4, 4, 3, 4, 3, 3, 2,
	6, 5, 5, 4, 5, 4, 4, 3,
	5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2,
	4, 3, 3, 2, 3, 2, 2, 1,
	6, 5, 5, 4, 5, 4, 4, 3,
	5, 4, 4, 3, 4, 3, 3, 2,
	5, 4, 4, 3, 4, 3, 3, 2,
	4, 3, 3, 2, 3, 2, 2, 1,
	5, 4, 4, 3, 4, 3, 3, 2,
	4, 3, 3, 2, 3, 2, 2, 1,
	4, 3, 3, 2, 3, 2, 2, 1,
	3, 2, 2, 1, 2, 1, 1, 0
};
unsigned char swap_bitmap[32];	/* 1 bit per swap slot: set = in use, clear = free */
// Allocate a group of fake contiguous swap entries: find two adjacent bytes
// (16 slots) holding at least 'size' free bits in total and return the index
// of the first byte, or -1 on failure.
int alloc(int size)
{
	int i, found = 0, result_offset;
	unsigned char a = 0, b = 0;
	for (i = 0; i < 32; i++) {
		b = bits_per_short[swap_bitmap[i]];
		if (a + b >= size) {
			found = 1;
			break;
		}
		a = b;
	}
	result_offset = i == 0 ? 0 : i - 1;
	result_offset = found ? result_offset : -1;
	return result_offset;
}
// Re-allocate in stage 3 if necessary: count the free bits in the byte
// containing 'position' and in its immediate neighbours.
int re_alloc(int position)
{
	int offset = position / 8;
	int a = offset == 0 ? 0 : offset - 1;
	int b = offset == 31 ? 31 : offset + 1;
	int i, empty_bits = 0;
	for (i = a; i <= b; i++) {
		empty_bits += bits_per_short[swap_bitmap[i]];
	}
	return empty_bits;
}
int main(int argc, char **argv)
{
	int i;
	/* Fill the bitmap with a pseudo-random usage pattern. */
	for (i = 0; i < 32; i++) {
		swap_bitmap[i] = (unsigned char) (rand() % 0xff);
	}
	i = 9;
	int byte_index = alloc(i);
	int free_bits = re_alloc(i);
	printf("alloc(%d) -> byte index %d, re_alloc(%d) -> %d free bits nearby\n",
	       i, byte_index, i, free_bits);
	return 0;
}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-02-20  9:06                 ` yunfeng zhang
@ 2007-02-22  1:58                   ` yunfeng zhang
  -1 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-02-22  1:58 UTC (permalink / raw)
  To: linux-kernel; +Cc: Hugh Dickins, Rik van Riel, linux-mm

Any comments or suggestions are always welcomed.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-02-22  1:58                   ` yunfeng zhang
@ 2007-02-22  2:19                     ` Rik van Riel
  -1 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-02-22  2:19 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel, Hugh Dickins, linux-mm

yunfeng zhang wrote:
> Any comments or suggestions are always welcomed.

Same question as always: what problem are you trying to solve?

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-02-22  2:19                     ` Rik van Riel
@ 2007-02-23  2:31                       ` yunfeng zhang
  -1 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-02-23  2:31 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Hugh Dickins, linux-mm

Performance improvement should occur when the private pages of multiple
processes are interleaved in swap, as happens on SMP. On UP, my previous mail
forced that interleaving with a timer; it only demonstrates one fact: if pages
are fully interleaved, the current readahead degrades remarkably, and the
unused readahead pages become a burden on the memory subsystem.

You should re-test on Linux without my patch as follows: run your normal
testcases together, pick one testcase at random and record the pswpin counter
from /proc/vmstat, then redo that testcase alone and record it again. If the
two results are close, your testcases don't interleave private pages at all,
contrary to what you expected, due to Linux scheduling. Thank you!
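
As an illustration of that procedure (my own sketch, not part of the original
mail), the pswpin counter can be sampled from /proc/vmstat before and after a
test run; read_pswpin below is a hypothetical helper, only the pswpin field
name comes from the kernel:

#include <stdio.h>
#include <string.h>

/* Return the current pswpin count from /proc/vmstat, or -1 on error. */
static long read_pswpin(void)
{
	char name[64];
	long value;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", name, &value) == 2) {
		if (strcmp(name, "pswpin") == 0) {
			fclose(f);
			return value;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long before = read_pswpin();
	/* ... run the testcase of interest here ... */
	long after = read_pswpin();

	printf("pswpin delta: %ld\n", after - before);
	return 0;
}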


2007/2/22, Rik van Riel <riel@redhat.com>:
> yunfeng zhang wrote:
> > Any comments or suggestions are always welcomed.
>
> Same question as always: what problem are you trying to solve?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-02-23  2:31                       ` yunfeng zhang
@ 2007-02-23  3:33                         ` Rik van Riel
  -1 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2007-02-23  3:33 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel, Hugh Dickins, linux-mm

yunfeng zhang wrote:
> Performance improvement should occur when private pages of multiple 
> processes are messed up,

Ummm, yes.  Linux used to do this, but doing virtual scans
just does not scale when a system has a really large amount
of memory, a large number of processes and multiple zones.

We've seen it fall apart with as little as 8GB of RAM.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-02-23  3:33                         ` Rik van Riel
@ 2007-02-25  1:47                           ` yunfeng zhang
  -1 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-02-25  1:47 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Hugh Dickins, linux-mm

On a two-CPU machine, the swap-slot distribution of private pages theoretically
looks like

ABABAB....

So every readahead for process A drags in about 4 unused pages (half of a
typical 8-page readahead window belongs to B), unless you are sure B will
resume soon.

Have you ever compared the results among UP, 2-CPU and 4-CPU machines?
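
To make the arithmetic concrete, here is a minimal stand-alone simulation
(mine, not from this thread); the alternating A/B slot layout and the 8-slot
readahead window are simplifying assumptions:

#include <stdio.h>

#define NSLOTS	64	/* simulated swap slots               */
#define WINDOW	8	/* readahead window size (assumption) */

int main(void)
{
	char owner[NSLOTS];
	int fault_slot = 20;	/* a slot owned by process 'A' */
	int start, i, useful = 0;

	/* Alternating layout: ABABAB... as described above. */
	for (i = 0; i < NSLOTS; i++)
		owner[i] = (i % 2) ? 'B' : 'A';

	/* Read ahead WINDOW slots around the faulting slot. */
	start = fault_slot - WINDOW / 2;
	for (i = start; i < start + WINDOW; i++)
		if (i >= 0 && i < NSLOTS && owner[i] == 'A')
			useful++;

	printf("readahead of %d slots: %d useful for A, %d wasted\n",
	       WINDOW, useful, WINDOW - useful);
	return 0;
}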

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-02-25  1:47                           ` yunfeng zhang
@ 2007-08-23  9:47                             ` yunfeng zhang
  -1 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-08-23  9:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, hugh, riel

Signed-off-by: Yunfeng Zhang <zyf.zeroos@gmail.com>

The major changes are
1) Use nail arithmetic to maximize SwapDevice performance.
2) Add a PG_pps bit to mark every pps page.
3) Some discussion about NUMA.
See vm_pps.txt

Index: linux-2.6.22/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22/Documentation/vm_pps.txt	2007-08-23 17:04:12.051837322 +0800
@@ -0,0 +1,365 @@
+
+                         Pure Private Page System (pps)
+                              zyf.zeroos@gmail.com
+                              December 24-26, 2006
+                            Revised on Aug 23, 2007
+
+// Purpose <([{
+The file is used to document the idea which is published firstly at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch of the document is for enhancing the performance of Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.21 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+OK! to modern OS, its memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Memory inode and zone layer (architecture-dependent).
+
+Since the 2nd layer gathers most of the page-access statistics, it's natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, there are some virtues about it
+1) SwapDaemon can collect statistics on how processes access pages and unmap
+   ptes accordingly. SMP especially benefits from this because we can use
+   flush_tlb_range to unmap ptes in batches rather than issuing a TLB IPI
+   interrupt per page as the current Linux legacy swap subsystem does, and the
+   idea I got (dftlb) can do even better.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readaheads the pages relative to the SwapSpace position of current
+   page-fault page.
+3) It's conformable to POSIX madvise API family.
+4) It simplifies Linux memory model dramatically. Keep it in mind that new swap
+   strategy is from up to down. In fact, Linux legacy swap subsystem is maybe
+   the only one from down to up.
+
+Unfortunately, Linux 2.6.21 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+The patch I've done is mainly described in section <Pure Private Page System --
+pps>.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I've mentioned in the previous section, perfectly applying my idea needs to
+uproot the page-centered swap subsystem and migrate it onto the VMA layer, but
+a huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_cache_add_active calls almost anywhere ... it's IMPOSSIBLE for me to
+complete that alone. It also brings a design challenge: a page should be fully
+in the charge of its new owner, yet the Linux page management system still
+traces it through the PG_active flag.
+
+The patch I've made is based on PrivateVMA -- more exactly, a special case of
+it. The current Linux core supports a trick -- COW, used by the fork API. That
+API should be needed rarely; the POSIX thread library and vfork/execve are
+enough for applications, yet as a result COW potentially makes a PrivatePage
+shared, so I think it's unnecessary to Linux -- do copy-on-calling (COC) if
+someone really needs CLONE_MM. My patch implements an independent page-recycle
+system rooted on the Linux legacy page system -- pps, which abbreviates Pure
+Private (page) System. pps intercepts all private pages belonging to
+(Stack/Data)VMAs and recycles them itself. Keep in mind it's a one-to-one model
+-- PrivateVMA, (PresentPTE, UnmappedPTE, SwappedPTE) and (PrivatePage,
+DiskSwapPage). In fact, my patch doesn't change the fork API at all;
+alternatively, if someone calls it, I migrate all pps pages back to Linux in
+migrate_back_legacy_linux(). If Pure PrivateVMA can be accepted totally in
+Linux, it will bring additional virtues
+1) No SwapCache at all. UnmappedPTE + PrivatePage IS the SwapCache of Linux.
+2) swap_info_struct::swap_map can be a bitmap rather than the current (short
+   int) map.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages. The whole process is divided into stages -- see <Stage Definition> --
+and a new algorithm is described in <SwapEntry Nail Arithmetic>. pps uses
+init_mm::mmlist to enumerate all swappable UserSpaces (shrink_private_vma).
+Other sections cover the remaining aspects of pps
+1) <Data Definition> is basic data definition.
+2) <Concurrent racers of Shrinking pps> is focused on synchronization.
+3) <VMA Lifecycle of pps> which VMA is belonging to pps.
+4) <PTE of pps> which pte types are active during pps.
+5) <Private Page Lifecycle of pps> how private pages enter in/go off pps.
+6) <New core daemon -- kppsd> new daemon thread kppsd.
+
+I'm also glad to highlight a new idea of mine -- dftlb, which is described in
+the section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB is introduced by me to enhance TLB-flushing efficiency; in
+brief, the new idea flushes TLBs in batches without EVEN pausing other CPUs.
+The whole sample is vmscan.c:fill_in_tlb_tasks>>end_dftlb. Note, the target CPU
+must support
+1) an atomic cmpxchg instruction.
+2) atomically setting the access bit when the CPU first touches a pte.
+
+And I still wonder whether dftlb can work on other architectures, especially
+with some non-x86 concepts -- invalidate mmu etc. So there is no guarantee in
+my dftlb code or EVEN in my idea.
+// }])>
+
+// Stage Definition <([{
+Every pte-page pair undergoes six stages which are defined in get_series_stage
+of mm/vmscan.c.
+1) Clear present bit of PresentPTE.
+2) Using flush_tlb_range or dftlb to flush the untouched PTEs.
+3) Link or re-link SwapEntry to PrivatePage (nail arithmetic).
+4) Flushing PrivatePage to its SwapPage.
+5) Reclaim the page and shift UnmappedPTE to SwappedPTE.
+6) SwappedPTE stage (Null operation).
+
+Stages are handled in shrink_pvma_scan_ptes; the function is called by the
+global kppsd thread (stages 1-2) and the kswapd of every inode (stages 3-6). So
+each pte-page pair is thread-safe throughout shrink_pvma_scan_ptes. By the way,
+the current series_t instance is placed entirely on the kernel stack, which may
+be too large for a 4K kernel stack.
+// }])>
+
+// Data Definition <([{
+New VMA flag (VM_PURE_PRIVATE) is appended into VMA in include/linux/mm.h.
+The flag is set/cleared in mm/memory.c:{enter_pps, leave_pps} while holding the
+mmap_sem write lock.
+
+New PTE type (UnmappedPTE) is appended into PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+    int present : 1; // must be 0.
+    ...
+    int pageNum : 20;
+};
+The new PTE type has a feature: it keeps a link to its PrivatePage while
+preventing the page from being visited by the CPU, so in <Stage Definition> its
+related page is still available in stages 3-5 even though it's unmapped in
+stage 2. Hold pte_lock to shift it.
+
+New PG_pps flag in include/linux/page-flags.h.
+A page belonging to pps is or-ed with a new flag, PG_pps, which is set/cleared
+in pps_page_{construction, destruction}. The flag should be set/cleared/tested
+under pte_lock if you've only read-locked mmap_sem; an exception is
+get_series_stage of vmscan.c. Its related pte must be a PresentPTE or
+UnmappedPTE, but the converse isn't true, see the next paragraph.
+
+UnmappedPTE + non-PG_ppsPage.
+In fact, it's possible for an UnmappedPTE to link a page without the PG_pps
+flag; the case occurs in pps_swapin_readahead. When a page is readaheaded into
+pps, it's linked not only into the Linux SwapCache but also to its related PTE
+as an UnmappedPTE. Meanwhile, the page isn't or-ed with the PG_pps flag; that is
+done in do_unmapped_page at page-fault time.
+
+Readaheaded PPSPage and SwapCache
+pps excludes SwapCache entirely, but removing it is a heavy job for me since,
+currently, not only the fork API (or shared PrivatePages) but also shmem uses
+it! So I must keep compatible with Linux legacy code: when
+memory.c:swapin_readahead readaheads DiskPages into SwapCache according to the
+offset of the fault page, it also links them into the active list in
+read_swap_cache_async, and some of them maybe ARE ppspages! I place some code
+in do_swap_page and pps_swapin_readahead to remove them from
+zone::(in)active_list, but the code degrades system performance if there's a
+race. The case is a PPSPage resident in memory and SwapCache without an
+UnmappedPTE.
+
+PresentPTE + ReservedPage (ZeroPage).
+To relieve memory pressure, there's a COW case in pps: when a read fault occurs
+on a NullPTE, do_anonymous_page links the ZeroPage to the pte, and creation of
+the PPSPage is delayed until do_wp_page. Meanwhile, the ZeroPage isn't or-ed
+with PG_pps. It's the only case where the pps system uses a Linux legacy page
+directly.
+
+Linux struct page definition in pps.
+Most fields of struct page are unused. Currently, only the flags, _count and
+private fields are active in pps. Other fields are still set to stay compatible
+with Linux. In fact, we can discard the _count field safely: if the core wants
+to share a PrivatePage (get_user_pages and pps_swapoff_scan_ptes), add a new
+PG_kmap bit to the flags field; and pps excludes the swap cache. A definition
+recommended by me is
+struct pps_page {
+        int flags;
+        int unmapped_age; // An advised code in shrink_pvma_scan_ptes.
+        swp_entry_t swp;
+        // the PG_lock/PG_writeback wait queue of the page.
+        wait_queue_head_t wq;
+        slist freePages; // (*)
+}
+*) A single list is enough for a pps-page; when the page is flushed by
+pps_stage4, we can link it into an slist queue to make page reclamation
+quicker.
+
+New fields nr_pps_total, nr_present_pte, nr_unmapped_pte and nr_swapped_pte are
+appended to mmzone.h:pglist_data to track pps statistics, which are output to
+/proc/zoneinfo in mm/vmstat.c.
+// }])>
+
+// Concurrent Racers of pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances; during scanning and reclamation it read-locks
+mm_struct::mmap_sem, which brings some potential concurrent racers
+1) mm/swapfile.c pps_swapoff    (swapoff API)
+2) mm/memory.c   do_{anonymous, unmapped, wp, swap}_page (page-fault)
+3) mm/memory.c   get_user_pages (sometimes core need share PrivatePage with us)
+4) mm/vmscan.c   balance_pgdat  (kswapd/x can do stage 3-5 of its node pages,
+   while kppsd can do stage 1-2)
+5) mm/vmscan.c   kppsd          (new core daemon -- kppsd, see below)
+6) mm/migrate.c  ---            (migrate_entry is a special SwappedPTE, do
+   stage 6-1 and I didn't finish the job yet due to hardware restriction)
+
+Other cases influencing pps are
+Write-locks mmap_sem.
+1) mm/memory.c   zap_pte_range  (free pages)
+2) mm/memory.c   migrate_back_legacy_linux  (exit from pps to Linux when fork)
+
+No influence on mmap_sem.
+1) mm/page_io.c  end_swap_bio_write (device asynchronous writeIO callback)
+2) mm/page_io.c  end_swap_bio_read (device asynchronous readIO callback)
+
+There is no new lock order defined in pps; that is, it complies with the Linux
+lock order. The locks in shrink_private_vma are copied from shrink_list of
+2.6.16.29 (my initial version). The only exception is in pps_shrink_pgdata,
+which locks the former and later pages of a series.
+// }])>
+
+// New core daemon -- kppsd <([{
+A new kernel thread -- kppsd -- is introduced in kppsd(void*) of mm/vmscan.c to
+unmap PrivatePages to their UnmappedPTEs; it runs periodically.
+
+Two pps strategies are provided, for UMA and NUMA respectively. On UMA, the pps
+daemon does stages 1-4 and kswapd/x does stage 5. On NUMA, pps does stages 1-2
+only and kswapd/x does stages 3-5 via the pps lists of pglist_data. All are
+controlled by delivering a pps command through scan_control to
+shrink_private_vma. Currently only the latter is completed.
+
+shrink_private_vma can be controlled by new scan_control fields -- reclaim_node
+and is_kppsd. reclaim_node (a node number; -1 means all memory inodes) is used
+when a memory node is low. A caller (kswapd/x) typically sets reclaim_node to
+make shrink_private_vma (vmscan.c:balance_pgdat) flush and reclaim pages. Note,
+is_kppsd = 1 only for kppsd. The other legacy fields still alive for pps are
+gfp_mask, may_writepage and may_swap.
+
+When a memory inode is low, kswapd/x can wake up kppsd and accelerate it by
+increasing global variable accelerate_kppsd (vmscan.c:balance_pgdat).
+
+For kppsd, unmapping PrivateVMAs in shrink_private_vma isn't everything; there
+are more tasks to be done (unimplemented)
+1) Some application maybe shows its memory inode affinity by mbind API, to pps
+   system, it's recommended to do the migration task at stage 2.
+2) If a memory inode is low, let's immediately migrate the page to other memory
+   inode at stage 2 -- balance NUMA memory inode.
+3) In fact, not only Pure PrivateVMA, Other SharedVMAs can also be scanned and
+   unmapped.
+4) madvise API flags can be handled here.
+Tasks 1 and 2 can be implemented only when the target CPU supports dftlb.
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, it's or-ed with a new flag -- VM_PURE_PRIVATE --
+in memory.c:enter_pps; there you can also find which VMAs are fit for pps. The
+flag is used mainly in shrink_private_vma of mm/vmscan.c. Other fields are left
+untouched.
+
+IN.
+1) fs/exec.c    setup_arg_pages         (StackVMA)
+2) mm/mmap.c    do_mmap_pgoff, do_brk   (DataVMA)
+3) mm/mmap.c    split_vma, copy_vma     (in some cases, we need copy a VMA from
+   an exist VMA)
+
+OUT.
+1) kernel/fork.c   dup_mmap               (if someone uses fork, return the vma
+   back to Linux legacy system)
+2) mm/mmap.c       remove_vma, vma_adjust (destroy VMA)
+3) mm/mmap.c       do_mmap_pgoff          (delete VMA when some errors occur)
+
+The VMAs of pps can coexist with madvise, mlock, mprotect, mmap and munmap;
+that is why a new VMA created by mmap.c:split_vma can re-enter pps.
+// }])>
+
+// PTE of pps <([{
+Active pte types are NullPTE, PresentPTE, UntouchedPTE, UnmappedPTE and
+SwappedPTE in pps.
+
+1) page-fault   {NullPTE, UnmappedPTE} >> PresentPTE    (Other such as
+   get_user_pages, pps_swapoff etc. also use page-fault indirectly)
+2) shrink_pvma_scan_ptes   PresentPTE >> UntouchedPTE >> UnmappedPTE >>
+   SwappedPTE   (In fact, the whole process is done by kppsd and kswapX
+   individually)
+3) -   MigrateEntryPTE >> PresentPTE   (migrate pages between memory inodes)
+// }])>
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is
+PresentPTE or UnmappedPTE. Note, the Linux fork API potentially makes a
+PrivatePage shared by multiple processes, so such a page is excluded from pps.
+
+IN (NOTE, when a pure private page enters pps, it's also trimmed from the Linux
+legacy page system by commenting out the lru_cache_add_active clause)
+1) fs/exec.c    install_arg_page    (argument pages)
+2) mm/memory.c  do_{anonymous, unmapped, wp, swap}_page (page fault)
+3) mm/memory.c    pps_swapin_readahead    (readahead swap-pages) (*)
+*) In fact, it isn't exactly a ppspage, see <Data Definition>.
+
+OUT
+1) mm/vmscan.c  pps_stage5              (stage 5, reclaim a private page)
+2) mm/memory.c  zap_pte_range           (free a page)
+3) kernel/fork.c    dup_mmap>>leave_pps (if someone uses fork, migrate all pps
+   pages back to let Linux legacy page system manage them)
+4) mm/memory.c  do_{unmapped, swap}_page  (swapin pages encounter IO error) (*)
+*) In fact, it isn't exactly a ppspage, see <Data Definition>.
+
+struct pps_page in <Data Definition> has a pair of
+pps_page_{construction, destruction} functions in memory.c. They're used to
+translate the differing fields between page and pps_page.
+// }])>
+
+// pps and NUMA <([{
+The new memory model brings a top-down scanning strategy. Its advantages are
+unmapping ptes in batches by flush_tlb_range or even dftlb, and using nail
+arithmetic to manage SwapSpace. But on NUMA it's another pair of shoes.
+
+On NUMA, to balance memory inodes, the MPOL_INTERLEAVE policy is used by
+default, but that policy also scatters MemoryInodePages everywhere, so when an
+inode is low, the new scanning strategy makes Linux unmap whole page tables to
+reclaim THAT inode to the SwapDevice, which puts heavy pressure on SwapSpace.
+
+Here a new policy is recommended -- MPOL_STRIPINTERLEAVE, see mm/mempolicy.c.
+The policy tries to establish a strip-like mapping between linear addresses and
+InodeSpace, unlike the current MPOL_(LINE)INTERLEAVE, to give scanning and
+flushing more affinity. The disadvantages are
+1) The relationship can easily be broken by the user calling mbind with a
+   different inode set.
+2) On page-fault, to maintain the fixed relationship, the new page must be
+   allocated from the referred memory inode even if it's low.
+3) Note, for a StackVMA (fs/exec.c:install_arg_page), the last pages are
+   argument pages, which may not belong to our target memory inode.
+
+Another policy is balancing memory inodes by dftlb, see the <kppsd> section.
+// }])>
+
+// SwapEntry Nail Arithmetic <([{
+Nail arithmetic is introduced by me to enhance the efficiency of the swap
+subsystem. There's nothing mysterious about it: in brief, in a typical series
+some members are SwappedPTEs (called nail SwapEntries), and the other members
+should have their SwapEntries relinked around these SwappedPTEs. The algorithm
+is based on the fact that pages of the same series have a conglomerating
+affinity. Another virtue is that it also minimizes SwapDevice fragmentation.
+
+The algorithm is mainly divided into two parts -- vmscan.c:{pps_shrink_pgdata,
+pps_stage3}.
+1) For pps_shrink_pgdata, the first task is cataloging the items of a series
+   into two genres: one called 'nail', whose swap entries can't be re-allocated
+   currently, and the other called 'realloc_pages', which should be allocated
+   again around the nails. Another task is maintaining
+   pglist_data::last_nail_swp, which is used to extend the continuity of the
+   former series into the later one. I also highlight the series continuity
+   rules, which are described in the function.
+2) For pps_stage3, it and its helpers calc_realloc and realloc_around_nails
+   (re-)allocate swap entries for the realloc_pages around the nail_swps.
+
+I also append some new APIs in swap_state.c:pps_relink_swp and
+swapfile.c:{swap_try_alloc_batchly, swap_alloc_around_nail, swap_alloc_batchly,
+swap_free_batchly, scan_swap_map_batchly} to cater to the algorithm. shm
+should also benefit from these APIs.
+// }])>
+
+// Miscellaneous <([{
+Due to hardware restrictions, migration between memory inodes and migrate-entry
+support aren't completed!
+// }])>
+// vim: foldmarker=<([{,}])> foldmethod=marker et
Index: linux-2.6.22/fs/exec.c
===================================================================
--- linux-2.6.22.orig/fs/exec.c	2007-08-23 15:26:44.374380322 +0800
+++ linux-2.6.22/fs/exec.c	2007-08-23 15:30:09.555203322 +0800
@@ -326,11 +326,10 @@
 		pte_unmap_unlock(pte, ptl);
 		goto out;
 	}
+	pps_page_construction(page, vma, address);
 	inc_mm_counter(mm, anon_rss);
-	lru_cache_add_active(page);
-	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
-					page, vma->vm_page_prot))));
-	page_add_new_anon_rmap(page, vma, address);
+	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(page,
+			    vma->vm_page_prot))));
 	pte_unmap_unlock(pte, ptl);

 	/* no need for flush_tlb */
@@ -440,6 +439,7 @@
 			kmem_cache_free(vm_area_cachep, mpnt);
 			return ret;
 		}
+		enter_pps(mm, mpnt);
 		mm->stack_vm = mm->total_vm = vma_pages(mpnt);
 	}

Index: linux-2.6.22/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.22.orig/include/asm-i386/pgtable-2level.h	2007-08-23 15:26:44.398381822 +0800
+++ linux-2.6.22/include/asm-i386/pgtable-2level.h	2007-08-23 15:30:09.559203572 +0800
@@ -73,21 +73,22 @@
 }

 /*
- * Bits 0, 6 and 7 are taken, split up the 29 bits of offset
+ * Bits 0, 5, 6 and 7 are taken, split up the 28 bits of offset
  * into this range:
  */
-#define PTE_FILE_MAX_BITS	29
+#define PTE_FILE_MAX_BITS	28

 #define pte_to_pgoff(pte) \
-	((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))
+	((((pte).pte_low >> 1) & 0xf ) + (((pte).pte_low >> 8) << 4 ))

 #define pgoff_to_pte(off) \
-	((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
+	((pte_t) { (((off) & 0xf) << 1) + (((off) >> 4) << 8) + _PAGE_FILE })

 /* Encode and de-code a swap entry */
-#define __swp_type(x)			(((x).val >> 1) & 0x1f)
+#define __swp_type(x)			(((x).val >> 1) & 0xf)
 #define __swp_offset(x)			((x).val >> 8)
-#define __swp_entry(type, offset)	((swp_entry_t) { ((type) << 1) | ((offset) << 8) })
+#define __swp_entry(type, offset)	((swp_entry_t) { ((type & 0xf) << 1) |\
+	((offset) << 8) | _PAGE_SWAPPED })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

Index: linux-2.6.22/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.22.orig/include/asm-i386/pgtable.h	2007-08-23 15:26:44.422383322 +0800
+++ linux-2.6.22/include/asm-i386/pgtable.h	2007-08-23 15:30:09.559203572 +0800
@@ -120,7 +120,11 @@
 #define _PAGE_UNUSED3	0x800

 /* If _PAGE_PRESENT is clear, we use these: */
-#define _PAGE_FILE	0x040	/* nonlinear file mapping, saved PTE; unset:swap */
+#define _PAGE_UNMAPPED	0x020	/* a special PTE type, hold its page reference
+				   even it's unmapped, see more from
+				   Documentation/vm_pps.txt. */
+#define _PAGE_SWAPPED 0x040 /* swapped PTE. */
+#define _PAGE_FILE	0x060	/* nonlinear file mapping, saved PTE; */
 #define _PAGE_PROTNONE	0x080	/* if the user mapped it with PROT_NONE;
 				   pte_present gives true */
 #ifdef CONFIG_X86_PAE
@@ -228,7 +232,12 @@
 /*
  * The following only works if pte_present() is not true.
  */
-static inline int pte_file(pte_t pte)		{ return (pte).pte_low & _PAGE_FILE; }
+static inline int pte_unmapped(pte_t pte)	{ return ((pte).pte_low & 0x60) == _PAGE_UNMAPPED; }
+static inline int pte_swapped(pte_t pte)	{ return ((pte).pte_low & 0x60) == _PAGE_SWAPPED; }
+static inline int pte_file(pte_t pte)		{ return ((pte).pte_low & 0x60) == _PAGE_FILE; }

 static inline pte_t pte_rdprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
 static inline pte_t pte_exprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
@@ -241,6 +250,7 @@
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_mkunmapped(pte_t pte)	{ (pte).pte_low &= ~(_PAGE_PRESENT + 0x60); (pte).pte_low |= _PAGE_UNMAPPED; return pte; }

 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
Index: linux-2.6.22/include/linux/mm.h
===================================================================
--- linux-2.6.22.orig/include/linux/mm.h	2007-08-23 15:26:44.450385072 +0800
+++ linux-2.6.22/include/linux/mm.h	2007-08-23 15:30:09.559203572 +0800
@@ -169,6 +169,9 @@
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP	0x04000000	/* Always include in core dumps */
+#define VM_PURE_PRIVATE	0x08000000	/* The vma belongs to only one mm; see
+					   Documentation/vm_pps.txt */

 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -1210,5 +1213,16 @@

 __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);

+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma);
+void leave_pps(struct vm_area_struct* vma, int migrate_flag);
+void pps_page_construction(struct page* page, struct vm_area_struct* vma,
+	unsigned long address);
+void pps_page_destruction(struct page* ppspage, struct vm_area_struct* vma,
+	unsigned long address, int migrate);
+
+#define numa_addr_to_nid(vma, addr) (0)
+
+#define SERIES_LENGTH 8
+#define SERIES_BOUND (SERIES_LENGTH + 1) // used for array declaration.
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux-2.6.22/include/linux/mmzone.h
===================================================================
--- linux-2.6.22.orig/include/linux/mmzone.h	2007-08-23 15:26:44.470386322 +0800
+++ linux-2.6.22/include/linux/mmzone.h	2007-08-23 15:30:09.559203572 +0800
@@ -452,6 +452,15 @@
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+
+	// pps fields, see Documentation/vm_pps.txt.
+	unsigned long last_nail_addr;
+	int last_nail_swp_type;
+	int last_nail_swp_offset;
+	atomic_t nr_pps_total; // = nr_present_pte + nr_unmapped_pte.
+	atomic_t nr_present_pte;
+	atomic_t nr_unmapped_pte;
+	atomic_t nr_swapped_pte;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
Index: linux-2.6.22/include/linux/page-flags.h
===================================================================
--- linux-2.6.22.orig/include/linux/page-flags.h	2007-08-23 15:26:44.494387822 +0800
+++ linux-2.6.22/include/linux/page-flags.h	2007-08-23 15:30:09.559203572 +0800
@@ -90,6 +90,8 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_buddy		19	/* Page is free, on buddy lists */

+#define PG_pps			20	/* See Documentation/vm_pps.txt */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */

@@ -282,4 +284,8 @@
 	test_set_page_writeback(page);
 }

+// Hold PG_locked to set/clear PG_pps.
+#define PagePPS(page)		test_bit(PG_pps, &(page)->flags)
+#define SetPagePPS(page)	set_bit(PG_pps, &(page)->flags)
+#define ClearPagePPS(page)	clear_bit(PG_pps, &(page)->flags)
 #endif	/* PAGE_FLAGS_H */
Index: linux-2.6.22/include/linux/swap.h
===================================================================
--- linux-2.6.22.orig/include/linux/swap.h	2007-08-23 15:26:44.514389072 +0800
+++ linux-2.6.22/include/linux/swap.h	2007-08-23 15:30:09.559203572 +0800
@@ -227,6 +227,7 @@
 #define total_swapcache_pages  swapper_space.nrpages
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, gfp_t);
+extern int pps_relink_swp(struct page*, swp_entry_t, swp_entry_t**);
 extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
 extern int move_to_swap_cache(struct page *, swp_entry_t);
@@ -238,6 +239,10 @@
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
 /* linux/mm/swapfile.c */
+extern void swap_free_batchly(swp_entry_t*);
+extern void swap_alloc_around_nail(swp_entry_t, int, swp_entry_t*);
+extern int swap_try_alloc_batchly(swp_entry_t, int, swp_entry_t*);
+extern int swap_alloc_batchly(int, swp_entry_t*, int);
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
 extern void si_swapinfo(struct sysinfo *);
Index: linux-2.6.22/include/linux/swapops.h
===================================================================
--- linux-2.6.22.orig/include/linux/swapops.h	2007-08-23 15:26:44.538390572 +0800
+++ linux-2.6.22/include/linux/swapops.h	2007-08-23 15:30:09.559203572 +0800
@@ -50,7 +50,7 @@
 {
 	swp_entry_t arch_entry;

-	BUG_ON(pte_file(pte));
+	BUG_ON(!pte_swapped(pte));
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
@@ -64,7 +64,7 @@
 	swp_entry_t arch_entry;

 	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
-	BUG_ON(pte_file(__swp_entry_to_pte(arch_entry)));
+	BUG_ON(!pte_swapped(__swp_entry_to_pte(arch_entry)));
 	return __swp_entry_to_pte(arch_entry);
 }

Index: linux-2.6.22/kernel/fork.c
===================================================================
--- linux-2.6.22.orig/kernel/fork.c	2007-08-23 15:26:44.562392072 +0800
+++ linux-2.6.22/kernel/fork.c	2007-08-23 15:30:09.559203572 +0800
@@ -241,6 +241,7 @@
 		tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (!tmp)
 			goto fail_nomem;
+		leave_pps(mpnt, 1);
 		*tmp = *mpnt;
 		pol = mpol_copy(vma_policy(mpnt));
 		retval = PTR_ERR(pol);
Index: linux-2.6.22/mm/fremap.c
===================================================================
--- linux-2.6.22.orig/mm/fremap.c	2007-08-23 15:26:44.582393322 +0800
+++ linux-2.6.22/mm/fremap.c	2007-08-23 15:30:09.563203822 +0800
@@ -37,7 +37,7 @@
 			page_cache_release(page);
 		}
 	} else {
-		if (!pte_file(pte))
+		if (pte_swapped(pte))
 			free_swap_and_cache(pte_to_swp_entry(pte));
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
Index: linux-2.6.22/mm/memory.c
===================================================================
--- linux-2.6.22.orig/mm/memory.c	2007-08-23 15:26:44.602394572 +0800
+++ linux-2.6.22/mm/memory.c	2007-08-23 15:30:09.563203822 +0800
@@ -435,7 +435,7 @@

 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
-		if (!pte_file(pte)) {
+		if (pte_swapped(pte)) {
 			swp_entry_t entry = pte_to_swp_entry(pte);

 			swap_duplicate(entry);
@@ -628,6 +628,7 @@
 	spinlock_t *ptl;
 	int file_rss = 0;
 	int anon_rss = 0;
+	struct pglist_data* node_data;

 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -637,6 +638,7 @@
 			(*zap_work)--;
 			continue;
 		}
+		node_data = NODE_DATA(numa_addr_to_nid(vma, addr));

 		(*zap_work) -= PAGE_SIZE;

@@ -672,6 +674,15 @@
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				if (page != ZERO_PAGE(addr)) {
+					pps_page_destruction(page,vma,addr,0);
+					if (PageWriteback(page)) // WriteIOing.
+						lru_cache_add_active(page);
+					atomic_dec(&node_data->nr_present_pte);
+				}
+			} else
+				page_remove_rmap(page, vma);
 			if (PageAnon(page))
 				anon_rss--;
 			else {
@@ -681,7 +692,6 @@
 					SetPageReferenced(page);
 				file_rss--;
 			}
-			page_remove_rmap(page, vma);
 			tlb_remove_page(tlb, page);
 			continue;
 		}
@@ -691,8 +701,31 @@
 		 */
 		if (unlikely(details))
 			continue;
-		if (!pte_file(ptent))
+		if (pte_unmapped(ptent)) {
+			struct page* page = pfn_to_page(pte_pfn(ptent));
+			BUG_ON(page == ZERO_PAGE(addr));
+			if (PagePPS(page)) {
+				pps_page_destruction(page, vma, addr, 0);
+				atomic_dec(&node_data->nr_unmapped_pte);
+				tlb_remove_page(tlb, page);
+			} else {
+				swp_entry_t entry;
+				entry.val = page_private(page);
+				atomic_dec(&node_data->nr_swapped_pte);
+				if (PageLocked(page)) // ReadIOing.
+					lru_cache_add_active(page);
+				else
+					free_swap_and_cache(entry);
+			}
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			anon_rss--;
+			continue;
+		}
+		if (pte_swapped(ptent)) {
+			if (vma->vm_flags & VM_PURE_PRIVATE)
+				atomic_dec(&node_data->nr_swapped_pte);
 			free_swap_and_cache(pte_to_swp_entry(ptent));
+		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

@@ -955,7 +988,8 @@
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
-		mark_page_accessed(page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			mark_page_accessed(page);
 	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
@@ -1745,8 +1779,11 @@
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
-		lru_cache_add_active(new_page);
-		page_add_new_anon_rmap(new_page, vma, address);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+			lru_cache_add_active(new_page);
+			page_add_new_anon_rmap(new_page, vma, address);
+		} else
+			pps_page_construction(new_page, vma, address);

 		/* Free the old page.. */
 		new_page = old_page;
@@ -2082,7 +2119,7 @@
 	for (i = 0; i < num; offset++, i++) {
 		/* Ok, do the async read-ahead now */
 		new_page = read_swap_cache_async(swp_entry(swp_type(entry),
-							   offset), vma, addr);
+			    offset), vma, addr);
 		if (!new_page)
 			break;
 		page_cache_release(new_page);
@@ -2111,6 +2148,156 @@
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 }

+static pte_t* pte_offsetof_base(struct vm_area_struct* vma, pte_t* base,
+		unsigned long base_addr, int offset_index)
+{
+	unsigned long offset_addr;
+	offset_addr = base_addr + offset_index * PAGE_SIZE;
+	if (offset_addr < vma->vm_start || offset_addr >= vma->vm_end)
+		return NULL;
+	if (pgd_index(offset_addr) != pgd_index(base_addr))
+		return NULL;
+	// if (pud_index(offset_addr) != pud_index(base_addr))
+	// 	return NULL;
+	if (pmd_index(offset_addr) != pmd_index(base_addr))
+		return NULL;
+	return base - pte_index(base_addr) + pte_index(offset_addr);
+}
+
+/*
+ * New read ahead code, mainly for VM_PURE_PRIVATE only.
+ */
+static void pps_swapin_readahead(swp_entry_t entry, unsigned long addr, struct
+	vm_area_struct *vma, pte_t* pte, pmd_t* pmd)
+{
+	struct zone* zone;
+	struct page* page;
+	pte_t *prev, *next, orig, pte_unmapped;
+	swp_entry_t temp;
+	int swapType = swp_type(entry);
+	int swapOffset = swp_offset(entry);
+	int readahead = 0, i;
+	spinlock_t *ptl = pte_lockptr(vma->vm_mm, pmd);
+	unsigned long addr_temp;
+
+	if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+		swapin_readahead(entry, addr, vma);
+		return;
+	}
+
+	page = read_swap_cache_async(entry, vma, addr);
+	if (!page)
+		return;
+	page_cache_release(page);
+	lru_add_drain();
+
+	// pps readahead, first forward then backward, the whole range is +/-
+	// 16 ptes around fault-pte but at most 8 pages are readaheaded.
+	//
+	// The best solution is readaheading fault-cacheline +
+	// prev/next-cacheline. But I don't know how to get the size of
+	// CPU-cacheline.
+	//
+	// New readahead strategy is for the case -- PTE/UnmappedPTE is mixing
+	// with SwappedPTE which means the VMA is accessed randomly, so we
+	// don't stop when encounter a PTE/UnmappedPTE but continue to scan,
+	// all SwappedPTEs which close to fault-pte are readaheaded.
+	for (i = 1; i <= 16 && readahead < 8; i++) {
+		next = pte_offsetof_base(vma, pte, addr, i);
+		if (next == NULL)
+			break;
+		orig = *next;
+		if (pte_none(orig) || pte_present(orig) || !pte_swapped(orig))
+			continue;
+		temp = pte_to_swp_entry(orig);
+		if (swp_type(temp) != swapType)
+			continue;
+		if (abs(swp_offset(temp) - swapOffset) > 32)
+			// the two swap entries are too far, give up!
+			continue;
+		addr_temp = addr + i * PAGE_SIZE;
+		page = read_swap_cache_async(temp, vma, addr_temp);
+		if (!page)
+			return;
+		lru_add_drain();
+		// Add the page into pps, first remove it from (in)activelist.
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		while (1) {
+			if (!PageLRU(page)) {
+				// Shit! vmscan.c:isolate_lru_page is working
+				// on it!
+				spin_unlock_irq(&zone->lru_lock);
+				cond_resched();
+				spin_lock_irq(&zone->lru_lock);
+			} else {
+				list_del(&page->lru);
+				ClearPageActive(page);
+				ClearPageLRU(page);
+				break;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+		pte_unmapped = mk_pte(page, vma->vm_page_prot);
+		pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+		pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+		spin_lock(ptl);
+		if (unlikely(pte_same(*next, orig))) {
+			set_pte_at(vma->vm_mm, addr_temp, next, pte_unmapped);
+			readahead++;
+		}
+		spin_unlock(ptl);
+	}
+	for (i = -1; i >= -16 && readahead < 8; i--) {
+		prev = pte_offsetof_base(vma, pte, addr, i);
+		if (prev == NULL)
+			break;
+		orig = *prev;
+		if (pte_none(orig) || pte_present(orig) || !pte_swapped(orig))
+			continue;
+		temp = pte_to_swp_entry(orig);
+		if (swp_type(temp) != swapType)
+			continue;
+		if (abs(swp_offset(temp) - swapOffset) > 32)
+			// the two swap entries are too far, give up!
+			continue;
+		addr_temp = addr + i * PAGE_SIZE;
+		page = read_swap_cache_async(temp, vma, addr_temp);
+		if (!page)
+			return;
+		lru_add_drain();
+		// Add the page into pps, first remove it from (in)activelist.
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		while (1) {
+			if (!PageLRU(page)) {
+				// Shit! vmscan.c:isolate_lru_page is working
+				// on it!
+				spin_unlock_irq(&zone->lru_lock);
+				cond_resched();
+				spin_lock_irq(&zone->lru_lock);
+			} else {
+				list_del(&page->lru);
+				ClearPageActive(page);
+				ClearPageLRU(page);
+				break;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+		pte_unmapped = mk_pte(page, vma->vm_page_prot);
+		pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+		pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+		spin_lock(ptl);
+		if (unlikely(pte_same(*prev, orig))) {
+			set_pte_at(vma->vm_mm, addr_temp, prev, pte_unmapped);
+			readahead++;
+		}
+		spin_unlock(ptl);
+	}
+}
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2125,6 +2312,7 @@
 	swp_entry_t entry;
 	pte_t pte;
 	int ret = VM_FAULT_MINOR;
+	struct pglist_data* node_data;

 	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
 		goto out;
@@ -2138,7 +2326,7 @@
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		grab_swap_token(); /* Contend for token _before_ read-in */
- 		swapin_readahead(entry, address, vma);
+		pps_swapin_readahead(entry, address, vma, page_table, pmd);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
@@ -2161,6 +2349,26 @@
 	mark_page_accessed(page);
 	lock_page(page);

+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		// Add the page into pps, first remove it from (in)activelist.
+		struct zone* zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		while (1) {
+			if (!PageLRU(page)) {
+				// Shit! vmscan.c:isolate_lru_page is working
+				// on it!
+				spin_unlock_irq(&zone->lru_lock);
+				cond_resched();
+				spin_lock_irq(&zone->lru_lock);
+			} else {
+				list_del(&page->lru);
+				ClearPageActive(page);
+				ClearPageLRU(page);
+				break;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+	}
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
@@ -2170,6 +2378,8 @@

 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
+		if (vma->vm_flags & VM_PURE_PRIVATE)
+			lru_cache_add_active(page);
 		goto out_nomap;
 	}

@@ -2181,15 +2391,25 @@
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		// To pps, there's no copy-on-write (COW).
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		write_access = 0;
+	}

 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
-	page_add_anon_rmap(page, vma, address);

 	swap_free(entry);
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);
 	unlock_page(page);
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		node_data = NODE_DATA(page_to_nid(page));
+		pps_page_construction(page, vma, address);
+		atomic_dec(&node_data->nr_swapped_pte);
+	} else
+		page_add_anon_rmap(page, vma, address);

 	if (write_access) {
 		if (do_wp_page(mm, vma, address,
@@ -2241,9 +2461,12 @@
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
+		if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+			lru_cache_add_active(page);
+			page_add_new_anon_rmap(page, vma, address);
+		} else
+			pps_page_construction(page, vma, address);
 		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
-		page_add_new_anon_rmap(page, vma, address);
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
@@ -2508,6 +2731,76 @@
 	return VM_FAULT_MAJOR;
 }

+// pps special page-fault route, see Documentation/vm_pps.txt.
+static int do_unmapped_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		int write_access, pte_t orig_pte)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	pte_t pte;
+	int ret = VM_FAULT_MINOR;
+	struct page* page;
+	swp_entry_t entry;
+	struct pglist_data* node_data;
+	BUG_ON(!(vma->vm_flags & VM_PURE_PRIVATE));
+
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*page_table, orig_pte)))
+		goto unlock;
+	page = pte_page(*page_table);
+	node_data = NODE_DATA(page_to_nid(page));
+	if (PagePPS(page)) {
+		// The page is a pure UnmappedPage done by pps_stage2.
+		pte = mk_pte(page, vma->vm_page_prot);
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		flush_icache_page(vma, page);
+		set_pte_at(mm, address, page_table, pte);
+		update_mmu_cache(vma, address, pte);
+		lazy_mmu_prot_update(pte);
+		atomic_dec(&node_data->nr_unmapped_pte);
+		atomic_inc(&node_data->nr_present_pte);
+		goto unlock;
+	}
+	entry.val = page_private(page);
+	page_cache_get(page);
+	spin_unlock(ptl);
+	// The page is a readahead page.
+	lock_page(page);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*page_table, orig_pte)))
+		goto out_nomap;
+	if (unlikely(!PageUptodate(page))) {
+		ret = VM_FAULT_SIGBUS;
+		// If we encounter an IO error, unlink the page from
+		// UnmappedPTE to SwappedPTE to let Linux recycles it.
+		set_pte_at(mm, address, page_table, swp_entry_to_pte(entry));
+		lru_cache_add_active(page);
+		goto out_nomap;
+	}
+	inc_mm_counter(mm, anon_rss);
+	pte = mk_pte(page, vma->vm_page_prot);
+	pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+	flush_icache_page(vma, page);
+	set_pte_at(mm, address, page_table, pte);
+	pps_page_construction(page, vma, address);
+	swap_free(entry);
+	if (vm_swap_full())
+		remove_exclusive_swap_page(page);
+	update_mmu_cache(vma, address, pte);
+	lazy_mmu_prot_update(pte);
+	atomic_dec(&node_data->nr_swapped_pte);
+	unlock_page(page);
+
+unlock:
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+out_nomap:
+	pte_unmap_unlock(page_table, ptl);
+	unlock_page(page);
+	page_cache_release(page);
+	return ret;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -2530,6 +2823,9 @@

 	entry = *pte;
 	if (!pte_present(entry)) {
+		if (pte_unmapped(entry))
+			return do_unmapped_page(mm, vma, address, pte, pmd,
+					write_access, entry);
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (vma->vm_ops->nopage)
@@ -2817,3 +3113,147 @@

 	return buf - old_buf;
 }
+
+static void migrate_back_pte_range(struct mm_struct* mm, pmd_t *pmd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	struct page* page;
+	swp_entry_t swp;
+	pte_t entry;
+	pte_t *pte;
+	spinlock_t* ptl;
+	struct pglist_data* node_data;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	do {
+		node_data = NODE_DATA(numa_addr_to_nid(vma, addr));
+		if (pte_present(*pte)) {
+			page = pte_page(*pte);
+			if (page == ZERO_PAGE(addr))
+				continue;
+			pps_page_destruction(page, vma, addr, 1);
+			lru_cache_add_active(page);
+			atomic_dec(&node_data->nr_present_pte);
+		} else if (pte_unmapped(*pte)) {
+			page = pte_page(*pte);
+			BUG_ON(page == ZERO_PAGE(addr));
+			if (!PagePPS(page)) {
+				// the page is a readaheaded page, so convert
+				// UnmappedPTE to SwappedPTE.
+				swp.val = page_private(page);
+				entry = swp_entry_to_pte(swp);
+				atomic_dec(&node_data->nr_swapped_pte);
+			} else {
+				// UnmappedPTE to PresentPTE.
+				entry = mk_pte(page, vma->vm_page_prot);
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+				pps_page_destruction(page, vma, addr, 1);
+				atomic_dec(&node_data->nr_unmapped_pte);
+			}
+			set_pte_at(mm, addr, pte, entry);
+			lru_cache_add_active(page);
+		} else if (pte_swapped(*pte))
+			atomic_dec(&node_data->nr_swapped_pte);
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap_unlock(pte - 1, ptl);
+	lru_add_drain();
+}
+
+static void migrate_back_pmd_range(struct mm_struct* mm, pud_t *pud, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		migrate_back_pte_range(mm, pmd, vma, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void migrate_back_pud_range(struct mm_struct* mm, pgd_t *pgd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		migrate_back_pmd_range(mm, pud, vma, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+// Migrate all pages of a pure private VMA back to Linux legacy memory
+// management.
+static void migrate_back_legacy_linux(struct mm_struct* mm,
+		struct vm_area_struct* vma)
+{
+	pgd_t* pgd;
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		migrate_back_pud_range(mm, pgd, vma, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma)
+{
+	int condition = VM_READ | VM_WRITE | VM_EXEC | \
+		 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \
+		 VM_GROWSDOWN | VM_GROWSUP | \
+		 VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | \
+		 VM_ACCOUNT | VM_PURE_PRIVATE;
+	if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) {
+		vma->vm_flags |= VM_PURE_PRIVATE;
+		if (list_empty(&mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&mm->mmlist))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+	}
+}
+
+/*
+ * Caller must down_write mmap_sem.
+ */
+void leave_pps(struct vm_area_struct* vma, int migrate_flag)
+{
+	struct mm_struct* mm = vma->vm_mm;
+
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		vma->vm_flags &= ~VM_PURE_PRIVATE;
+		if (migrate_flag)
+			migrate_back_legacy_linux(mm, vma);
+	}
+}
+
+void pps_page_construction(struct page* page, struct vm_area_struct* vma,
+	unsigned long address)
+{
+	struct pglist_data* node_data = NODE_DATA(page_to_nid(page));
+	atomic_inc(&node_data->nr_pps_total);
+	atomic_inc(&node_data->nr_present_pte);
+	SetPagePPS(page);
+	page_add_new_anon_rmap(page, vma, address);
+}
+
+void pps_page_destruction(struct page* ppspage, struct vm_area_struct* vma,
+	unsigned long address, int migrate)
+{
+	struct pglist_data* node_data = NODE_DATA(page_to_nid(ppspage));
+	atomic_dec(&node_data->nr_pps_total);
+	if (!migrate)
+		page_remove_rmap(ppspage, vma);
+	ClearPagePPS(ppspage);
+}
Index: linux-2.6.22/mm/mempolicy.c
===================================================================
--- linux-2.6.22.orig/mm/mempolicy.c	2007-08-23 15:26:44.626396072 +0800
+++ linux-2.6.22/mm/mempolicy.c	2007-08-23 15:30:09.563203822 +0800
@@ -1166,7 +1166,8 @@
 		struct vm_area_struct *vma, unsigned long off)
 {
 	unsigned nnodes = nodes_weight(pol->v.nodes);
-	unsigned target = (unsigned)off % nnodes;
+	unsigned target = vma->vm_flags & VM_PURE_PRIVATE ? (off >> 6) % nnodes
+		: (unsigned) off % nnodes;
 	int c;
 	int nid = -1;

Index: linux-2.6.22/mm/migrate.c
===================================================================
--- linux-2.6.22.orig/mm/migrate.c	2007-08-23 15:26:44.658398072 +0800
+++ linux-2.6.22/mm/migrate.c	2007-08-23 15:30:09.567204072 +0800
@@ -117,7 +117,7 @@

 static inline int is_swap_pte(pte_t pte)
 {
-	return !pte_none(pte) && !pte_present(pte) && !pte_file(pte);
+	return !pte_none(pte) && !pte_present(pte) && pte_swapped(pte);
 }

 /*
Index: linux-2.6.22/mm/mincore.c
===================================================================
--- linux-2.6.22.orig/mm/mincore.c	2007-08-23 15:26:44.678399322 +0800
+++ linux-2.6.22/mm/mincore.c	2007-08-23 15:30:09.567204072 +0800
@@ -114,6 +114,13 @@
 			} else
 				present = 0;

+		} else if (pte_unmapped(pte)) {
+			struct page* page = pfn_to_page(pte_pfn(pte));
+			if (PagePPS(page))
+				present = 1;
+			else
+				present = PageUptodate(page);
+
 		} else if (pte_file(pte)) {
 			pgoff = pte_to_pgoff(pte);
 			present = mincore_page(vma->vm_file->f_mapping, pgoff);
Index: linux-2.6.22/mm/mmap.c
===================================================================
--- linux-2.6.22.orig/mm/mmap.c	2007-08-23 15:26:44.698400572 +0800
+++ linux-2.6.22/mm/mmap.c	2007-08-23 15:30:09.567204072 +0800
@@ -230,6 +230,7 @@
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_free(vma_policy(vma));
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
 }
@@ -623,6 +624,7 @@
 			fput(file);
 		mm->map_count--;
 		mpol_free(vma_policy(next));
+		leave_pps(next, 0);
 		kmem_cache_free(vm_area_cachep, next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
@@ -1115,6 +1117,8 @@
 	if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
 		vma->vm_flags &= ~VM_ACCOUNT;

+	enter_pps(mm, vma);
+
 	/* Can addr have changed??
 	 *
 	 * Answer: Yes, several device drivers can do it in their
@@ -1141,6 +1145,7 @@
 			fput(file);
 		}
 		mpol_free(vma_policy(vma));
+		leave_pps(vma, 0);
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1168,6 +1173,7 @@
 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 free_vma:
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
@@ -1745,6 +1751,10 @@

 	/* most fields are the same, copy all, and then fixup */
 	*new = *vma;
+	if (new->vm_flags & VM_PURE_PRIVATE) {
+		new->vm_flags &= ~VM_PURE_PRIVATE;
+		enter_pps(mm, new);
+	}

 	if (new_below)
 		new->vm_end = addr;
@@ -1953,6 +1963,7 @@
 	vma->vm_flags = flags;
 	vma->vm_page_prot = protection_map[flags &
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	enter_pps(mm, vma);
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -2079,6 +2090,10 @@
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
+			if (new_vma->vm_flags & VM_PURE_PRIVATE) {
+				new_vma->vm_flags &= ~VM_PURE_PRIVATE;
+				enter_pps(mm, new_vma);
+			}
 			vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		}
 	}
Index: linux-2.6.22/mm/mprotect.c
===================================================================
--- linux-2.6.22.orig/mm/mprotect.c	2007-08-23 15:26:44.718401822 +0800
+++ linux-2.6.22/mm/mprotect.c	2007-08-23 15:30:09.567204072 +0800
@@ -55,7 +55,7 @@
 			set_pte_at(mm, addr, pte, ptent);
 			lazy_mmu_prot_update(ptent);
 #ifdef CONFIG_MIGRATION
-		} else if (!pte_file(oldpte)) {
+		} else if (pte_swapped(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);

 			if (is_write_migration_entry(entry)) {
Index: linux-2.6.22/mm/page_alloc.c
===================================================================
--- linux-2.6.22.orig/mm/page_alloc.c	2007-08-23 15:26:44.738403072 +0800
+++ linux-2.6.22/mm/page_alloc.c	2007-08-23 15:30:09.567204072 +0800
@@ -598,7 +598,8 @@
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
-			1 << PG_buddy ))))
+			1 << PG_buddy |
+			1 << PG_pps))))
 		bad_page(page);

 	/*
Index: linux-2.6.22/mm/rmap.c
===================================================================
--- linux-2.6.22.orig/mm/rmap.c	2007-08-23 15:26:44.762404572 +0800
+++ linux-2.6.22/mm/rmap.c	2007-08-23 15:30:09.571204322 +0800
@@ -660,6 +660,8 @@
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;

+	BUG_ON(vma->vm_flags & VM_PURE_PRIVATE);
+
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
 		goto out;
@@ -718,7 +720,7 @@
 #endif
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
-		BUG_ON(pte_file(*pte));
+		BUG_ON(!pte_swapped(*pte));
 	} else
 #ifdef CONFIG_MIGRATION
 	if (migration) {
Index: linux-2.6.22/mm/swap_state.c
===================================================================
--- linux-2.6.22.orig/mm/swap_state.c	2007-08-23 15:26:44.782405822 +0800
+++ linux-2.6.22/mm/swap_state.c	2007-08-23 15:30:09.571204322 +0800
@@ -136,6 +136,43 @@
 	INC_CACHE_INFO(del_total);
 }

+int pps_relink_swp(struct page* page, swp_entry_t entry, swp_entry_t** thrash)
+{
+	BUG_ON(!PageLocked(page));
+	ClearPageError(page);
+	if (radix_tree_preload(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN))
+		goto failed;
+	write_lock_irq(&swapper_space.tree_lock);
+	if (radix_tree_insert(&swapper_space.page_tree, entry.val, page))
+		goto preload_failed;
+	if (PageSwapCache(page)) {
+		(**thrash).val = page_private(page);
+		radix_tree_delete(&swapper_space.page_tree, (**thrash).val);
+		(*thrash)++;
+		INC_CACHE_INFO(del_total);
+	} else {
+		page_cache_get(page);
+		SetPageSwapCache(page);
+		total_swapcache_pages++;
+		__inc_zone_page_state(page, NR_FILE_PAGES);
+	}
+	INC_CACHE_INFO(add_total);
+	set_page_private(page, entry.val);
+	SetPageDirty(page);
+	SetPageUptodate(page);
+	write_unlock_irq(&swapper_space.tree_lock);
+	radix_tree_preload_end();
+	return 1;
+
+preload_failed:
+	write_unlock_irq(&swapper_space.tree_lock);
+	radix_tree_preload_end();
+failed:
+	**thrash = entry;
+	(*thrash)++;
+	return 0;
+}
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
Index: linux-2.6.22/mm/swapfile.c
===================================================================
--- linux-2.6.22.orig/mm/swapfile.c	2007-08-23 15:29:55.818344822 +0800
+++ linux-2.6.22/mm/swapfile.c	2007-08-23 15:30:09.571204322 +0800
@@ -501,6 +501,183 @@
 }
 #endif

+static int pps_test_swap_type(struct mm_struct* mm, pmd_t* pmd, pte_t* pte, int
+		type, struct page** ret_page)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	swp_entry_t entry;
+	struct page* page;
+	int result = 1;
+
+	spin_lock(ptl);
+	if (pte_none(*pte))
+		result = 0;
+	else if (!pte_present(*pte) && pte_swapped(*pte)) { // SwappedPTE.
+		entry = pte_to_swp_entry(*pte);
+		if (swp_type(entry) == type)
+			*ret_page = NULL;
+		else
+			result = 0;
+	} else { // UnmappedPTE and (Present, Untouched)PTE.
+		page = pfn_to_page(pte_pfn(*pte));
+		if (!PagePPS(page)) { // The page is a readahead page.
+			if (PageSwapCache(page)) {
+				entry.val = page_private(page);
+				if (swp_type(entry) == type)
+					*ret_page = NULL;
+				else
+					result = 0;
+			} else
+				result = 0;
+		} else if (PageSwapCache(page)) {
+			entry.val = page_private(page);
+			if (swp_type(entry) == type) {
+				page_cache_get(page);
+				*ret_page = page;
+			} else
+				result = 0;
+		} else
+			result = 0;
+	}
+	spin_unlock(ptl);
+	return result;
+}
+
+static int pps_swapoff_scan_ptes(struct mm_struct* mm, struct vm_area_struct*
+		vma, pmd_t* pmd, unsigned long addr, unsigned long end, int type)
+{
+	pte_t *pte;
+	struct page* page = (struct page*) 0xffffffff;
+	swp_entry_t entry;
+	struct pglist_data* node_data;
+
+	pte = pte_offset_map(pmd, addr);
+	do {
+		while (pps_test_swap_type(mm, pmd, pte, type, &page)) {
+			if (page == NULL) {
+				switch (__handle_mm_fault(mm, vma, addr, 0)) {
+				case VM_FAULT_SIGBUS:
+				case VM_FAULT_OOM:
+					return -ENOMEM;
+				case VM_FAULT_MINOR:
+				case VM_FAULT_MAJOR:
+					break;
+				default:
+					BUG();
+				}
+			} else {
+				wait_on_page_locked(page);
+				wait_on_page_writeback(page);
+				lock_page(page);
+				if (!PageSwapCache(page))
+					goto done;
+				else {
+					entry.val = page_private(page);
+					if (swp_type(entry) != type)
+						goto done;
+				}
+				wait_on_page_writeback(page);
+				node_data = NODE_DATA(page_to_nid(page));
+				delete_from_swap_cache(page);
+				atomic_dec(&node_data->nr_swapped_pte);
+done:
+				unlock_page(page);
+				page_cache_release(page);
+				break;
+			}
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(pte);
+	return 0;
+}
+
+static int pps_swapoff_pmd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pud_t* pud, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		ret = pps_swapoff_scan_ptes(mm, vma, pmd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pmd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pud_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pgd_t* pgd, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		ret = pps_swapoff_pmd_range(mm, vma, pud, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pud++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pgd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, int type)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	int ret;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		ret = pps_swapoff_pud_range(mm, vma, pgd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pgd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff(int type)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+	int ret = 0;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+			if (!(vma->vm_flags & VM_PURE_PRIVATE))
+				continue;
+			ret = pps_swapoff_pgd_range(mm, vma, type);
+			if (ret == -ENOMEM)
+				break;
+		}
+		up_read(&mm->mmap_sem);
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+	return ret;
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -694,6 +871,12 @@
 	int reset_overflow = 0;
 	int shmem;

+	// First read all pps pages back; note it's a one-to-one mapping.
+	retval = pps_swapoff(type);
+	if (retval == -ENOMEM) // something was wrong.
+		return -ENOMEM;
+	// Now the remaining pages are shared pages, go ahead!
+
 	/*
 	 * When searching mms for an entry, a good strategy is to
 	 * start at the first mm we freed the previous entry from
@@ -914,16 +1097,20 @@
  */
 static void drain_mmlist(void)
 {
-	struct list_head *p, *next;
+	// struct list_head *p, *next;
 	unsigned int i;

 	for (i = 0; i < nr_swapfiles; i++)
 		if (swap_info[i].inuse_pages)
 			return;
+	/*
+	 * Now init_mm.mmlist is used not only by SwapDevice but also by pps,
+	 * see Documentation/vm_pps.txt.
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
 		list_del_init(p);
 	spin_unlock(&mmlist_lock);
+	*/
 }

 /*
@@ -1796,3 +1983,235 @@
 	spin_unlock(&swap_lock);
 	return ret;
 }
+
+// Copied from scan_swap_map.
+// Parameter: SERIES_LENGTH >= count >= 1.
+static inline unsigned long scan_swap_map_batchly(struct swap_info_struct *si,
+		int type, int count, swp_entry_t avail_swps[SERIES_BOUND])
+{
+	unsigned long offset, last_in_cluster, result = 0;
+	int latency_ration = LATENCY_LIMIT;
+
+	si->flags += SWP_SCANNING;
+	if (unlikely(!si->cluster_nr)) {
+		si->cluster_nr = SWAPFILE_CLUSTER - 1;
+		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER)
+			goto lowest;
+		spin_unlock(&swap_lock);
+
+		offset = si->lowest_bit;
+		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
+
+		/* Locate the first empty (unaligned) cluster */
+		for (; last_in_cluster <= si->highest_bit; offset++) {
+			if (si->swap_map[offset])
+				last_in_cluster = offset + SWAPFILE_CLUSTER;
+			else if (offset == last_in_cluster) {
+				spin_lock(&swap_lock);
+				si->cluster_next = offset-SWAPFILE_CLUSTER+1;
+				goto cluster;
+			}
+			if (unlikely(--latency_ration < 0)) {
+				cond_resched();
+				latency_ration = LATENCY_LIMIT;
+			}
+		}
+		spin_lock(&swap_lock);
+		goto lowest;
+	}
+
+	si->cluster_nr--;
+cluster:
+	offset = si->cluster_next;
+	if (offset > si->highest_bit)
+lowest:		offset = si->lowest_bit;
+checks:	if (!(si->flags & SWP_WRITEOK))
+		goto no_page;
+	if (!si->highest_bit)
+		goto no_page;
+	if (!si->swap_map[offset]) {
+		int i;
+		for (i = 0; !si->swap_map[offset] && (result != count) &&
+				offset <= si->highest_bit; offset++, i++) {
+			si->swap_map[offset] = 1;
+			avail_swps[result++] = swp_entry(type, offset);
+		}
+		si->cluster_next = offset;
+		si->cluster_nr -= i;
+		if (offset - i == si->lowest_bit)
+			si->lowest_bit += i;
+		if (offset == si->highest_bit)
+			si->highest_bit -= i;
+		si->inuse_pages += i;
+		if (si->inuse_pages == si->pages) {
+			si->lowest_bit = si->max;
+			si->highest_bit = 0;
+		}
+		if (result == count)
+			goto no_page;
+	}
+
+	spin_unlock(&swap_lock);
+	while (++offset <= si->highest_bit) {
+		if (!si->swap_map[offset]) {
+			spin_lock(&swap_lock);
+			goto checks;
+		}
+		if (unlikely(--latency_ration < 0)) {
+			cond_resched();
+			latency_ration = LATENCY_LIMIT;
+		}
+	}
+	spin_lock(&swap_lock);
+	goto lowest;
+
+no_page:
+	avail_swps[result].val = 0;
+	si->flags -= SWP_SCANNING;
+	return result;
+}
+
+void swap_free_batchly(swp_entry_t entries[2 * SERIES_BOUND])
+{
+	struct swap_info_struct* p;
+	int i;
+
+	spin_lock(&swap_lock);
+	for (i = 0; entries[i].val != 0; i++) {
+		p = &swap_info[swp_type(entries[i])];
+		BUG_ON(p->swap_map[swp_offset(entries[i])] != 1);
+		if (p)
+			swap_entry_free(p, swp_offset(entries[i]));
+	}
+	spin_unlock(&swap_lock);
+}
+
+// parameter SERIES_LENGTH >= count >= 1.
+int swap_alloc_batchly(int count, swp_entry_t avail_swps[SERIES_BOUND], int
+		end_prio)
+{
+	int result = 0, type = swap_list.head, orig_count = count;
+	struct swap_info_struct* si;
+	spin_lock(&swap_lock);
+	if (nr_swap_pages <= 0)
+		goto done;
+	for (si = &swap_info[type]; type >= 0 && si->prio > end_prio;
+			type = si->next, si = &swap_info[type]) {
+		if (!si->highest_bit)
+			continue;
+		if (!(si->flags & SWP_WRITEOK))
+			continue;
+		result = scan_swap_map_batchly(si, type, count, avail_swps);
+		nr_swap_pages -= result;
+		avail_swps += result;
+		if (result == count) {
+			count = 0;
+			break;
+		}
+		count -= result;
+	}
+done:
+	avail_swps[0].val = 0;
+	spin_unlock(&swap_lock);
+	return orig_count - count;
+}
+
+// parameter SERIES_LENGTH >= count >= 1.
+void swap_alloc_around_nail(swp_entry_t nail_swp, int count, swp_entry_t
+		avail_swps[SERIES_BOUND])
+{
+	int i, result = 0, type, offset;
+	struct swap_info_struct *si = &swap_info[swp_type(nail_swp)];
+	spin_lock(&swap_lock);
+	if (nr_swap_pages <= 0)
+		goto done;
+	BUG_ON(nail_swp.val == 0);
+	// Always allocate from high priority (quicker) SwapDevice.
+	if (si->prio < swap_info[swap_list.head].prio) {
+		spin_unlock(&swap_lock);
+		result = swap_alloc_batchly(count, avail_swps, si->prio);
+		avail_swps += result;
+		if (result == count)
+			return;
+		count -= result;
+		spin_lock(&swap_lock);
+	}
+	type = swp_type(nail_swp);
+	offset = swp_offset(nail_swp);
+	result = 0;
+	if (!si->highest_bit)
+		goto done;
+	if (!(si->flags & SWP_WRITEOK))
+		goto done;
+	for (i = max_t(int, offset - 32, si->lowest_bit); i <= min_t(int,
+			offset + 32, si->highest_bit) && count != 0; i++) {
+		if (!si->swap_map[i]) {
+			count--;
+			avail_swps[result++] = swp_entry(type, i);
+			si->swap_map[i] = 1;
+		}
+	}
+	if (result != 0) {
+		nr_swap_pages -= result;
+		si->inuse_pages += result;
+		if (swp_offset(avail_swps[0]) == si->lowest_bit)
+			si->lowest_bit = swp_offset(avail_swps[result-1]) + 1;
+		if (swp_offset(avail_swps[result - 1]) == si->highest_bit)
+			si->highest_bit = swp_offset(avail_swps[0]) - 1;
+		if (si->inuse_pages == si->pages) {
+			si->lowest_bit = si->max;
+			si->highest_bit = 0;
+		}
+	}
+done:
+	spin_unlock(&swap_lock);
+	avail_swps[result].val = 0;
+}
+
+// parameter SERIES_LENGTH >= count >= 1.
+// avail_swps is set only when success.
+int swap_try_alloc_batchly(swp_entry_t central_swp, int count, swp_entry_t
+		avail_swps[SERIES_BOUND])
+{
+	int i, result = 0, type, offset, j = 0;
+	struct swap_info_struct *si = &swap_info[swp_type(central_swp)];
+	BUG_ON(central_swp.val == 0);
+	spin_lock(&swap_lock);
+	// Always allocate from high priority (quicker) SwapDevice.
+	if (nr_swap_pages <= 0 || si->prio < swap_info[swap_list.head].prio)
+		goto done;
+	type = swp_type(central_swp);
+	offset = swp_offset(central_swp);
+	if (!si->highest_bit)
+		goto done;
+	if (!(si->flags & SWP_WRITEOK))
+		goto done;
+	// All-or-nothing: only commit if exactly 'count' free slots are found.
+	for (i = max_t(int, offset - 32, si->lowest_bit); i <= min_t(int,
+			offset + 32, si->highest_bit) && j != count; i++) {
+		if (!si->swap_map[i]) {
+			avail_swps[j++] = swp_entry(type, i);
+			si->swap_map[i] = 1;
+		}
+	}
+	if (j == count) {
+		nr_swap_pages -= count;
+		avail_swps[j].val = 0;
+		si->inuse_pages += j;
+		if (swp_offset(avail_swps[0]) == si->lowest_bit)
+			si->lowest_bit = swp_offset(avail_swps[count - 1]) + 1;
+		if (swp_offset(avail_swps[count - 1]) == si->highest_bit)
+			si->highest_bit = swp_offset(avail_swps[0]) - 1;
+		if (si->inuse_pages == si->pages) {
+			si->lowest_bit = si->max;
+			si->highest_bit = 0;
+		}
+		result = 1;
+	} else {
+		for (i = 0; i < j; i++)
+			si->swap_map[swp_offset(avail_swps[i])] = 0;
+	}
+done:
+	spin_unlock(&swap_lock);
+	return result;
+}
Index: linux-2.6.22/mm/vmscan.c
===================================================================
--- linux-2.6.22.orig/mm/vmscan.c	2007-08-23 15:26:44.826408572 +0800
+++ linux-2.6.22/mm/vmscan.c	2007-08-23 16:25:37.003155822 +0800
@@ -66,6 +66,10 @@
 	int swappiness;

 	int all_unreclaimable;
+
+	/* pps control command. See Documentation/vm_pps.txt. */
+	int reclaim_node;
+	int is_kppsd;
 };

 /*
@@ -1097,6 +1101,746 @@
 	return ret;
 }

+// pps fields, see Documentation/vm_pps.txt.
+static int accelerate_kppsd = 0;
+static wait_queue_head_t kppsd_wait;
+
+struct series_t {
+	pte_t orig_ptes[SERIES_LENGTH];
+	pte_t* ptes[SERIES_LENGTH];
+	swp_entry_t swps[SERIES_LENGTH];
+	struct page* pages[SERIES_LENGTH];
+	int stages[SERIES_LENGTH];
+	unsigned long addrs[SERIES_LENGTH];
+	int series_length;
+	int series_stage;
+};
+
+/*
+ * Here we take a snapshot of an (Unmapped)PTE-page pair for the later stages.
+ * Some fields can change after the snapshot is taken, so it's necessary to
+ * re-verify them in pps_stageX. See the <Concurrent Racers of pps> section of
+ * Documentation/vm_pps.txt.
+ *
+ * For example, an UnmappedPTE/SwappedPTE can be remapped to a PresentPTE and
+ * page->private can be freed after the snapshot, but a PresentPTE can't shift
+ * to an UnmappedPTE and the page can't be (re-)allocated a swap entry.
+ */
+static int get_series_stage(struct series_t* series, pte_t* pte, unsigned long
+		addr, int index)
+{
+	struct page* page = NULL;
+	unsigned long flags;
+	series->addrs[index] = addr;
+	series->orig_ptes[index] = *pte;
+	series->ptes[index] = pte;
+	if (pte_present(series->orig_ptes[index])) {
+		page = pfn_to_page(pte_pfn(series->orig_ptes[index]));
+		if (page == ZERO_PAGE(addr)) // reserved page is excluded.
+			return -1;
+		if (pte_young(series->orig_ptes[index])) {
+			return 1;
+		} else
+			return 2;
+	} else if (pte_unmapped(series->orig_ptes[index])) {
+		page = pfn_to_page(pte_pfn(series->orig_ptes[index]));
+		series->pages[index] = page;
+		flags = page->flags;
+		series->swps[index].val = page_private(page);
+		if (series->swps[index].val == 0)
+			return 3;
+		if (!test_bit(PG_pps, &flags)) { // readahead page.
+			if (test_bit(PG_locked, &flags)) // read IO in flight.
+				return 4;
+			// Reclaim the page whether it hit a read-IO error or
+			// not (PG_uptodate set or not).
+			return 5;
+		} else {
+			if (test_bit(PG_writeback, &flags)) // write IO in flight.
+				return 4;
+			if (!test_bit(PG_dirty, &flags))
+				return 5;
+			// Either the page hit a write-IO error, or the dirty
+			// page linked to a SwapEntry should be relinked.
+			return 3;
+		}
+	} else if (pte_swapped(series->orig_ptes[index])) { // SwappedPTE
+		series->swps[index] =
+			pte_to_swp_entry(series->orig_ptes[index]);
+		return 6;
+	} else // NullPTE
+		return 0;
+}
+
+static void find_series(struct series_t* series, pte_t** start, unsigned long*
+		addr, unsigned long end)
+{
+	int i;
+	int series_stage = get_series_stage(series, (*start)++, *addr, 0);
+	*addr += PAGE_SIZE;
+
+	for (i = 1; i < SERIES_LENGTH && *addr < end; i++, (*start)++,
+		*addr += PAGE_SIZE) {
+		if (series_stage != get_series_stage(series, *start, *addr, i))
+			break;
+	}
+	series->series_stage = series_stage;
+	series->series_length = i;
+}
+
+#define DFTLB_CAPACITY 32
+struct {
+	struct mm_struct* mm;
+	int vma_index;
+	struct vm_area_struct* vma[DFTLB_CAPACITY];
+	pmd_t* pmd[DFTLB_CAPACITY];
+	unsigned long start[DFTLB_CAPACITY];
+	unsigned long end[DFTLB_CAPACITY];
+} dftlb_tasks = { 0 };
+
+// The prototype of this function matches the "func" parameter of "int
+// smp_call_function (void (*func) (void *info), void *info, int retry, int
+// wait);" in include/linux/smp.h of 2.6.16.29. It is called with info = NULL.
+void flush_tlb_tasks(void* data)
+{
+#ifdef CONFIG_X86
+	local_flush_tlb();
+#else
+	int i;
+	for (i = 0; i < dftlb_tasks.vma_index; i++) {
+		// smp::local_flush_tlb_range(dftlb_tasks.{vma, start, end});
+	}
+#endif
+}
+
+static void __pps_stage2(void)
+{
+	int anon_rss = 0, file_rss = 0, unmapped_pte = 0, present_pte = 0, i;
+	unsigned long addr;
+	spinlock_t* ptl = pte_lockptr(dftlb_tasks.mm, dftlb_tasks.pmd[0]);
+	pte_t pte_orig, pte_unmapped, *pte;
+	struct page* page;
+	struct vm_area_struct* vma;
+	struct pglist_data* node_data = NULL;
+
+	spin_lock(ptl);
+	for (i = 0; i < dftlb_tasks.vma_index; i++) {
+		vma = dftlb_tasks.vma[i];
+		addr = dftlb_tasks.start[i];
+		if (i != 0 && dftlb_tasks.pmd[i] != dftlb_tasks.pmd[i - 1]) {
+			pte_unmap_unlock(pte, ptl);
+			ptl = pte_lockptr(dftlb_tasks.mm, dftlb_tasks.pmd[i]);
+			spin_lock(ptl);
+		}
+		pte = pte_offset_map(dftlb_tasks.pmd[i], addr);
+		for (; addr != dftlb_tasks.end[i]; addr += PAGE_SIZE, pte++) {
+			if (node_data != NULL && node_data !=
+					NODE_DATA(numa_addr_to_nid(vma, addr)))
+			{
+				atomic_add(unmapped_pte,
+						&node_data->nr_unmapped_pte);
+				atomic_sub(present_pte,
+						&node_data->nr_present_pte);
+				// Reset the deltas so they aren't applied
+				// again to the next node.
+				unmapped_pte = present_pte = 0;
+			}
+			node_data = NODE_DATA(numa_addr_to_nid(vma, addr));
+			pte_orig = *pte;
+			if (pte_young(pte_orig))
+				continue;
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				pte_unmapped = pte_mkunmapped(pte_orig);
+			} else
+				pte_unmapped = __pte(0);
+			// We're safe if target CPU supports two conditions
+			// listed in dftlb section.
+			if (cmpxchg(&pte->pte_low, pte_orig.pte_low,
+						pte_unmapped.pte_low) !=
+					pte_orig.pte_low)
+				continue;
+			page = pfn_to_page(pte_pfn(pte_orig));
+			if (pte_dirty(pte_orig))
+				set_page_dirty(page);
+			update_hiwater_rss(dftlb_tasks.mm);
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				// anon_rss--, page_remove_rmap(page, vma) and
+				// page_cache_release(page) are done at stage5.
+				unmapped_pte++;
+				present_pte++;
+			} else {
+				page_remove_rmap(page, vma);
+				if (PageAnon(page))
+					anon_rss--;
+				else
+					file_rss--;
+				page_cache_release(page);
+			}
+		}
+	}
+	atomic_add(unmapped_pte, &node_data->nr_unmapped_pte);
+	atomic_sub(present_pte, &node_data->nr_present_pte);
+	pte_unmap_unlock(pte, ptl);
+	add_mm_counter(dftlb_tasks.mm, anon_rss, anon_rss);
+	add_mm_counter(dftlb_tasks.mm, file_rss, file_rss);
+}
+
+static void start_dftlb(struct mm_struct* mm)
+{
+	dftlb_tasks.mm = mm;
+	BUG_ON(dftlb_tasks.vma_index != 0);
+	BUG_ON(dftlb_tasks.vma[0] != NULL);
+}
+
+static void end_dftlb(void)
+{
+	// In fact, only those CPUs which have a trace in
+	// dftlb_tasks.mm->cpu_vm_mask should be paused by on_each_cpu, but
+	// current on_each_cpu doesn't support it.
+	if (dftlb_tasks.vma_index != 0 || dftlb_tasks.vma[0] != NULL) {
+		on_each_cpu(flush_tlb_tasks, NULL, 0, 1);
+
+		if (dftlb_tasks.vma_index != DFTLB_CAPACITY)
+			dftlb_tasks.vma_index++;
+		// Convert PresentPTE to UnmappedPTE batchly -- dftlb.
+		__pps_stage2();
+		dftlb_tasks.vma_index = 0;
+		memset(dftlb_tasks.vma, 0, sizeof(dftlb_tasks.vma));
+	}
+}
+
+static void fill_in_tlb_tasks(struct vm_area_struct* vma, pmd_t* pmd, unsigned
+		long addr, unsigned long end)
+{
+	// If the target CPU doesn't support dftlb, flush and unmap the
+	// PresentPTEs here instead:
+	// flush_tlb_range(vma, addr, end); //<-- and unmap PresentPTEs.
+	// return;
+
+	// dftlb: place the unmapping task to a static region -- dftlb_tasks,
+	// if it's full, flush them batchly in end_dftlb().
+	if (dftlb_tasks.vma[dftlb_tasks.vma_index] != NULL &&
+			dftlb_tasks.vma[dftlb_tasks.vma_index] == vma &&
+			dftlb_tasks.pmd[dftlb_tasks.vma_index] == pmd &&
+			dftlb_tasks.end[dftlb_tasks.vma_index] == addr) {
+		dftlb_tasks.end[dftlb_tasks.vma_index] = end;
+	} else {
+		if (dftlb_tasks.vma[dftlb_tasks.vma_index] != NULL)
+			dftlb_tasks.vma_index++;
+		if (dftlb_tasks.vma_index == DFTLB_CAPACITY)
+			end_dftlb();
+		dftlb_tasks.vma[dftlb_tasks.vma_index] = vma;
+		dftlb_tasks.pmd[dftlb_tasks.vma_index] = pmd;
+		dftlb_tasks.start[dftlb_tasks.vma_index] = addr;
+		dftlb_tasks.end[dftlb_tasks.vma_index] = end;
+	}
+}
+
+static void pps_stage1(spinlock_t* ptl, struct vm_area_struct* vma, unsigned
+		long addr, struct series_t* series)
+{
+	int i;
+	spin_lock(ptl);
+	for (i = 0; i < series->series_length; i++)
+		ptep_clear_flush_young(vma, addr + i * PAGE_SIZE,
+				series->ptes[i]);
+	spin_unlock(ptl);
+}
+
+static void pps_stage2(struct vm_area_struct* vma, pmd_t* pmd, struct series_t*
+		series)
+{
+	fill_in_tlb_tasks(vma, pmd, series->addrs[0],
+			series->addrs[series->series_length - 1] + PAGE_SIZE);
+}
+
+// Which of realloc_pages can be re-allocated around nail_swp?
+static int calc_realloc(struct series_t* series, swp_entry_t nail_swp, int
+		realloc_pages[SERIES_BOUND], int remain_pages[SERIES_BOUND])
+{
+	int i, count = 0;
+	int swap_type = swp_type(nail_swp);
+	int swap_offset = swp_offset(nail_swp);
+	swp_entry_t temp;
+	for (i = 0; realloc_pages[i] != -1; i++) {
+		temp = series->swps[realloc_pages[i]];
+		if (temp.val != 0
+				// The swap entry is close to the nail. Here
+				// 'close' means close on disk, so the swapfile
+				// layer should provide a proper closeness test.
+				&& swp_type(temp) == swap_type &&
+				abs(swp_offset(temp) - swap_offset) < 32)
+			continue;
+		remain_pages[count++] = realloc_pages[i];
+	}
+	remain_pages[count] = -1;
+	return count;
+}
+
+static int realloc_around_nails(struct series_t* series, swp_entry_t nail_swp,
+		int realloc_pages[SERIES_BOUND],
+		int remain_pages[SERIES_BOUND],
+		swp_entry_t** thrash_cursor, int* boost, int tryit)
+{
+	int i, need_count;
+	swp_entry_t avail_swps[SERIES_BOUND];
+
+	need_count = calc_realloc(series, nail_swp, realloc_pages,
+			remain_pages);
+	if (!need_count)
+		return 0;
+	*boost = 0;
+	if (tryit) {
+		if (!swap_try_alloc_batchly(nail_swp, need_count, avail_swps))
+			return need_count;
+	} else
+		swap_alloc_around_nail(nail_swp, need_count, avail_swps);
+	for (i = 0; avail_swps[i].val != 0; i++) {
+		if (!pps_relink_swp(series->pages[remain_pages[(*boost)++]],
+					avail_swps[i], thrash_cursor)) {
+			for (++i; avail_swps[i].val != 0; i++) {
+				**thrash_cursor = avail_swps[i];
+				(*thrash_cursor)++;
+			}
+			return -1;
+		}
+	}
+	return need_count - *boost;
+}
+
+static void pps_stage3(struct series_t* series,
+		swp_entry_t nail_swps[SERIES_BOUND + 1],
+		int realloc_pages[SERIES_BOUND])
+{
+	int i, j, remain, boost = 0;
+	swp_entry_t thrash[SERIES_BOUND * 2];
+	swp_entry_t* thrash_cursor = &thrash[0];
+	int rotate_buffers[SERIES_BOUND * 2];
+	int *realloc_cursor = realloc_pages, *rotate_cursor;
+	swp_entry_t avail_swps[SERIES_BOUND];
+
+	// 1) realloc swap entries surrounding nail_ptes.
+	for (i = 0; nail_swps[i].val != 0; i++) {
+		rotate_cursor = i % 2 == 0 ? &rotate_buffers[0] :
+			&rotate_buffers[SERIES_BOUND];
+		remain = realloc_around_nails(series, nail_swps[i],
+				realloc_cursor, rotate_cursor, &thrash_cursor,
+				&boost, 0);
+		realloc_cursor = rotate_cursor + boost;
+		if (remain == 0 || remain == -1)
+			goto done;
+	}
+
+	// 2) allocate swap entries for remaining realloc_pages.
+	rotate_cursor = i % 2 == 0 ? &rotate_buffers[0] :
+		&rotate_buffers[SERIES_BOUND];
+	for (i = 0; *(realloc_cursor + i) != -1; i++) {
+		swp_entry_t entry = series->swps[*(realloc_cursor + i)];
+		if (entry.val == 0)
+			continue;
+		remain = realloc_around_nails(series, entry, realloc_cursor,
+				rotate_cursor, &thrash_cursor, &boost, 1);
+		if (remain == 0 || remain == -1)
+			goto done;
+	}
+	// Currently, priority (int) 0xf0000000 is safe enough to try to
+	// allocate from all SwapDevices.
+	swap_alloc_batchly(i, avail_swps, (int) 0xf0000000);
+	for (i = 0, j = 0; avail_swps[i].val != 0; i++, j++) {
+		if (!pps_relink_swp(series->pages[*(realloc_cursor + j)],
+					avail_swps[i], &thrash_cursor)) {
+			for (++i; avail_swps[i].val != 0; i++) {
+				*thrash_cursor = avail_swps[i];
+				thrash_cursor++;
+			}
+			break;
+		}
+	}
+
+done:
+	(*thrash_cursor).val = 0;
+	swap_free_batchly(thrash);
+}
+
+/*
+ * A mini version pageout().
+ *
+ * Current swap space can't commit multiple pages together:(
+ */
+static void pps_stage4(struct page* page)
+{
+	int res;
+	struct address_space* mapping = &swapper_space;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!may_write_to_queue(mapping->backing_dev_info))
+		goto unlock_page;
+	if (!PageSwapCache(page))
+		goto unlock_page;
+	if (!clear_page_dirty_for_io(page))
+		goto unlock_page;
+	page_cache_get(page);
+	SetPageReclaim(page);
+	res = swap_writepage(page, &wbc); // << page is unlocked here.
+	if (res < 0) {
+		handle_write_error(mapping, page, res);
+		ClearPageReclaim(page);
+		page_cache_release(page);
+		return;
+	}
+	inc_zone_page_state(page, NR_VMSCAN_WRITE);
+	if (!PageWriteback(page))
+		ClearPageReclaim(page);
+	page_cache_release(page);
+	return;
+
+unlock_page:
+	unlock_page(page);
+}
+
+static int pps_stage5(spinlock_t* ptl, struct vm_area_struct* vma, struct
+		mm_struct* mm, struct series_t* series, int index, struct
+		pagevec* freed_pvec)
+{
+	swp_entry_t entry;
+	pte_t pte_swp;
+	struct page* page = series->pages[index];
+	struct pglist_data* node_data = NODE_DATA(page_to_nid(page));
+
+	if (TestSetPageLocked(page))
+		goto failed;
+	if (!PageSwapCache(page))
+		goto unlock_page;
+	BUG_ON(PageWriteback(page));
+	/* We're racing with get_user_pages. Copy from remove_mapping(). */
+	if (page_count(page) > 2)
+		goto unlock_page;
+	smp_rmb();
+	if (unlikely(PageDirty(page)))
+		goto unlock_page;
+	/* We're racing with get_user_pages. END */
+	spin_lock(ptl);
+	if (!pte_same(*series->ptes[index], series->orig_ptes[index])) {
+		spin_unlock(ptl);
+		goto unlock_page;
+	}
+	entry.val = page_private(page);
+	pte_swp = swp_entry_to_pte(entry);
+	set_pte_at(mm, series->addrs[index], series->ptes[index], pte_swp);
+	add_mm_counter(mm, anon_rss, -1);
+	if (PagePPS(page)) {
+		swap_duplicate(entry);
+		pps_page_destruction(page, vma, series->addrs[index], 0);
+		atomic_dec(&node_data->nr_unmapped_pte);
+		atomic_inc(&node_data->nr_swapped_pte);
+	} else
+		page_cache_get(page);
+	delete_from_swap_cache(page);
+	spin_unlock(ptl);
+	unlock_page(page);
+
+	if (!pagevec_add(freed_pvec, page))
+		__pagevec_release_nonlru(freed_pvec);
+	return 1;
+
+unlock_page:
+	unlock_page(page);
+failed:
+	return 0;
+}
+
+static void find_series_pgdata(struct series_t* series, pte_t** start, unsigned
+		long* addr, unsigned long end)
+{
+	int i;
+
+	for (i = 0; i < SERIES_LENGTH && *addr < end; i++, (*start)++, *addr +=
+			PAGE_SIZE)
+		series->stages[i] = get_series_stage(series, *start, *addr, i);
+	series->series_length = i;
+}
+
+// pps stages 3 -- 5.
+static unsigned long pps_shrink_pgdata(struct scan_control* sc, struct
+		series_t* series, struct mm_struct* mm, struct vm_area_struct*
+		vma, struct pagevec* freed_pvec, spinlock_t* ptl)
+{
+	int i, nr_nail = 0, nr_realloc = 0;
+	unsigned long nr_reclaimed = 0;
+	struct pglist_data* node_data = NODE_DATA(sc->reclaim_node);
+	int realloc_pages[SERIES_BOUND];
+	swp_entry_t nail_swps[SERIES_BOUND + 1], prev, next;
+
+	// 1) Distinguish nail swap entries from realloc candidates.
+	for (i = 0; i < series->series_length; i++) {
+		switch (series->stages[i]) {
+			case -1 ... 2:
+				break;
+			case 5:
+				nr_reclaimed += pps_stage5(ptl, vma, mm,
+						series, i, freed_pvec);
+				// Fall through!
+			case 4:
+			case 6:
+				nail_swps[nr_nail++] = series->swps[i];
+				break;
+			case 3:
+				// NOTE: we lock all realloc-pages here, which
+				// simplifies our code. But note there is no
+				// lock order giving the former page of a
+				// series priority over the later; currently
+				// this is only safe within pps.
+				if (!TestSetPageLocked(series->pages[i]))
+					realloc_pages[nr_realloc++] = i;
+				break;
+		}
+	}
+	realloc_pages[nr_realloc] = -1;
+
+	/* 2) Series continuity rules.
+	 * In most cases, the first allocation from a SwapDevice has the best
+	 * continuity, so our principles are
+	 * A) don't destroy the continuity of the remaining series.
+	 * B) don't propagate a destroyed series to others!
+	 */
+	prev = series->swps[0];
+	if (prev.val != 0) {
+		for (i = 1; i < series->series_length; i++, prev = next) {
+			next = series->swps[i];
+			if (next.val == 0)
+				break;
+			if (swp_type(prev) != swp_type(next))
+				break;
+			if (abs(swp_offset(prev) - swp_offset(next)) > 2)
+				break;
+		}
+		if (i == series->series_length)
+			// The series has the best continuity, flush it
+			// directly.
+			goto flush_it;
+	}
+	/*
+	 * last_nail_swp represents the continuity of the former series, which
+	 * may have been re-positioned somewhere else due to SwapDevice
+	 * shortage, so according to the rules, last_nail_swp should be placed
+	 * at the tail of nail_swps, not the head! It's IMPORTANT!
+	 */
+	if (node_data->last_nail_addr != 0) {
+		// Reset nail if it's too far from us.
+		if (series->addrs[0] - node_data->last_nail_addr > 8 *
+				PAGE_SIZE)
+			node_data->last_nail_addr = 0;
+	}
+	if (node_data->last_nail_addr != 0)
+		nail_swps[nr_nail++] = swp_entry(node_data->last_nail_swp_type,
+				node_data->last_nail_swp_offset);
+	nail_swps[nr_nail].val = 0;
+
+	// 3) nail arithmetic and flush them.
+	if (sc->may_swap && nr_realloc != 0)
+		pps_stage3(series, nail_swps, realloc_pages);
+flush_it:
+	if (sc->may_writepage && (sc->gfp_mask & (__GFP_FS | __GFP_IO))) {
+		for (i = 0; i < nr_realloc; i++)
+			// pages are unlocked in pps_stage4 >> swap_writepage.
+			pps_stage4(series->pages[realloc_pages[i]]);
+	} else {
+		for (i = 0; i < nr_realloc; i++)
+			unlock_page(series->pages[realloc_pages[i]]);
+	}
+
+	// 4) boost last_nail_swp.
+	for (i = series->series_length - 1; i >= 0; i--) {
+		pte_t pte = *series->ptes[i];
+		if (pte_none(pte))
+			continue;
+		else if ((!pte_present(pte) && pte_unmapped(pte)) ||
+				pte_present(pte)) {
+			struct page* page = pfn_to_page(pte_pfn(pte));
+			nail_swps[0].val = page_private(page);
+			if (nail_swps[0].val == 0)
+				continue;
+			node_data->last_nail_swp_type = swp_type(nail_swps[0]);
+			node_data->last_nail_swp_offset =
+				swp_offset(nail_swps[0]);
+		} else if (pte_swapped(pte)) {
+			nail_swps[0] = pte_to_swp_entry(pte);
+			node_data->last_nail_swp_type = swp_type(nail_swps[0]);
+			node_data->last_nail_swp_offset =
+				swp_offset(nail_swps[0]);
+		}
+		node_data->last_nail_addr = series->addrs[i];
+		break;
+	}
+
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_scan_ptes(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned
+		long addr, unsigned long end)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	pte_t* pte = pte_offset_map(pmd, addr);
+	struct series_t series;
+	unsigned long nr_reclaimed = 0;
+	struct pagevec freed_pvec;
+	pagevec_init(&freed_pvec, 1);
+
+	do {
+		memset(&series, 0, sizeof(struct series_t));
+		if (sc->is_kppsd) {
+			find_series(&series, &pte, &addr, end);
+			BUG_ON(series.series_length == 0);
+			switch (series.series_stage) {
+				case 1: // PresentPTE -- untouched PTE.
+					pps_stage1(ptl, vma, addr, &series);
+					break;
+				case 2: // untouched PTE -- UnmappedPTE.
+					pps_stage2(vma, pmd, &series);
+					break;
+				case 3 ... 5:
+	/* We can collect unmapped_age defined in <stage definition> here by
+	 * the scanning count of global kppsd.
+	spin_lock(ptl);
+	for (i = 0; i < series.series_length; i++) {
+		if (pte_unmapped(series.ptes[i]))
+			((struct pps_page*) series.pages[i])->unmapped_age++;
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	*/
+					break;
+			}
+		} else {
+			find_series_pgdata(&series, &pte, &addr, end);
+			BUG_ON(series.series_length == 0);
+			nr_reclaimed += pps_shrink_pgdata(sc, &series, mm, vma,
+					&freed_pvec, ptl);
+		}
+	} while (addr < end);
+	pte_unmap(pte);
+	if (pagevec_count(&freed_pvec))
+		__pagevec_release_nonlru(&freed_pvec);
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_pmd_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pud_t* pud, unsigned
+		long addr, unsigned long end)
+{
+	unsigned long next;
+	unsigned long nr_reclaimed = 0;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		nr_reclaimed += shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_pud_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned
+		long addr, unsigned long end)
+{
+	unsigned long next;
+	unsigned long nr_reclaimed = 0;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		nr_reclaimed += shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_pgd_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma)
+{
+	unsigned long addr, end, next;
+	unsigned long nr_reclaimed = 0;
+	pgd_t* pgd;
+#define sppr(from, to) \
+	pgd = pgd_offset(mm, from); \
+	do { \
+		next = pgd_addr_end(addr, to); \
+		if (pgd_none_or_clear_bad(pgd)) \
+			continue; \
+		nr_reclaimed+=shrink_pvma_pud_range(sc,mm,vma,pgd,from,next); \
+	} while (pgd++, from = next, from != to);
+
+	if (sc->is_kppsd) {
+		addr = vma->vm_start;
+		end = vma->vm_end;
+		sppr(addr, end)
+	} else {
+#ifdef CONFIG_NUMA
+		unsigned long start = end = -1;
+		// Enumerate all ptes of the memory-inode according to start
+		// and end, call sppr(start, end).
+#else
+		addr = vma->vm_start;
+		end = vma->vm_end;
+		sppr(addr, end)
+#endif
+	}
+#undef sppr
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_private_vma(struct scan_control* sc)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+	unsigned long nr_reclaimed = 0;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		if (down_read_trylock(&mm->mmap_sem)) {
+			if (sc->is_kppsd) {
+				start_dftlb(mm);
+			} else {
+				struct pglist_data* node_data =
+					NODE_DATA(sc->reclaim_node);
+				node_data->last_nail_addr = 0;
+			}
+			for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+				// More tasks could be done by kppsd; see the
+				// <New core daemon -- kppsd> section.
+				if (!(vma->vm_flags & VM_PURE_PRIVATE))
+					continue;
+				if (vma->vm_flags & VM_LOCKED)
+					continue;
+				nr_reclaimed+=shrink_pvma_pgd_range(sc,mm,vma);
+			}
+			if (sc->is_kppsd)
+				end_dftlb();
+			up_read(&mm->mmap_sem);
+		}
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+	return nr_reclaimed;
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
@@ -1131,6 +1875,8 @@
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
 		.swappiness = vm_swappiness,
+		.reclaim_node = pgdat->node_id,
+		.is_kppsd = 0,
 	};
 	/*
 	 * temp_priority is used to remember the scanning priority at which
@@ -1144,6 +1890,11 @@
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);

+	if (pgdat->nr_present_pte.counter > pgdat->nr_unmapped_pte.counter)
+		wake_up(&kppsd_wait);
+	accelerate_kppsd++;
+	nr_reclaimed += shrink_private_vma(&sc);
+
 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;

@@ -1729,3 +2480,33 @@
 	return __zone_reclaim(zone, gfp_mask, order);
 }
 #endif
+
+static int kppsd(void* p)
+{
+	struct task_struct *tsk = current;
+	struct scan_control default_sc;
+	DEFINE_WAIT(wait);
+	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	default_sc.gfp_mask = GFP_KERNEL;
+	default_sc.may_swap = 1;
+	default_sc.reclaim_node = -1;
+	default_sc.is_kppsd = 1;
+
+	while (1) {
+		try_to_freeze();
+ 		accelerate_kppsd >>= 1;
+		wait_event_timeout(kppsd_wait, accelerate_kppsd != 0,
+				msecs_to_jiffies(16000));
+		shrink_private_vma(&default_sc);
+	}
+	return 0;
+}
+
+static int __init kppsd_init(void)
+{
+	init_waitqueue_head(&kppsd_wait);
+	kthread_run(kppsd, NULL, "kppsd");
+	return 0;
+}
+
+module_init(kppsd_init)
Index: linux-2.6.22/mm/vmstat.c
===================================================================
--- linux-2.6.22.orig/mm/vmstat.c	2007-08-23 15:26:44.854410322 +0800
+++ linux-2.6.22/mm/vmstat.c	2007-08-23 15:30:09.575204572 +0800
@@ -609,6 +609,17 @@
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
+	seq_printf(m,
+			"\n------------------------"
+			"\n  nr_pps_total:       %i"
+			"\n  nr_present_pte:     %i"
+			"\n  nr_unmapped_pte:    %i"
+			"\n  nr_swapped_pte:     %i",
+			pgdat->nr_pps_total.counter,
+			pgdat->nr_present_pte.counter,
+			pgdat->nr_unmapped_pte.counter,
+			pgdat->nr_swapped_pte.counter);
+	seq_putc(m, '\n');
 	return 0;
 }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
@ 2007-08-23  9:47                             ` yunfeng zhang
  0 siblings, 0 replies; 27+ messages in thread
From: yunfeng zhang @ 2007-08-23  9:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, hugh, riel

The major changes are
1) Using the nail arithmetic to maximize SwapDevice performance.
2) Adding a PG_pps bit to mark every pps page.
3) Some discussion about NUMA.
See vm_pps.txt.

Index: linux-2.6.22/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22/Documentation/vm_pps.txt	2007-08-23 17:04:12.051837322 +0800
@@ -0,0 +1,365 @@
+
+                         Pure Private Page System (pps)
+                              zyf.zeroos@gmail.com
+                              December 24-26, 2006
+                            Revised on Aug 23, 2007
+
+// Purpose <([{
+The file documents the idea which was first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch this document describes enhances the performance of the Linux swap
+subsystem. You can find the overview of the idea in section <How to Reclaim
+Pages more Efficiently> and how I patch it into Linux 2.6.21 in section
+<Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+OK! In a modern OS, the memory subsystem can be divided into three layers
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Memory inode and zone layer (architecture-dependent).
+
+Since the 2nd layer gathers most of the statistics about page access, it is
+natural that the swap subsystem should be deployed and implemented on the 2nd
+layer.
+
+Undoubtedly, it has several virtues
+1) SwapDaemon can collect statistics about how processes access pages and use
+   them to unmap ptes. SMP especially benefits from this, since we can use
+   flush_tlb_range to unmap ptes in batches rather than issuing a TLB IPI per
+   page as the current Linux legacy swap subsystem does, and the idea I got
+   (dftlb) can do it even better.
+2) Page-fault can issue better readahead requests since history data shows all
+   related pages have conglomerating affinity. In contrast, Linux page-fault
+   readaheads the pages adjacent to the SwapSpace position of the current
+   page-fault page.
+3) It conforms to the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy works from the top down. In fact, the Linux legacy swap
+   subsystem is maybe the only one that works from the bottom up.
+
+Unfortunately, the Linux 2.6.21 swap subsystem is based on the 3rd layer -- a
+system built on memory node::active_list/inactive_list.
+
+The patch I have made is mainly described in section <Pure Private Page System
+-- pps>.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I referred to in the previous section, applying my idea perfectly requires
+uprooting the page-centered swap subsystem and migrating it onto the VMA
+layer, but a huge gap has defeated me -- active_list and inactive_list. In
+fact, you can find lru_add_active code almost anywhere ... It's IMPOSSIBLE for
+me to complete it by myself alone. It also brings a design challenge: a page
+should be totally in the charge of its new owner; however, the Linux page
+management system still traces it by the PG_active flag.
+
+The patch I've made is based on PrivateVMA, or more exactly, a special case of
+it. The current Linux core supports a trick -- COW, used by the fork API. That
+API should rarely be needed; the POSIX thread library and vfork/execve are
+enough for applications, but as a result COW potentially makes a PrivatePage
+shared, so I think it's unnecessary for Linux -- do copy-on-calling (COC) if
+someone really needs CLONE_MM. My patch implements an independent page-recycle
+system rooted in the Linux legacy page system -- pps, which abbreviates Pure
+Private (page) System. pps intercepts all private pages belonging to
+(Stack/Data)VMAs into pps, then uses pps to recycle them. Keep in mind it's a
+one-to-one model -- PrivateVMA, (PresentPTE, UnmappedPTE, SwappedPTE) and
+(PrivatePage, DiskSwapPage). In fact, my patch doesn't change the fork API at
+all; alternatively, if someone calls it, I migrate all pps pages back to Linux
+in migrate_back_legacy_linux(). If Pure PrivateVMA can be accepted totally in
+Linux, it will bring additional virtues
+1) No SwapCache at all. UnmappedPTE + PrivatePage IS the SwapCache of Linux.
+2) swap_info_struct::swap_map could be a bitmap rather than the current
+   (short int) map.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages. The whole process is divided into stages -- see <Stage Definition> --
+and a new arithmetic is described in <SwapEntry Nail Arithmetic>. pps uses
+init_mm::mm_list to enumerate all swappable UserSpace (shrink_private_vma).
+Other sections show the remaining aspects of pps
+1) <Data Definition> is the basic data definition.
+2) <Concurrent Racers of pps> is focused on synchronization.
+3) <VMA Lifecycle of pps> which VMAs belong to pps.
+4) <PTE of pps> which pte types are active during pps.
+5) <Private Page Lifecycle of pps> how private pages enter and leave pps.
+6) <New core daemon -- kppsd> the new daemon thread kppsd.
+
+I'm also glad to highlight my new idea -- dftlb, which is described in
+section <Delay to Flush TLB>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB (dftlb) is introduced by me to make TLB flushing more
+efficient; in brief, the new idea flushes TLBs in batches without EVEN pausing
+other CPUs. The whole sample is vmscan.c:fill_in_tlb_tasks>>end_dftlb. Note,
+the target CPU must support
+1) an atomic cmpxchg instruction.
+2) atomically setting the access bit after the CPU touches a pte for the
+   first time.
+
+And I still wonder whether dftlb can work on other architectures, especially
+with some non-x86 concepts -- invalidate mmu etc. So there is no guarantee in
+my dftlb code and EVEN in my idea.
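+
+The core step can be sketched as below (illustrative only -- the real code is
+in vmscan.c:__pps_stage2; pte_mkunmapped and pte_low are the i386-specific
+pieces introduced/used by the patch):
+
+    /* Shift one untouched PresentPTE to an UnmappedPTE without an IPI. */
+    static int dftlb_unmap_one(pte_t *pte)
+    {
+            pte_t old = *pte, unmapped = pte_mkunmapped(old);
+
+            if (pte_young(old))     /* touched again since stage 1 */
+                    return 0;
+            /* If another CPU touches the pte meanwhile, its access bit is
+             * set atomically, the cmpxchg fails and we simply skip it. */
+            if (cmpxchg(&pte->pte_low, old.pte_low, unmapped.pte_low)
+                            != old.pte_low)
+                    return 0;
+            if (pte_dirty(old))
+                    set_page_dirty(pfn_to_page(pte_pfn(old)));
+            return 1;
+    }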
+// }])>
+
+// Stage Definition <([{
+Every pte-page pair undergoes six stages, which are defined in
+get_series_stage of mm/vmscan.c.
+1) Clear the present bit of the PresentPTE.
+2) Use flush_tlb_range or dftlb to flush the untouched PTEs.
+3) Link or re-link a SwapEntry to the PrivatePage (nail arithmetic).
+4) Flush the PrivatePage to its SwapPage.
+5) Reclaim the page and shift the UnmappedPTE to a SwappedPTE.
+6) SwappedPTE stage (null operation).
+
+The stages are handled in shrink_pvma_scan_ptes; the function is called by the
+global kppsd thread (stages 1-2) and by the kswapd of every memory inode
+(stages 3-6). So every pte-page pair is thread-safe entirely within
+shrink_pvma_scan_ptes. By the way, the current series_t instance lives
+entirely on the kernel stack; it may be too large for a 4K kernel stack.
+// }])>
+
+// Data Definition <([{
+A new VMA flag (VM_PURE_PRIVATE) is added to the VMA flags in
+include/linux/mm.h. The flag is set/cleared in mm/memory.c:{enter_pps,
+leave_pps} while mmap_sem is write-locked.
+
+A new PTE type (UnmappedPTE) is added to the PTE system in
+include/asm-i386/pgtable.h. Its prototype is
+struct UnmappedPTE {
+    int present : 1; // must be 0.
+    ...
+    int pageNum : 20;
+};
+The new PTE has one feature: it keeps a link to its PrivatePage while
+preventing the CPU from visiting the page, so in <Stage Definition> its
+related page is still available in stages 3-5 even though it was unmapped in
+stage 2. Take pte_lock to shift it.
+
+New PG_pps flag in include/linux/page-flags.h.
+A page belonging to pps is or-ed with a new flag, PG_pps, which is set/cleared
+in pps_page_{construction, destruction}. The flag should be set/cleared/tested
+under pte_lock if you hold mmap_sem for reading; an exception is
+get_series_stage of vmscan.c. Its related pte must be a PresentPTE or an
+UnmappedPTE, but the converse isn't true, see the next paragraph.
+
+UnmappedPTE + non-PG_pps page.
+In fact, it's possible that an UnmappedPTE links a page without the PG_pps
+flag; the case occurs in pps_swapin_readahead. When a page is readahead into
+pps, it's linked not only into the Linux SwapCache but also to its related
+PTE via an UnmappedPTE. Meanwhile, the page isn't or-ed with the PG_pps flag;
+that is done in do_unmapped_page when a page fault occurs.
+
+Readahead PPSPage and SwapCache.
+pps excludes SwapCache entirely, but removing it is a heavy job for me since
+currently not only the fork API (or shared PrivatePages) but also shmem are
+using it! So I must keep compatible with Linux legacy code; when
+memory.c:swapin_readahead readaheads DiskPages into SwapCache according to the
+offset of the fault page, it also links them into the active list in
+read_swap_cache_async, and some of them maybe ARE ppspages! I placed some code
+into do_swap_page and pps_swapin_readahead to remove them from
+zone::(in)active_list, but the code degrades system performance if there's a
+race. The case is a PPSPage resident in memory and SwapCache without an
+UnmappedPTE.
+
+PresentPTE + ReservedPage (ZeroPage).
+To relieve memory pressure, there's a COW case in pps: when a read fault
+occurs on a NullPTE, do_anonymous_page links a ZeroPage to the pte, and
+creating the PPSPage is delayed until do_wp_page. Meanwhile, the ZeroPage
+doesn't get PG_pps. It's the only case where the pps system uses a Linux
+legacy page directly.
+
+Linux struct page definition in pps.
+Most fields of struct page are unused. Currently, only the flags, _count and
+private fields are active in pps. The other fields are still set to keep
+compatible with Linux. In fact, we could discard the _count field safely: if
+the core wants to share a PrivatePage (get_user_pages and
+pps_swapoff_scan_ptes), add a new PG_kmap bit to the flags field; and pps
+excludes the swap cache. A recommended definition by me is
+struct pps_page {
+        int flags;
+        int unmapped_age; // An advised code in shrink_pvma_scan_ptes.
+        swp_entry_t swp;
+        // the PG_lock/PG_writeback wait queue of the page.
+        wait_queue_head_t wq;
+        slist freePages; // (*)
+}
+*) Single list is enough to pps-page, when the page is flushed by pps_stage4,
+we can link it into a slist queue to make page-reclamation quicklier.
+
+New fields nr_pps_total, nr_present_pte, nr_unmapped_pte and nr_swapped_pte
+are added to mmzone.h:pglist_data to track pps statistics; they are output to
+/proc/zoneinfo in mm/vmstat.c.
+// }])>
+
+// Concurrent Racers of pps <([{
+shrink_private_vma of mm/vmscan.c uses init_mm.mmlist to scan all swappable
+mm_struct instances; during scanning and reclamation it read-locks
+mm_struct::mmap_sem, which brings some potential concurrent racers
+1) mm/swapfile.c pps_swapoff    (swapoff API)
+2) mm/memory.c   do_{anonymous, unmapped, wp, swap}_page (page fault)
+3) mm/memory.c   get_user_pages (sometimes the core needs to share a
+   PrivatePage with us)
+4) mm/vmscan.c   balance_pgdat  (kswapd/x can do stages 3-5 of its node pages,
+   while kppsd can do stages 1-2)
+5) mm/vmscan.c   kppsd          (new core daemon -- kppsd, see below)
+6) mm/migrate.c  ---            (migrate_entry is a special SwappedPTE, doing
+   stages 6-1; I haven't finished the job yet due to hardware restrictions)
+
+Other cases that influence pps are
+Write-locks mmap_sem:
+1) mm/memory.c   zap_pte_range  (free pages)
+2) mm/memory.c   migrate_back_legacy_linux  (exit from pps to Linux on fork)
+
+No influence on mmap_sem:
+1) mm/page_io.c  end_swap_bio_write (device asynchronous write-IO callback)
+2) mm/page_io.c  end_swap_bio_read  (device asynchronous read-IO callback)
+
+No new lock order is defined in pps, that is, it's compliant with the Linux
+lock order. The locking in shrink_private_vma is copied from shrink_list of
+2.6.16.29 (my initial version). The only exception is in pps_shrink_pgdata,
+which locks the former and later pages of a series.
+// }])>
+
+// New core daemon -- kppsd <([{
+A new kernel thread, kppsd, is introduced in kppsd(void*) of mm/vmscan.c to
+unmap PrivatePages, shifting their PTEs to UnmappedPTE; it runs periodically.
+
+Two pps strategies are provided, for UMA and NUMA respectively. On UMA, the
+pps daemon does stages 1-4 and kswapd/x does stage 5. On NUMA, the pps daemon
+does stages 1-2 only and kswapd/x does stages 3-5 using the pps lists of
+pglist_data. Both are controlled by delivering a pps command in scan_control
+to shrink_private_vma. Currently only the latter is completed.
+
+shrink_private_vma is controlled by two new scan_control fields --
+reclaim_node and is_kppsd. reclaim_node (a node number; -1 means all memory
+inodes) is used when a memory inode is low: the caller (kswapd/x) typically
+sets reclaim_node to make shrink_private_vma (vmscan.c:balance_pgdat) flush
+and reclaim pages. Note that is_kppsd = 1 only for kppsd. The other legacy
+fields still honored by pps are gfp_mask, may_writepage and may_swap.
+
+When a memory inode is low, kswapd/x can wake up kppsd and accelerate it by
+increasing the global variable accelerate_kppsd (vmscan.c:balance_pgdat).
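+
+A hedged sketch of how a caller might fill in scan_control (reclaim_node and
+is_kppsd are the fields added in mm/vmscan.c; the exact signature and call
+site of shrink_private_vma in balance_pgdat are assumed here, not quoted):
+
+    struct scan_control sc = {
+            .gfp_mask      = GFP_KERNEL,
+            .may_writepage = 1,
+            .may_swap      = 1,
+            .reclaim_node  = pgdat->node_id, /* -1 would mean all inodes */
+            .is_kppsd      = 0,              /* caller is kswapd/x       */
+    };
+    shrink_private_vma(&sc);                 /* flush and reclaim the node */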
+
+Unmapping PrivateVMAs in shrink_private_vma isn't all that kppsd could do;
+there are more tasks to be done (unimplemented)
+1) An application may express its memory-inode affinity via the mbind API; in
+   pps, it's recommended to do that migration task at stage 2.
+2) If a memory inode is low, immediately migrate its pages to another memory
+   inode at stage 2 -- balancing the NUMA memory inodes.
+3) Not only pure PrivateVMAs but also other SharedVMAs could be scanned and
+   unmapped.
+4) madvise API flags could be handled here.
+Tasks 1 and 2 can be implemented only when the target CPU supports dftlb.
+// }])>
+
+// VMA Lifecycle of pps <([{
+When a PrivateVMA enters pps, the new flag VM_PURE_PRIVATE is or-ed into it in
+memory.c:enter_pps; there you can also see which VMAs qualify for pps, as
+shown in the sketch below. The flag is used mainly in shrink_private_vma of
+mm/vmscan.c. Other fields are left untouched.
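+
+The qualifying check is short; condensed from memory.c:enter_pps in this patch
+(a VMA enters pps only if it is anonymous and carries no flags outside this
+whitelist):
+
+    int condition = VM_READ | VM_WRITE | VM_EXEC |
+                    VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC |
+                    VM_GROWSDOWN | VM_GROWSUP |
+                    VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY |
+                    VM_ACCOUNT | VM_PURE_PRIVATE;
+    if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL)
+            vma->vm_flags |= VM_PURE_PRIVATE;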
+
+IN.
+1) fs/exec.c    setup_arg_pages         (StackVMA)
+2) mm/mmap.c    do_mmap_pgoff, do_brk   (DataVMA)
+3) mm/mmap.c    split_vma, copy_vma     (in some cases we need to copy a VMA
+   from an existing VMA)
+
+OUT.
+1) kernel/fork.c   dup_mmap               (if someone forks, the VMA is
+   returned to the Linux legacy system)
+2) mm/mmap.c       remove_vma, vma_adjust (destroy a VMA)
+3) mm/mmap.c       do_mmap_pgoff          (delete the VMA when an error occurs)
+
+pps VMAs can coexist with madvise, mlock, mprotect, mmap and munmap, which is
+why a new VMA created by mmap.c:split_vma can re-enter pps.
+// }])>
+
+// PTE of pps <([{
+The active PTE types in pps are NullPTE, PresentPTE, UntouchedPTE, UnmappedPTE
+and SwappedPTE.
+
+1) page-fault   {NullPTE, UnmappedPTE} >> PresentPTE    (others such as
+   get_user_pages, pps_swapoff etc. also use the page-fault path indirectly;
+   see the condensed dispatch below)
+2) shrink_pvma_scan_ptes   PresentPTE >> UntouchedPTE >> UnmappedPTE >>
+   SwappedPTE   (in fact, the whole process is split between kppsd and
+   kswapd/x)
+3) -   MigrateEntryPTE >> PresentPTE   (migrate pages between memory inodes)
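+
+The fault-side dispatch, condensed from the handle_pte_fault hunk in
+mm/memory.c below (the nonlinear-file and vm_ops branches are omitted here for
+brevity):
+
+    entry = *pte;
+    if (!pte_present(entry)) {
+            if (pte_unmapped(entry))        /* UnmappedPTE >> PresentPTE */
+                    return do_unmapped_page(mm, vma, address, pte, pmd,
+                                            write_access, entry);
+            if (pte_none(entry))            /* NullPTE >> PresentPTE     */
+                    return do_anonymous_page(mm, vma, address, pte, pmd,
+                                             write_access);
+            return do_swap_page(mm, vma, address, pte, pmd,
+                                write_access, entry);
+    }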
+// }])>
+
+// Private Page Lifecycle of pps <([{
+All pages belonging to pps are called pure private pages; their PTE type is
+PresentPTE or UnmappedPTE. Note that the Linux fork API potentially makes a
+PrivatePage shared by multiple processes, so fork is excluded from pps.
+
+IN (NOTE: when a pure private page enters pps, it's also trimmed from the
+Linux legacy page system by removing the lru_cache_add_active call)
+1) fs/exec.c    install_arg_page    (argument pages)
+2) mm/memory.c  do_{anonymous, unmapped, wp, swap}_page (page fault)
+3) mm/memory.c  pps_swapin_readahead    (readahead swap pages) (*)
+*) In fact, it isn't exactly a ppspage yet, see <Data Definition>.
+
+OUT
+1) mm/vmscan.c  pps_stage5              (stage 5, reclaim a private page)
+2) mm/memory.c  zap_pte_range           (free a page)
+3) kernel/fork.c    dup_mmap>>leave_pps (if someone forks, migrate all pps
+   pages back so the Linux legacy page system manages them)
+4) mm/memory.c  do_{unmapped, swap}_page  (swapin pages hit an IO error) (*)
+*) In fact, it isn't exactly a ppspage, see <Data Definition>.
+
+The struct pps_page of <Data Definition> comes with a pair of functions,
+pps_page_{construction, destruction} in memory.c; they're used to shift the
+different fields between struct page and pps_page.
+// }])>
+
+// pps and NUMA <([{
+The new memory model brings a top-down scanning strategy. Its advantages are
+unmapping PTEs in batches by flush_tlb_range or even dftlb, and using the nail
+algorithm to manage SwapSpace. But on NUMA it's a different story.
+
+On NUMA, the MPOL_INTERLEAVE policy is used by default to balance the memory
+inodes, but the policy also scatters a process's pages across all inodes, so
+when one inode is low the new scanning strategy makes Linux unmap whole page
+tables just to reclaim THAT inode to the SwapDevice, which puts heavy pressure
+on SwapSpace.
+
+Here a new policy is recommended -- MPOL_STRIPINTERLEAVE, see mm/mempolicy.c.
+Instead of the current per-page MPOL_INTERLEAVE, the policy establishes a
+strip-like mapping between the linear address space and InodeSpace to give
+scanning and flushing more affinity (a sketch of the mapping follows the
+list). The disadvantages are
+1) The relationship can easily be broken by the user calling mbind with a
+   different inode set.
+2) At page fault, to maintain the fixed relationship, the new page must be
+   allocated from the designated memory inode even if it's low.
+3) Note that for a StackVMA (fs/exec.c:install_arg_page) the last pages are
+   argument pages, which may not belong to our target memory inode.
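+
+The strip mapping is visible in the interleave node computation of
+mm/mempolicy.c (see the hunk below); with the shift of 6 used there, runs of
+64 consecutive pages land on the same memory inode, whereas plain
+MPOL_INTERLEAVE changes inode on every page:
+
+    /* plain interleave:  node = pgoff % nnodes        */
+    /* strip interleave:  node = (pgoff >> 6) % nnodes */
+    unsigned target = vma->vm_flags & VM_PURE_PRIVATE ?
+                      (off >> 6) % nnodes : (unsigned)off % nnodes;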
+
+Another option is balancing the memory inodes by dftlb, see the <kppsd>
+section.
+// }])>
+
+// SwapEntry Nail Arithmetic <([{
+The nail algorithm is introduced by me to enhance the efficiency of the swap
+subsystem. There's nothing mysterious about it: in brief, in a typical series
+some members are SwappedPTEs (called nail SwapEntries), and the other members
+should have their SwapEntries re-linked around those SwappedPTEs. The
+algorithm is based on the pages of the same series having a strong clustering
+affinity. Another virtue is that it also minimizes fragmentation of the
+SwapDevice.
+
+The algorithm is mainly divided into two parts -- vmscan.c:{pps_shrink_pgdata,
+pps_stage3}.
+1) The first task of pps_shrink_pgdata is sorting the items of a series into
+   two categories: one, called 'nail', whose swap entries can't be
+   re-allocated at the moment, and the other, called 'realloc_pages', which
+   should be allocated again around the nails. Another task is maintaining
+   pglist_data::last_nail_swp, which is used to extend the continuity of the
+   former series into the later one. I also highlight the series continuity
+   rules, which are described in the function.
+2) pps_stage3 and its helpers calc_realloc and realloc_around_nails
+   (re-)allocate swap entries for the realloc_pages around the nail_swps.
+
+I also add some new APIs, swap_state.c:pps_relink_swp and
+swapfile.c:{swap_try_alloc_batchly, swap_alloc_around_nail, swap_alloc_batchly,
+swap_free_batchly, scan_swap_map_batchly}, to serve the algorithm. shmem
+should also be able to benefit from these APIs.
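+
+A hedged usage sketch of the allocation APIs (the exact flow inside pps_stage3
+and realloc_around_nails is only assumed here; nail_swp and count are caller
+state, and the array size follows SERIES_BOUND from include/linux/mm.h):
+
+    swp_entry_t new_swps[SERIES_BOUND];
+
+    /* First try to get 'count' slots next to the nail;                 */
+    /* swap_try_alloc_batchly either succeeds completely or not at all. */
+    if (!swap_try_alloc_batchly(nail_swp, count, new_swps))
+            /* Otherwise take whatever is free near the nail; internally */
+            /* this prefers a faster SwapDevice via swap_alloc_batchly.  */
+            swap_alloc_around_nail(nail_swp, count, new_swps);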
+// }])>
+
+// Miscellaneous <([{
+Due to hardware restrictions, migration between memory inodes and
+migrate-entry support aren't completed yet!
+// }])>
+// vim: foldmarker=<([{,}])> foldmethod=marker et
Index: linux-2.6.22/fs/exec.c
===================================================================
--- linux-2.6.22.orig/fs/exec.c	2007-08-23 15:26:44.374380322 +0800
+++ linux-2.6.22/fs/exec.c	2007-08-23 15:30:09.555203322 +0800
@@ -326,11 +326,10 @@
 		pte_unmap_unlock(pte, ptl);
 		goto out;
 	}
+	pps_page_construction(page, vma, address);
 	inc_mm_counter(mm, anon_rss);
-	lru_cache_add_active(page);
-	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
-					page, vma->vm_page_prot))));
-	page_add_new_anon_rmap(page, vma, address);
+	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(page,
+			    vma->vm_page_prot))));
 	pte_unmap_unlock(pte, ptl);

 	/* no need for flush_tlb */
@@ -440,6 +439,7 @@
 			kmem_cache_free(vm_area_cachep, mpnt);
 			return ret;
 		}
+		enter_pps(mm, mpnt);
 		mm->stack_vm = mm->total_vm = vma_pages(mpnt);
 	}

Index: linux-2.6.22/include/asm-i386/pgtable-2level.h
===================================================================
--- linux-2.6.22.orig/include/asm-i386/pgtable-2level.h	2007-08-23 15:26:44.398381822 +0800
+++ linux-2.6.22/include/asm-i386/pgtable-2level.h	2007-08-23 15:30:09.559203572 +0800
@@ -73,21 +73,22 @@
 }

 /*
- * Bits 0, 6 and 7 are taken, split up the 29 bits of offset
+ * Bits 0, 5, 6 and 7 are taken, split up the 28 bits of offset
  * into this range:
  */
-#define PTE_FILE_MAX_BITS	29
+#define PTE_FILE_MAX_BITS	28

 #define pte_to_pgoff(pte) \
-	((((pte).pte_low >> 1) & 0x1f ) + (((pte).pte_low >> 8) << 5 ))
+	((((pte).pte_low >> 1) & 0xf ) + (((pte).pte_low >> 8) << 4 ))

 #define pgoff_to_pte(off) \
-	((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
+	((pte_t) { (((off) & 0xf) << 1) + (((off) >> 4) << 8) + _PAGE_FILE })

 /* Encode and de-code a swap entry */
-#define __swp_type(x)			(((x).val >> 1) & 0x1f)
+#define __swp_type(x)			(((x).val >> 1) & 0xf)
 #define __swp_offset(x)			((x).val >> 8)
-#define __swp_entry(type, offset)	((swp_entry_t) { ((type) << 1) | ((offset) << 8) })
+#define __swp_entry(type, offset)	((swp_entry_t) { ((type & 0xf) << 1) |\
+	((offset) << 8) | _PAGE_SWAPPED })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { (x).val })

Index: linux-2.6.22/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.22.orig/include/asm-i386/pgtable.h	2007-08-23 15:26:44.422383322 +0800
+++ linux-2.6.22/include/asm-i386/pgtable.h	2007-08-23 15:30:09.559203572 +0800
@@ -120,7 +120,11 @@
 #define _PAGE_UNUSED3	0x800

 /* If _PAGE_PRESENT is clear, we use these: */
-#define _PAGE_FILE	0x040	/* nonlinear file mapping, saved PTE; unset:swap */
+#define _PAGE_UNMAPPED	0x020	/* a special PTE type which holds its page
+				   reference even while unmapped, see
+				   Documentation/vm_pps.txt. */
+#define _PAGE_SWAPPED	0x040	/* swapped PTE. */
+#define _PAGE_FILE	0x060	/* nonlinear file mapping, saved PTE; */
 #define _PAGE_PROTNONE	0x080	/* if the user mapped it with PROT_NONE;
 				   pte_present gives true */
 #ifdef CONFIG_X86_PAE
@@ -228,7 +232,12 @@
 /*
  * The following only works if pte_present() is not true.
  */
-static inline int pte_file(pte_t pte)		{ return (pte).pte_low & _PAGE_FILE; }
+static inline int pte_unmapped(pte_t pte)	{ return ((pte).pte_low & 0x60)
+    == _PAGE_UNMAPPED; }
+static inline int pte_swapped(pte_t pte)	{ return ((pte).pte_low & 0x60)
+    == _PAGE_SWAPPED; }
+static inline int pte_file(pte_t pte)		{ return ((pte).pte_low & 0x60)
+    == _PAGE_FILE; }

 static inline pte_t pte_rdprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
 static inline pte_t pte_exprotect(pte_t pte)	{ (pte).pte_low &= ~_PAGE_USER; return pte; }
@@ -241,6 +250,7 @@
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
 static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_mkunmapped(pte_t pte)	{ (pte).pte_low &= ~(_PAGE_PRESENT + 0x60); (pte).pte_low |= _PAGE_UNMAPPED; return pte; }

 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
Index: linux-2.6.22/include/linux/mm.h
===================================================================
--- linux-2.6.22.orig/include/linux/mm.h	2007-08-23 15:26:44.450385072 +0800
+++ linux-2.6.22/include/linux/mm.h	2007-08-23 15:30:09.559203572 +0800
@@ -169,6 +169,9 @@
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP	0x04000000	/* Always include in core dumps */
+#define VM_PURE_PRIVATE	0x08000000	/* The vma belongs to a single mm only,
+					   see Documentation/vm_pps.txt
+					   */

 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -1210,5 +1213,16 @@

 __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);

+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma);
+void leave_pps(struct vm_area_struct* vma, int migrate_flag);
+void pps_page_construction(struct page* page, struct vm_area_struct* vma,
+	unsigned long address);
+void pps_page_destruction(struct page* ppspage, struct vm_area_struct* vma,
+	unsigned long address, int migrate);
+
+#define numa_addr_to_nid(vma, addr) (0)
+
+#define SERIES_LENGTH 8
+#define SERIES_BOUND (SERIES_LENGTH + 1) // used for array declaration.
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux-2.6.22/include/linux/mmzone.h
===================================================================
--- linux-2.6.22.orig/include/linux/mmzone.h	2007-08-23 15:26:44.470386322 +0800
+++ linux-2.6.22/include/linux/mmzone.h	2007-08-23 15:30:09.559203572 +0800
@@ -452,6 +452,15 @@
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+
+	// pps fields, see Documentation/vm_pps.txt.
+	unsigned long last_nail_addr;
+	int last_nail_swp_type;
+	int last_nail_swp_offset;
+	atomic_t nr_pps_total; // = nr_present_pte + nr_unmapped_pte.
+	atomic_t nr_present_pte;
+	atomic_t nr_unmapped_pte;
+	atomic_t nr_swapped_pte;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
Index: linux-2.6.22/include/linux/page-flags.h
===================================================================
--- linux-2.6.22.orig/include/linux/page-flags.h	2007-08-23 15:26:44.494387822 +0800
+++ linux-2.6.22/include/linux/page-flags.h	2007-08-23 15:30:09.559203572 +0800
@@ -90,6 +90,8 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_buddy		19	/* Page is free, on buddy lists */

+#define PG_pps			20	/* See Documentation/vm_pps.txt */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */

@@ -282,4 +284,8 @@
 	test_set_page_writeback(page);
 }

+// Hold PG_locked to set/clear PG_pps.
+#define PagePPS(page)		test_bit(PG_pps, &(page)->flags)
+#define SetPagePPS(page)	set_bit(PG_pps, &(page)->flags)
+#define ClearPagePPS(page)	clear_bit(PG_pps, &(page)->flags)
 #endif	/* PAGE_FLAGS_H */
Index: linux-2.6.22/include/linux/swap.h
===================================================================
--- linux-2.6.22.orig/include/linux/swap.h	2007-08-23 15:26:44.514389072 +0800
+++ linux-2.6.22/include/linux/swap.h	2007-08-23 15:30:09.559203572 +0800
@@ -227,6 +227,7 @@
 #define total_swapcache_pages  swapper_space.nrpages
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, gfp_t);
+extern int pps_relink_swp(struct page*, swp_entry_t, swp_entry_t**);
 extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
 extern int move_to_swap_cache(struct page *, swp_entry_t);
@@ -238,6 +239,10 @@
 extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma,
 					   unsigned long addr);
 /* linux/mm/swapfile.c */
+extern void swap_free_batchly(swp_entry_t*);
+extern void swap_alloc_around_nail(swp_entry_t, int, swp_entry_t*);
+extern int swap_try_alloc_batchly(swp_entry_t, int, swp_entry_t*);
+extern int swap_alloc_batchly(int, swp_entry_t*, int);
 extern long total_swap_pages;
 extern unsigned int nr_swapfiles;
 extern void si_swapinfo(struct sysinfo *);
Index: linux-2.6.22/include/linux/swapops.h
===================================================================
--- linux-2.6.22.orig/include/linux/swapops.h	2007-08-23 15:26:44.538390572 +0800
+++ linux-2.6.22/include/linux/swapops.h	2007-08-23 15:30:09.559203572 +0800
@@ -50,7 +50,7 @@
 {
 	swp_entry_t arch_entry;

-	BUG_ON(pte_file(pte));
+	BUG_ON(!pte_swapped(pte));
 	arch_entry = __pte_to_swp_entry(pte);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
@@ -64,7 +64,7 @@
 	swp_entry_t arch_entry;

 	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
-	BUG_ON(pte_file(__swp_entry_to_pte(arch_entry)));
+	BUG_ON(!pte_swapped(__swp_entry_to_pte(arch_entry)));
 	return __swp_entry_to_pte(arch_entry);
 }

Index: linux-2.6.22/kernel/fork.c
===================================================================
--- linux-2.6.22.orig/kernel/fork.c	2007-08-23 15:26:44.562392072 +0800
+++ linux-2.6.22/kernel/fork.c	2007-08-23 15:30:09.559203572 +0800
@@ -241,6 +241,7 @@
 		tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (!tmp)
 			goto fail_nomem;
+		leave_pps(mpnt, 1);
 		*tmp = *mpnt;
 		pol = mpol_copy(vma_policy(mpnt));
 		retval = PTR_ERR(pol);
Index: linux-2.6.22/mm/fremap.c
===================================================================
--- linux-2.6.22.orig/mm/fremap.c	2007-08-23 15:26:44.582393322 +0800
+++ linux-2.6.22/mm/fremap.c	2007-08-23 15:30:09.563203822 +0800
@@ -37,7 +37,7 @@
 			page_cache_release(page);
 		}
 	} else {
-		if (!pte_file(pte))
+		if (pte_swapped(pte))
 			free_swap_and_cache(pte_to_swp_entry(pte));
 		pte_clear_not_present_full(mm, addr, ptep, 0);
 	}
Index: linux-2.6.22/mm/memory.c
===================================================================
--- linux-2.6.22.orig/mm/memory.c	2007-08-23 15:26:44.602394572 +0800
+++ linux-2.6.22/mm/memory.c	2007-08-23 15:30:09.563203822 +0800
@@ -435,7 +435,7 @@

 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
-		if (!pte_file(pte)) {
+		if (pte_swapped(pte)) {
 			swp_entry_t entry = pte_to_swp_entry(pte);

 			swap_duplicate(entry);
@@ -628,6 +628,7 @@
 	spinlock_t *ptl;
 	int file_rss = 0;
 	int anon_rss = 0;
+	struct pglist_data* node_data;

 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -637,6 +638,7 @@
 			(*zap_work)--;
 			continue;
 		}
+		node_data = NODE_DATA(numa_addr_to_nid(vma, addr));

 		(*zap_work) -= PAGE_SIZE;

@@ -672,6 +674,15 @@
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				if (page != ZERO_PAGE(addr)) {
+					pps_page_destruction(page,vma,addr,0);
+					if (PageWriteback(page)) // WriteIOing.
+						lru_cache_add_active(page);
+					atomic_dec(&node_data->nr_present_pte);
+				}
+			} else
+				page_remove_rmap(page, vma);
 			if (PageAnon(page))
 				anon_rss--;
 			else {
@@ -681,7 +692,6 @@
 					SetPageReferenced(page);
 				file_rss--;
 			}
-			page_remove_rmap(page, vma);
 			tlb_remove_page(tlb, page);
 			continue;
 		}
@@ -691,8 +701,31 @@
 		 */
 		if (unlikely(details))
 			continue;
-		if (!pte_file(ptent))
+		if (pte_unmapped(ptent)) {
+			struct page* page = pfn_to_page(pte_pfn(ptent));
+			BUG_ON(page == ZERO_PAGE(addr));
+			if (PagePPS(page)) {
+				pps_page_destruction(page, vma, addr, 0);
+				atomic_dec(&node_data->nr_unmapped_pte);
+				tlb_remove_page(tlb, page);
+			} else {
+				swp_entry_t entry;
+				entry.val = page_private(page);
+				atomic_dec(&node_data->nr_swapped_pte);
+				if (PageLocked(page)) // ReadIOing.
+					lru_cache_add_active(page);
+				else
+					free_swap_and_cache(entry);
+			}
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			anon_rss--;
+			continue;
+		}
+		if (pte_swapped(ptent)) {
+			if (vma->vm_flags & VM_PURE_PRIVATE)
+				atomic_dec(&node_data->nr_swapped_pte);
 			free_swap_and_cache(pte_to_swp_entry(ptent));
+		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

@@ -955,7 +988,8 @@
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
 			set_page_dirty(page);
-		mark_page_accessed(page);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE))
+			mark_page_accessed(page);
 	}
 unlock:
 	pte_unmap_unlock(ptep, ptl);
@@ -1745,8 +1779,11 @@
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
-		lru_cache_add_active(new_page);
-		page_add_new_anon_rmap(new_page, vma, address);
+		if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+			lru_cache_add_active(new_page);
+			page_add_new_anon_rmap(new_page, vma, address);
+		} else
+			pps_page_construction(new_page, vma, address);

 		/* Free the old page.. */
 		new_page = old_page;
@@ -2082,7 +2119,7 @@
 	for (i = 0; i < num; offset++, i++) {
 		/* Ok, do the async read-ahead now */
 		new_page = read_swap_cache_async(swp_entry(swp_type(entry),
-							   offset), vma, addr);
+			    offset), vma, addr);
 		if (!new_page)
 			break;
 		page_cache_release(new_page);
@@ -2111,6 +2148,156 @@
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 }

+static pte_t* pte_offsetof_base(struct vm_area_struct* vma, pte_t* base,
+		unsigned long base_addr, int offset_index)
+{
+	unsigned long offset_addr;
+	offset_addr = base_addr + offset_index * PAGE_SIZE;
+	if (offset_addr < vma->vm_start || offset_addr >= vma->vm_end)
+		return NULL;
+	if (pgd_index(offset_addr) != pgd_index(base_addr))
+		return NULL;
+	// if (pud_index(offset_addr) != pud_index(base_addr))
+	// 	return NULL;
+	if (pmd_index(offset_addr) != pmd_index(base_addr))
+		return NULL;
+	return base - pte_index(base_addr) + pte_index(offset_addr);
+}
+
+/*
+ * New read ahead code, mainly for VM_PURE_PRIVATE only.
+ */
+static void pps_swapin_readahead(swp_entry_t entry, unsigned long addr, struct
+	vm_area_struct *vma, pte_t* pte, pmd_t* pmd)
+{
+	struct zone* zone;
+	struct page* page;
+	pte_t *prev, *next, orig, pte_unmapped;
+	swp_entry_t temp;
+	int swapType = swp_type(entry);
+	int swapOffset = swp_offset(entry);
+	int readahead = 0, i;
+	spinlock_t *ptl = pte_lockptr(vma->vm_mm, pmd);
+	unsigned long addr_temp;
+
+	if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+		swapin_readahead(entry, addr, vma);
+		return;
+	}
+
+	page = read_swap_cache_async(entry, vma, addr);
+	if (!page)
+		return;
+	page_cache_release(page);
+	lru_add_drain();
+
+	// pps readahead: first forward then backward; the whole range is +/-
+	// 16 ptes around the fault pte, but at most 8 pages are read ahead.
+	//
+	// The best solution would be to read ahead the fault cacheline plus
+	// the prev/next cacheline, but I don't know how to get the size of
+	// the CPU cacheline.
+	//
+	// The new readahead strategy handles the case where Present/Unmapped
+	// PTEs are mixed with SwappedPTEs (the VMA is accessed randomly), so
+	// we don't stop at a Present/UnmappedPTE but continue scanning; all
+	// SwappedPTEs close to the fault pte are read ahead.
+	for (i = 1; i <= 16 && readahead < 8; i++) {
+		next = pte_offsetof_base(vma, pte, addr, i);
+		if (next == NULL)
+			break;
+		orig = *next;
+		if (pte_none(orig) || pte_present(orig) || !pte_swapped(orig))
+			continue;
+		temp = pte_to_swp_entry(orig);
+		if (swp_type(temp) != swapType)
+			continue;
+		if (abs(swp_offset(temp) - swapOffset) > 32)
+			// the two swap entries are too far, give up!
+			continue;
+		addr_temp = addr + i * PAGE_SIZE;
+		page = read_swap_cache_async(temp, vma, addr_temp);
+		if (!page)
+			return;
+		lru_add_drain();
+		// Add the page into pps, first remove it from (in)activelist.
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		while (1) {
+			if (!PageLRU(page)) {
+				// Shit! vmscan.c:isolate_lru_page is working
+				// on it!
+				spin_unlock_irq(&zone->lru_lock);
+				cond_resched();
+				spin_lock_irq(&zone->lru_lock);
+			} else {
+				list_del(&page->lru);
+				ClearPageActive(page);
+				ClearPageLRU(page);
+				break;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+		pte_unmapped = mk_pte(page, vma->vm_page_prot);
+		pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+		pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+		spin_lock(ptl);
+		if (unlikely(pte_same(*next, orig))) {
+			set_pte_at(vma->vm_mm, addr_temp, next, pte_unmapped);
+			readahead++;
+		}
+		spin_unlock(ptl);
+	}
+	for (i = -1; i >= -16 && readahead < 8; i--) {
+		prev = pte_offsetof_base(vma, pte, addr, i);
+		if (prev == NULL)
+			break;
+		orig = *prev;
+		if (pte_none(orig) || pte_present(orig) || !pte_swapped(orig))
+			continue;
+		temp = pte_to_swp_entry(orig);
+		if (swp_type(temp) != swapType)
+			continue;
+		if (abs(swp_offset(temp) - swapOffset) > 32)
+			// the two swap entries are too far, give up!
+			continue;
+		addr_temp = addr + i * PAGE_SIZE;
+		page = read_swap_cache_async(temp, vma, addr_temp);
+		if (!page)
+			return;
+		lru_add_drain();
+		// Add the page into pps, first remove it from (in)activelist.
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		while (1) {
+			if (!PageLRU(page)) {
+				// Shit! vmscan.c:isolate_lru_page is working
+				// on it!
+				spin_unlock_irq(&zone->lru_lock);
+				cond_resched();
+				spin_lock_irq(&zone->lru_lock);
+			} else {
+				list_del(&page->lru);
+				ClearPageActive(page);
+				ClearPageLRU(page);
+				break;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+		page_cache_release(page);
+		pte_unmapped = mk_pte(page, vma->vm_page_prot);
+		pte_unmapped.pte_low &= ~_PAGE_PRESENT;
+		pte_unmapped.pte_low |= _PAGE_UNMAPPED;
+		spin_lock(ptl);
+		if (unlikely(pte_same(*prev, orig))) {
+			set_pte_at(vma->vm_mm, addr_temp, prev, pte_unmapped);
+			readahead++;
+		}
+		spin_unlock(ptl);
+	}
+}
+
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -2125,6 +2312,7 @@
 	swp_entry_t entry;
 	pte_t pte;
 	int ret = VM_FAULT_MINOR;
+	struct pglist_data* node_data;

 	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
 		goto out;
@@ -2138,7 +2326,7 @@
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		grab_swap_token(); /* Contend for token _before_ read-in */
- 		swapin_readahead(entry, address, vma);
+		pps_swapin_readahead(entry, address, vma, page_table, pmd);
  		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
@@ -2161,6 +2349,26 @@
 	mark_page_accessed(page);
 	lock_page(page);

+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		// Add the page into pps, first remove it from (in)activelist.
+		struct zone* zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		while (1) {
+			if (!PageLRU(page)) {
+				// Shit! vmscan.c:isolate_lru_page is working
+				// on it!
+				spin_unlock_irq(&zone->lru_lock);
+				cond_resched();
+				spin_lock_irq(&zone->lru_lock);
+			} else {
+				list_del(&page->lru);
+				ClearPageActive(page);
+				ClearPageLRU(page);
+				break;
+			}
+		}
+		spin_unlock_irq(&zone->lru_lock);
+	}
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
@@ -2170,6 +2378,8 @@

 	if (unlikely(!PageUptodate(page))) {
 		ret = VM_FAULT_SIGBUS;
+		if (vma->vm_flags & VM_PURE_PRIVATE)
+			lru_cache_add_active(page);
 		goto out_nomap;
 	}

@@ -2181,15 +2391,25 @@
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		write_access = 0;
 	}
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		// To pps, there's no copy-on-write (COW).
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		write_access = 0;
+	}

 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
-	page_add_anon_rmap(page, vma, address);

 	swap_free(entry);
 	if (vm_swap_full())
 		remove_exclusive_swap_page(page);
 	unlock_page(page);
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		node_data = NODE_DATA(page_to_nid(page));
+		pps_page_construction(page, vma, address);
+		atomic_dec(&node_data->nr_swapped_pte);
+	} else
+		page_add_anon_rmap(page, vma, address);

 	if (write_access) {
 		if (do_wp_page(mm, vma, address,
@@ -2241,9 +2461,12 @@
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto release;
+		if (!(vma->vm_flags & VM_PURE_PRIVATE)) {
+			lru_cache_add_active(page);
+			page_add_new_anon_rmap(page, vma, address);
+		} else
+			pps_page_construction(page, vma, address);
 		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
-		page_add_new_anon_rmap(page, vma, address);
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
@@ -2508,6 +2731,76 @@
 	return VM_FAULT_MAJOR;
 }

+// pps special page-fault route, see Documentation/vm_pps.txt.
+static int do_unmapped_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pte_t *page_table, pmd_t *pmd,
+		int write_access, pte_t orig_pte)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	pte_t pte;
+	int ret = VM_FAULT_MINOR;
+	struct page* page;
+	swp_entry_t entry;
+	struct pglist_data* node_data;
+	BUG_ON(!(vma->vm_flags & VM_PURE_PRIVATE));
+
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*page_table, orig_pte)))
+		goto unlock;
+	page = pte_page(*page_table);
+	node_data = NODE_DATA(page_to_nid(page));
+	if (PagePPS(page)) {
+		// The page is a pure UnmappedPage done by pps_stage2.
+		pte = mk_pte(page, vma->vm_page_prot);
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		flush_icache_page(vma, page);
+		set_pte_at(mm, address, page_table, pte);
+		update_mmu_cache(vma, address, pte);
+		lazy_mmu_prot_update(pte);
+		atomic_dec(&node_data->nr_unmapped_pte);
+		atomic_inc(&node_data->nr_present_pte);
+		goto unlock;
+	}
+	entry.val = page_private(page);
+	page_cache_get(page);
+	spin_unlock(ptl);
+	// The page is a readahead page.
+	lock_page(page);
+	spin_lock(ptl);
+	if (unlikely(!pte_same(*page_table, orig_pte)))
+		goto out_nomap;
+	if (unlikely(!PageUptodate(page))) {
+		ret = VM_FAULT_SIGBUS;
+		// If we encounter an IO error, shift the PTE back from
+		// UnmappedPTE to SwappedPTE so that Linux can recycle the page.
+		set_pte_at(mm, address, page_table, swp_entry_to_pte(entry));
+		lru_cache_add_active(page);
+		goto out_nomap;
+	}
+	inc_mm_counter(mm, anon_rss);
+	pte = mk_pte(page, vma->vm_page_prot);
+	pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+	flush_icache_page(vma, page);
+	set_pte_at(mm, address, page_table, pte);
+	pps_page_construction(page, vma, address);
+	swap_free(entry);
+	if (vm_swap_full())
+		remove_exclusive_swap_page(page);
+	update_mmu_cache(vma, address, pte);
+	lazy_mmu_prot_update(pte);
+	atomic_dec(&node_data->nr_swapped_pte);
+	unlock_page(page);
+
+unlock:
+	pte_unmap_unlock(page_table, ptl);
+	return ret;
+out_nomap:
+	pte_unmap_unlock(page_table, ptl);
+	unlock_page(page);
+	page_cache_release(page);
+	return ret;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -2530,6 +2823,9 @@

 	entry = *pte;
 	if (!pte_present(entry)) {
+		if (pte_unmapped(entry))
+			return do_unmapped_page(mm, vma, address, pte, pmd,
+					write_access, entry);
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (vma->vm_ops->nopage)
@@ -2817,3 +3113,147 @@

 	return buf - old_buf;
 }
+
+static void migrate_back_pte_range(struct mm_struct* mm, pmd_t *pmd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	struct page* page;
+	swp_entry_t swp;
+	pte_t entry;
+	pte_t *pte;
+	spinlock_t* ptl;
+	struct pglist_data* node_data;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	do {
+		node_data = NODE_DATA(numa_addr_to_nid(vma, addr));
+		if (pte_present(*pte)) {
+			page = pte_page(*pte);
+			if (page == ZERO_PAGE(addr))
+				continue;
+			pps_page_destruction(page, vma, addr, 1);
+			lru_cache_add_active(page);
+			atomic_dec(&node_data->nr_present_pte);
+		} else if (pte_unmapped(*pte)) {
+			page = pte_page(*pte);
+			BUG_ON(page == ZERO_PAGE(addr));
+			if (!PagePPS(page)) {
+				// the page is a readaheaded page, so convert
+				// UnmappedPTE to SwappedPTE.
+				swp.val = page_private(page);
+				entry = swp_entry_to_pte(swp);
+				atomic_dec(&node_data->nr_swapped_pte);
+			} else {
+				// UnmappedPTE to PresentPTE.
+				entry = mk_pte(page, vma->vm_page_prot);
+				entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+				pps_page_destruction(page, vma, addr, 1);
+				atomic_dec(&node_data->nr_unmapped_pte);
+			}
+			set_pte_at(mm, addr, pte, entry);
+			lru_cache_add_active(page);
+		} else if (pte_swapped(*pte))
+			atomic_dec(&node_data->nr_swapped_pte);
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap_unlock(pte - 1, ptl);
+	lru_add_drain();
+}
+
+static void migrate_back_pmd_range(struct mm_struct* mm, pud_t *pud, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		migrate_back_pte_range(mm, pmd, vma, addr, next);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void migrate_back_pud_range(struct mm_struct* mm, pgd_t *pgd, struct
+		vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		migrate_back_pmd_range(mm, pud, vma, addr, next);
+	} while (pud++, addr = next, addr != end);
+}
+
+// migrate all pages of pure private vma back to Linux legacy memory management.
+static void migrate_back_legacy_linux(struct mm_struct* mm, struct vm_area_struct* vma)
+{
+	pgd_t* pgd;
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		migrate_back_pud_range(mm, pgd, vma, addr, next);
+	} while (pgd++, addr = next, addr != end);
+}
+
+void enter_pps(struct mm_struct* mm, struct vm_area_struct* vma)
+{
+	int condition = VM_READ | VM_WRITE | VM_EXEC | \
+		 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | \
+		 VM_GROWSDOWN | VM_GROWSUP | \
+		 VM_LOCKED | VM_SEQ_READ | VM_RAND_READ | VM_DONTCOPY | \
+		 VM_ACCOUNT | VM_PURE_PRIVATE;
+	if (!(vma->vm_flags & ~condition) && vma->vm_file == NULL) {
+		vma->vm_flags |= VM_PURE_PRIVATE;
+		if (list_empty(&mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&mm->mmlist))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+	}
+}
+
+/*
+ * Caller must down_write mmap_sem.
+ */
+void leave_pps(struct vm_area_struct* vma, int migrate_flag)
+{
+	struct mm_struct* mm = vma->vm_mm;
+
+	if (vma->vm_flags & VM_PURE_PRIVATE) {
+		vma->vm_flags &= ~VM_PURE_PRIVATE;
+		if (migrate_flag)
+			migrate_back_legacy_linux(mm, vma);
+	}
+}
+
+void pps_page_construction(struct page* page, struct vm_area_struct* vma,
+	unsigned long address)
+{
+	struct pglist_data* node_data = NODE_DATA(page_to_nid(page));
+	atomic_inc(&node_data->nr_pps_total);
+	atomic_inc(&node_data->nr_present_pte);
+	SetPagePPS(page);
+	page_add_new_anon_rmap(page, vma, address);
+}
+
+void pps_page_destruction(struct page* ppspage, struct vm_area_struct* vma,
+	unsigned long address, int migrate)
+{
+	struct pglist_data* node_data = NODE_DATA(page_to_nid(ppspage));
+	atomic_dec(&node_data->nr_pps_total);
+	if (!migrate)
+		page_remove_rmap(ppspage, vma);
+	ClearPagePPS(ppspage);
+}
Index: linux-2.6.22/mm/mempolicy.c
===================================================================
--- linux-2.6.22.orig/mm/mempolicy.c	2007-08-23 15:26:44.626396072 +0800
+++ linux-2.6.22/mm/mempolicy.c	2007-08-23 15:30:09.563203822 +0800
@@ -1166,7 +1166,8 @@
 		struct vm_area_struct *vma, unsigned long off)
 {
 	unsigned nnodes = nodes_weight(pol->v.nodes);
-	unsigned target = (unsigned)off % nnodes;
+	unsigned target = vma->vm_flags & VM_PURE_PRIVATE ? (off >> 6) % nnodes
+		: (unsigned) off % nnodes;
 	int c;
 	int nid = -1;

Index: linux-2.6.22/mm/migrate.c
===================================================================
--- linux-2.6.22.orig/mm/migrate.c	2007-08-23 15:26:44.658398072 +0800
+++ linux-2.6.22/mm/migrate.c	2007-08-23 15:30:09.567204072 +0800
@@ -117,7 +117,7 @@

 static inline int is_swap_pte(pte_t pte)
 {
-	return !pte_none(pte) && !pte_present(pte) && !pte_file(pte);
+	return !pte_none(pte) && !pte_present(pte) && pte_swapped(pte);
 }

 /*
Index: linux-2.6.22/mm/mincore.c
===================================================================
--- linux-2.6.22.orig/mm/mincore.c	2007-08-23 15:26:44.678399322 +0800
+++ linux-2.6.22/mm/mincore.c	2007-08-23 15:30:09.567204072 +0800
@@ -114,6 +114,13 @@
 			} else
 				present = 0;

+		} else if (pte_unmapped(pte)) {
+			struct page* page = pfn_to_page(pte_pfn(pte));
+			if (PagePPS(page))
+				present = 1;
+			else
+				present = PageUptodate(page);
+
 		} else if (pte_file(pte)) {
 			pgoff = pte_to_pgoff(pte);
 			present = mincore_page(vma->vm_file->f_mapping, pgoff);
Index: linux-2.6.22/mm/mmap.c
===================================================================
--- linux-2.6.22.orig/mm/mmap.c	2007-08-23 15:26:44.698400572 +0800
+++ linux-2.6.22/mm/mmap.c	2007-08-23 15:30:09.567204072 +0800
@@ -230,6 +230,7 @@
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_free(vma_policy(vma));
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
 }
@@ -623,6 +624,7 @@
 			fput(file);
 		mm->map_count--;
 		mpol_free(vma_policy(next));
+		leave_pps(next, 0);
 		kmem_cache_free(vm_area_cachep, next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
@@ -1115,6 +1117,8 @@
 	if ((vm_flags & (VM_SHARED|VM_ACCOUNT)) == (VM_SHARED|VM_ACCOUNT))
 		vma->vm_flags &= ~VM_ACCOUNT;

+	enter_pps(mm, vma);
+
 	/* Can addr have changed??
 	 *
 	 * Answer: Yes, several device drivers can do it in their
@@ -1141,6 +1145,7 @@
 			fput(file);
 		}
 		mpol_free(vma_policy(vma));
+		leave_pps(vma, 0);
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1168,6 +1173,7 @@
 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 free_vma:
+	leave_pps(vma, 0);
 	kmem_cache_free(vm_area_cachep, vma);
 unacct_error:
 	if (charged)
@@ -1745,6 +1751,10 @@

 	/* most fields are the same, copy all, and then fixup */
 	*new = *vma;
+	if (new->vm_flags & VM_PURE_PRIVATE) {
+		new->vm_flags &= ~VM_PURE_PRIVATE;
+		enter_pps(mm, new);
+	}

 	if (new_below)
 		new->vm_end = addr;
@@ -1953,6 +1963,7 @@
 	vma->vm_flags = flags;
 	vma->vm_page_prot = protection_map[flags &
 				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+	enter_pps(mm, vma);
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 out:
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -2079,6 +2090,10 @@
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
 				new_vma->vm_ops->open(new_vma);
+			if (new_vma->vm_flags & VM_PURE_PRIVATE) {
+				new_vma->vm_flags &= ~VM_PURE_PRIVATE;
+				enter_pps(mm, new_vma);
+			}
 			vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		}
 	}
Index: linux-2.6.22/mm/mprotect.c
===================================================================
--- linux-2.6.22.orig/mm/mprotect.c	2007-08-23 15:26:44.718401822 +0800
+++ linux-2.6.22/mm/mprotect.c	2007-08-23 15:30:09.567204072 +0800
@@ -55,7 +55,7 @@
 			set_pte_at(mm, addr, pte, ptent);
 			lazy_mmu_prot_update(ptent);
 #ifdef CONFIG_MIGRATION
-		} else if (!pte_file(oldpte)) {
+		} else if (pte_swapped(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);

 			if (is_write_migration_entry(entry)) {
Index: linux-2.6.22/mm/page_alloc.c
===================================================================
--- linux-2.6.22.orig/mm/page_alloc.c	2007-08-23 15:26:44.738403072 +0800
+++ linux-2.6.22/mm/page_alloc.c	2007-08-23 15:30:09.567204072 +0800
@@ -598,7 +598,8 @@
 			1 << PG_swapcache |
 			1 << PG_writeback |
 			1 << PG_reserved |
-			1 << PG_buddy ))))
+			1 << PG_buddy |
+			1 << PG_pps))))
 		bad_page(page);

 	/*
Index: linux-2.6.22/mm/rmap.c
===================================================================
--- linux-2.6.22.orig/mm/rmap.c	2007-08-23 15:26:44.762404572 +0800
+++ linux-2.6.22/mm/rmap.c	2007-08-23 15:30:09.571204322 +0800
@@ -660,6 +660,8 @@
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;

+	BUG_ON(vma->vm_flags & VM_PURE_PRIVATE);
+
 	address = vma_address(page, vma);
 	if (address == -EFAULT)
 		goto out;
@@ -718,7 +720,7 @@
 #endif
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
-		BUG_ON(pte_file(*pte));
+		BUG_ON(!pte_swapped(*pte));
 	} else
 #ifdef CONFIG_MIGRATION
 	if (migration) {
Index: linux-2.6.22/mm/swap_state.c
===================================================================
--- linux-2.6.22.orig/mm/swap_state.c	2007-08-23 15:26:44.782405822 +0800
+++ linux-2.6.22/mm/swap_state.c	2007-08-23 15:30:09.571204322 +0800
@@ -136,6 +136,43 @@
 	INC_CACHE_INFO(del_total);
 }

+int pps_relink_swp(struct page* page, swp_entry_t entry, swp_entry_t** thrash)
+{
+	BUG_ON(!PageLocked(page));
+	ClearPageError(page);
+	if (radix_tree_preload(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN))
+		goto failed;
+	write_lock_irq(&swapper_space.tree_lock);
+	if (radix_tree_insert(&swapper_space.page_tree, entry.val, page))
+		goto preload_failed;
+	if (PageSwapCache(page)) {
+		(**thrash).val = page_private(page);
+		radix_tree_delete(&swapper_space.page_tree, (**thrash).val);
+		(*thrash)++;
+		INC_CACHE_INFO(del_total);
+	} else {
+		page_cache_get(page);
+		SetPageSwapCache(page);
+		total_swapcache_pages++;
+		__inc_zone_page_state(page, NR_FILE_PAGES);
+	}
+	INC_CACHE_INFO(add_total);
+	set_page_private(page, entry.val);
+	SetPageDirty(page);
+	SetPageUptodate(page);
+	write_unlock_irq(&swapper_space.tree_lock);
+	radix_tree_preload_end();
+	return 1;
+
+preload_failed:
+	write_unlock_irq(&swapper_space.tree_lock);
+	radix_tree_preload_end();
+failed:
+	**thrash = entry;
+	(*thrash)++;
+	return 0;
+}
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
Index: linux-2.6.22/mm/swapfile.c
===================================================================
--- linux-2.6.22.orig/mm/swapfile.c	2007-08-23 15:29:55.818344822 +0800
+++ linux-2.6.22/mm/swapfile.c	2007-08-23 15:30:09.571204322 +0800
@@ -501,6 +501,183 @@
 }
 #endif

+static int pps_test_swap_type(struct mm_struct* mm, pmd_t* pmd, pte_t* pte, int
+		type, struct page** ret_page)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	swp_entry_t entry;
+	struct page* page;
+	int result = 1;
+
+	spin_lock(ptl);
+	if (pte_none(*pte))
+		result = 0;
+	else if (!pte_present(*pte) && pte_swapped(*pte)) { // SwappedPTE.
+		entry = pte_to_swp_entry(*pte);
+		if (swp_type(entry) == type)
+			*ret_page = NULL;
+		else
+			result = 0;
+	} else { // UnmappedPTE and (Present, Untouched)PTE.
+		page = pfn_to_page(pte_pfn(*pte));
+		if (!PagePPS(page)) { // The page is a readahead page.
+			if (PageSwapCache(page)) {
+				entry.val = page_private(page);
+				if (swp_type(entry) == type)
+					*ret_page = NULL;
+				else
+					result = 0;
+			} else
+				result = 0;
+		} else if (PageSwapCache(page)) {
+			entry.val = page_private(page);
+			if (swp_type(entry) == type) {
+				page_cache_get(page);
+				*ret_page = page;
+			} else
+				result = 0;
+		} else
+			result = 0;
+	}
+	spin_unlock(ptl);
+	return result;
+}
+
+static int pps_swapoff_scan_ptes(struct mm_struct* mm, struct vm_area_struct*
+		vma, pmd_t* pmd, unsigned long addr, unsigned long end, int type)
+{
+	pte_t *pte;
+	struct page* page = (struct page*) 0xffffffff;
+	swp_entry_t entry;
+	struct pglist_data* node_data;
+
+	pte = pte_offset_map(pmd, addr);
+	do {
+		while (pps_test_swap_type(mm, pmd, pte, type, &page)) {
+			if (page == NULL) {
+				switch (__handle_mm_fault(mm, vma, addr, 0)) {
+				case VM_FAULT_SIGBUS:
+				case VM_FAULT_OOM:
+					return -ENOMEM;
+				case VM_FAULT_MINOR:
+				case VM_FAULT_MAJOR:
+					break;
+				default:
+					BUG();
+				}
+			} else {
+				wait_on_page_locked(page);
+				wait_on_page_writeback(page);
+				lock_page(page);
+				if (!PageSwapCache(page))
+					goto done;
+				else {
+					entry.val = page_private(page);
+					if (swp_type(entry) != type)
+						goto done;
+				}
+				wait_on_page_writeback(page);
+				node_data = NODE_DATA(page_to_nid(page));
+				delete_from_swap_cache(page);
+				atomic_dec(&node_data->nr_swapped_pte);
+done:
+				unlock_page(page);
+				page_cache_release(page);
+				break;
+			}
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(pte);
+	return 0;
+}
+
+static int pps_swapoff_pmd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pud_t* pud, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		ret = pps_swapoff_scan_ptes(mm, vma, pmd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pmd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pud_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, pgd_t* pgd, unsigned long addr, unsigned long end, int type)
+{
+	unsigned long next;
+	int ret;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		ret = pps_swapoff_pmd_range(mm, vma, pud, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pud++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff_pgd_range(struct mm_struct* mm, struct vm_area_struct*
+		vma, int type)
+{
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	int ret;
+	pgd_t* pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		ret = pps_swapoff_pud_range(mm, vma, pgd, addr, next, type);
+		if (ret == -ENOMEM)
+			return ret;
+	} while (pgd++, addr = next, addr != end);
+	return 0;
+}
+
+static int pps_swapoff(int type)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+	int ret = 0;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+			if (!(vma->vm_flags & VM_PURE_PRIVATE))
+				continue;
+			ret = pps_swapoff_pgd_range(mm, vma, type);
+			if (ret == -ENOMEM)
+				break;
+		}
+		up_read(&mm->mmap_sem);
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+	return ret;
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -694,6 +871,12 @@
 	int reset_overflow = 0;
 	int shmem;

+	// Let's first read all pps pages back! Note, it's one-to-one mapping.
+	retval = pps_swapoff(type);
+	if (retval == -ENOMEM) // something was wrong.
+		return -ENOMEM;
+	// Now, the remain pages are shared pages, go ahead!
+
 	/*
 	 * When searching mms for an entry, a good strategy is to
 	 * start at the first mm we freed the previous entry from
@@ -914,16 +1097,20 @@
  */
 static void drain_mmlist(void)
 {
-	struct list_head *p, *next;
+	// struct list_head *p, *next;
 	unsigned int i;

 	for (i = 0; i < nr_swapfiles; i++)
 		if (swap_info[i].inuse_pages)
 			return;
+	/*
+	 * Now the init_mm.mmlist list is used not only by the SwapDevice but
+	 * also by PPS, see Documentation/vm_pps.txt.
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
 		list_del_init(p);
 	spin_unlock(&mmlist_lock);
+	*/
 }

 /*
@@ -1796,3 +1983,235 @@
 	spin_unlock(&swap_lock);
 	return ret;
 }
+
+// Copy from scan_swap_map.
+// parameter SERIES_LENGTH >= count >= 1.
+static inline unsigned long scan_swap_map_batchly(struct swap_info_struct *si,
+		int type, int count, swp_entry_t avail_swps[SERIES_BOUND])
+{
+	unsigned long offset, last_in_cluster, result = 0;
+	int latency_ration = LATENCY_LIMIT;
+
+	si->flags += SWP_SCANNING;
+	if (unlikely(!si->cluster_nr)) {
+		si->cluster_nr = SWAPFILE_CLUSTER - 1;
+		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER)
+			goto lowest;
+		spin_unlock(&swap_lock);
+
+		offset = si->lowest_bit;
+		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
+
+		/* Locate the first empty (unaligned) cluster */
+		for (; last_in_cluster <= si->highest_bit; offset++) {
+			if (si->swap_map[offset])
+				last_in_cluster = offset + SWAPFILE_CLUSTER;
+			else if (offset == last_in_cluster) {
+				spin_lock(&swap_lock);
+				si->cluster_next = offset-SWAPFILE_CLUSTER+1;
+				goto cluster;
+			}
+			if (unlikely(--latency_ration < 0)) {
+				cond_resched();
+				latency_ration = LATENCY_LIMIT;
+			}
+		}
+		spin_lock(&swap_lock);
+		goto lowest;
+	}
+
+	si->cluster_nr--;
+cluster:
+	offset = si->cluster_next;
+	if (offset > si->highest_bit)
+lowest:		offset = si->lowest_bit;
+checks:	if (!(si->flags & SWP_WRITEOK))
+		goto no_page;
+	if (!si->highest_bit)
+		goto no_page;
+	if (!si->swap_map[offset]) {
+		int i;
+		for (i = 0; !si->swap_map[offset] && (result != count) &&
+				offset <= si->highest_bit; offset++, i++) {
+			si->swap_map[offset] = 1;
+			avail_swps[result++] = swp_entry(type, offset);
+		}
+		si->cluster_next = offset;
+		si->cluster_nr -= i;
+		if (offset - i == si->lowest_bit)
+			si->lowest_bit += i;
+		if (offset == si->highest_bit)
+			si->highest_bit -= i;
+		si->inuse_pages += i;
+		if (si->inuse_pages == si->pages) {
+			si->lowest_bit = si->max;
+			si->highest_bit = 0;
+		}
+		if (result == count)
+			goto no_page;
+	}
+
+	spin_unlock(&swap_lock);
+	while (++offset <= si->highest_bit) {
+		if (!si->swap_map[offset]) {
+			spin_lock(&swap_lock);
+			goto checks;
+		}
+		if (unlikely(--latency_ration < 0)) {
+			cond_resched();
+			latency_ration = LATENCY_LIMIT;
+		}
+	}
+	spin_lock(&swap_lock);
+	goto lowest;
+
+no_page:
+	avail_swps[result].val = 0;
+	si->flags -= SWP_SCANNING;
+	return result;
+}
+
+void swap_free_batchly(swp_entry_t entries[2 * SERIES_BOUND])
+{
+	struct swap_info_struct* p;
+	int i;
+
+	spin_lock(&swap_lock);
+	for (i = 0; entries[i].val != 0; i++) {
+		p = &swap_info[swp_type(entries[i])];
+		BUG_ON(p->swap_map[swp_offset(entries[i])] != 1);
+		if (p)
+			swap_entry_free(p, swp_offset(entries[i]));
+	}
+	spin_unlock(&swap_lock);
+}
+
+// parameter SERIES_LENGTH >= count >= 1.
+int swap_alloc_batchly(int count, swp_entry_t avail_swps[SERIES_BOUND], int
+		end_prio)
+{
+	int result = 0, type = swap_list.head, orig_count = count;
+	struct swap_info_struct* si;
+	spin_lock(&swap_lock);
+	if (nr_swap_pages <= 0)
+		goto done;
+	for (si = &swap_info[type]; type >= 0 && si->prio > end_prio;
+			type = si->next, si = &swap_info[type]) {
+		if (!si->highest_bit)
+			continue;
+		if (!(si->flags & SWP_WRITEOK))
+			continue;
+		result = scan_swap_map_batchly(si, type, count, avail_swps);
+		nr_swap_pages -= result;
+		avail_swps += result;
+		if (result == count) {
+			count = 0;
+			break;
+		}
+		count -= result;
+	}
+done:
+	avail_swps[0].val = 0;
+	spin_unlock(&swap_lock);
+	return orig_count - count;
+}
+
+// parameter SERIES_LENGTH >= count >= 1.
+void swap_alloc_around_nail(swp_entry_t nail_swp, int count, swp_entry_t
+		avail_swps[SERIES_BOUND])
+{
+	int i, result = 0, type, offset;
+	struct swap_info_struct *si = &swap_info[swp_type(nail_swp)];
+	spin_lock(&swap_lock);
+	if (nr_swap_pages <= 0)
+		goto done;
+	BUG_ON(nail_swp.val == 0);
+	// Always allocate from high priority (quicker) SwapDevice.
+	if (si->prio < swap_info[swap_list.head].prio) {
+		spin_unlock(&swap_lock);
+		result = swap_alloc_batchly(count, avail_swps, si->prio);
+		avail_swps += result;
+		if (result == count)
+			return;
+		count -= result;
+		spin_lock(&swap_lock);
+	}
+	type = swp_type(nail_swp);
+	offset = swp_offset(nail_swp);
+	result = 0;
+	if (!si->highest_bit)
+		goto done;
+	if (!(si->flags & SWP_WRITEOK))
+		goto done;
+	for (i = max_t(int, offset - 32, si->lowest_bit); i <= min_t(int,
+			offset + 32, si->highest_bit) && count != 0; i++) {
+		if (!si->swap_map[i]) {
+			count--;
+			avail_swps[result++] = swp_entry(type, i);
+			si->swap_map[i] = 1;
+		}
+	}
+	if (result != 0) {
+		nr_swap_pages -= result;
+		si->inuse_pages += result;
+		if (swp_offset(avail_swps[0]) == si->lowest_bit)
+			si->lowest_bit = swp_offset(avail_swps[result-1]) + 1;
+		if (swp_offset(avail_swps[result - 1]) == si->highest_bit)
+			si->highest_bit = swp_offset(avail_swps[0]) - 1;
+		if (si->inuse_pages == si->pages) {
+			si->lowest_bit = si->max;
+			si->highest_bit = 0;
+		}
+	}
+done:
+	spin_unlock(&swap_lock);
+	avail_swps[result].val = 0;
+}
+
+// parameter SERIES_LENGTH >= count >= 1.
+// avail_swps is set only when success.
+int swap_try_alloc_batchly(swp_entry_t central_swp, int count, swp_entry_t
+		avail_swps[SERIES_BOUND])
+{
+	int i, result = 0, type, offset, j = 0;
+	struct swap_info_struct *si = &swap_info[swp_type(central_swp)];
+	BUG_ON(central_swp.val == 0);
+	spin_lock(&swap_lock);
+	// Always allocate from high priority (quicker) SwapDevice.
+	if (nr_swap_pages <= 0 || si->prio < swap_info[swap_list.head].prio)
+		goto done;
+	type = swp_type(central_swp);
+	offset = swp_offset(central_swp);
+	if (!si->highest_bit)
+		goto done;
+	if (!(si->flags & SWP_WRITEOK))
+		goto done;
+	for (i = max_t(int, offset - 32, si->lowest_bit); i <= min_t(int,
+			offset + 32, si->highest_bit) && count != 0; i++) {
+		if (!si->swap_map[i]) {
+			count--;
+			avail_swps[j++] = swp_entry(type, i);
+			si->swap_map[i] = 1;
+		}
+	}
+	if (j == count) {
+		nr_swap_pages -= count;
+		avail_swps[j].val = 0;
+		si->inuse_pages += j;
+		if (swp_offset(avail_swps[0]) == si->lowest_bit)
+			si->lowest_bit = swp_offset(avail_swps[count - 1]) + 1;
+		if (swp_offset(avail_swps[count - 1]) == si->highest_bit)
+			si->highest_bit = swp_offset(avail_swps[0]) - 1;
+		if (si->inuse_pages == si->pages) {
+			si->lowest_bit = si->max;
+			si->highest_bit = 0;
+		}
+		result = 1;
+	} else {
+		for (i = 0; i < j; i++)
+			si->swap_map[swp_offset(avail_swps[i])] = 0;
+	}
+done:
+	spin_unlock(&swap_lock);
+	return result;
+}
Index: linux-2.6.22/mm/vmscan.c
===================================================================
--- linux-2.6.22.orig/mm/vmscan.c	2007-08-23 15:26:44.826408572 +0800
+++ linux-2.6.22/mm/vmscan.c	2007-08-23 16:25:37.003155822 +0800
@@ -66,6 +66,10 @@
 	int swappiness;

 	int all_unreclaimable;
+
+	/* pps control command. See Documentation/vm_pps.txt. */
+	int reclaim_node;
+	int is_kppsd;
 };

 /*
@@ -1097,6 +1101,746 @@
 	return ret;
 }

+// pps fields, see Documentation/vm_pps.txt.
+static int accelerate_kppsd = 0;
+static wait_queue_head_t kppsd_wait;
+
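+// A 'series' is a snapshot of up to SERIES_LENGTH consecutive PTEs together
+// with their pages and swap entries, taken so the later stages can re-verify
+// them; find_series() additionally cuts the run at the first stage change.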
+struct series_t {
+	pte_t orig_ptes[SERIES_LENGTH];
+	pte_t* ptes[SERIES_LENGTH];
+	swp_entry_t swps[SERIES_LENGTH];
+	struct page* pages[SERIES_LENGTH];
+	int stages[SERIES_LENGTH];
+	unsigned long addrs[SERIES_LENGTH];
+	int series_length;
+	int series_stage;
+};
+
+/*
+ * Here we take a snapshot of each (Unmapped)PTE-Page pair for the later
+ * stageX passes. Some fields can change after the snapshot is taken, so they
+ * must be re-verified in pps_stageX. See the <Concurrent Racers of pps>
+ * section of Documentation/vm_pps.txt.
+ *
+ * For example, an UnmappedPTE/SwappedPTE can be remapped to a PresentPTE and
+ * page->private can be freed after the snapshot, but a PresentPTE can't shift
+ * to UnmappedPTE and the page can't be (re-)allocated a swap entry.
+ */
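+// Stage values returned below: -1 reserved page, 0 NullPTE, 1 present+young,
+// 2 present+old, 3 needs a (new or relinked) swap entry, 4 read/write IO in
+// flight, 5 reclaimable, 6 already a SwappedPTE.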
+static int get_series_stage(struct series_t* series, pte_t* pte, unsigned long
+		addr, int index)
+{
+	struct page* page = NULL;
+	unsigned long flags;
+	series->addrs[index] = addr;
+	series->orig_ptes[index] = *pte;
+	series->ptes[index] = pte;
+	if (pte_present(series->orig_ptes[index])) {
+		page = pfn_to_page(pte_pfn(series->orig_ptes[index]));
+		if (page == ZERO_PAGE(addr)) // reserved page is excluded.
+			return -1;
+		if (pte_young(series->orig_ptes[index])) {
+			return 1;
+		} else
+			return 2;
+	} else if (pte_unmapped(series->orig_ptes[index])) {
+		page = pfn_to_page(pte_pfn(series->orig_ptes[index]));
+		series->pages[index] = page;
+		flags = page->flags;
+		series->swps[index].val = page_private(page);
+		if (series->swps[index].val == 0)
+			return 3;
+		if (!test_bit(PG_pps, &flags)) { // readaheaded page.
+			if (test_bit(PG_locked, &flags)) // ReadIOing.
+				return 4;
+			// Reclaim the page whether or not it hit a read-IO
+			// error (i.e. regardless of PG_uptodate).
+			return 5;
+		} else {
+			if (test_bit(PG_writeback, &flags)) // WriteIOing.
+				return 4;
+			if (!test_bit(PG_dirty, &flags))
+				return 5;
+			// Two cases reach here: the page hit a write-IO
+			// error, or it is a dirty page whose SwapEntry link
+			// must be re-established.
+			return 3;
+		}
+	} else if (pte_swapped(series->orig_ptes[index])) { // SwappedPTE
+		series->swps[index] =
+			pte_to_swp_entry(series->orig_ptes[index]);
+		return 6;
+	} else // NullPTE
+		return 0;
+}
+
+static void find_series(struct series_t* series, pte_t** start, unsigned long*
+		addr, unsigned long end)
+{
+	int i;
+	int series_stage = get_series_stage(series, (*start)++, *addr, 0);
+	*addr += PAGE_SIZE;
+
+	for (i = 1; i < SERIES_LENGTH && *addr < end; i++, (*start)++,
+		*addr += PAGE_SIZE) {
+		if (series_stage != get_series_stage(series, *start, *addr, i))
+			break;
+	}
+	series->series_stage = series_stage;
+	series->series_length = i;
+}
+
+#define DFTLB_CAPACITY 32
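+// Pending "delay flush TLB" work: per-(vma, pmd) PTE ranges are queued here
+// by fill_in_tlb_tasks(), and end_dftlb() later flushes the TLB once on every
+// CPU before __pps_stage2() converts the collected PresentPTEs to
+// UnmappedPTEs.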
+struct {
+	struct mm_struct* mm;
+	int vma_index;
+	struct vm_area_struct* vma[DFTLB_CAPACITY];
+	pmd_t* pmd[DFTLB_CAPACITY];
+	unsigned long start[DFTLB_CAPACITY];
+	unsigned long end[DFTLB_CAPACITY];
+} dftlb_tasks = { 0 };
+
+// The prototype of this function matches the "func" parameter of "int
+// smp_call_function (void (*func) (void *info), void *info, int retry, int
+// wait);" in include/linux/smp.h (as of 2.6.16.29). Call it with NULL.
+void flush_tlb_tasks(void* data)
+{
+#ifdef CONFIG_X86
+	local_flush_tlb();
+#else
+	int i;
+	for (i = 0; i < dftlb_tasks.vma_index; i++) {
+		// smp::local_flush_tlb_range(dftlb_tasks.{vma, start, end});
+	}
+#endif
+}
+
+static void __pps_stage2(void)
+{
+	int anon_rss = 0, file_rss = 0, unmapped_pte = 0, present_pte = 0, i;
+	unsigned long addr;
+	spinlock_t* ptl = pte_lockptr(dftlb_tasks.mm, dftlb_tasks.pmd[0]);
+	pte_t pte_orig, pte_unmapped, *pte;
+	struct page* page;
+	struct vm_area_struct* vma;
+	struct pglist_data* node_data = NULL;
+
+	spin_lock(ptl);
+	for (i = 0; i < dftlb_tasks.vma_index; i++) {
+		vma = dftlb_tasks.vma[i];
+		addr = dftlb_tasks.start[i];
+		if (i != 0 && dftlb_tasks.pmd[i] != dftlb_tasks.pmd[i - 1]) {
+			pte_unmap_unlock(pte, ptl);
+			ptl = pte_lockptr(dftlb_tasks.mm, dftlb_tasks.pmd[i]);
+			spin_lock(ptl);
+		}
+		pte = pte_offset_map(dftlb_tasks.pmd[i], addr);
+		for (; addr != dftlb_tasks.end[i]; addr += PAGE_SIZE, pte++) {
+			if (node_data != NULL && node_data !=
+					NODE_DATA(numa_addr_to_nid(vma, addr)))
+			{
+				atomic_add(unmapped_pte,
+						&node_data->nr_unmapped_pte);
+				atomic_sub(present_pte,
+						&node_data->nr_present_pte);
+			}
+			node_data = NODE_DATA(numa_addr_to_nid(vma, addr));
+			pte_orig = *pte;
+			if (pte_young(pte_orig))
+				continue;
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				pte_unmapped = pte_mkunmapped(pte_orig);
+			} else
+				pte_unmapped = __pte(0);
+			// We're safe as long as the target CPU satisfies the
+			// two conditions listed in the dftlb section.
+			if (cmpxchg(&pte->pte_low, pte_orig.pte_low,
+						pte_unmapped.pte_low) !=
+					pte_orig.pte_low)
+				continue;
+			page = pfn_to_page(pte_pfn(pte_orig));
+			if (pte_dirty(pte_orig))
+				set_page_dirty(page);
+			update_hiwater_rss(dftlb_tasks.mm);
+			if (vma->vm_flags & VM_PURE_PRIVATE) {
+				// anon_rss--, page_remove_rmap(page, vma) and
+				// page_cache_release(page) are done at stage5.
+				unmapped_pte++;
+				present_pte++;
+			} else {
+				page_remove_rmap(page, vma);
+				if (PageAnon(page))
+					anon_rss--;
+				else
+					file_rss--;
+				page_cache_release(page);
+			}
+		}
+	}
+	atomic_add(unmapped_pte, &node_data->nr_unmapped_pte);
+	atomic_sub(present_pte, &node_data->nr_present_pte);
+	pte_unmap_unlock(pte, ptl);
+	add_mm_counter(dftlb_tasks.mm, anon_rss, anon_rss);
+	add_mm_counter(dftlb_tasks.mm, file_rss, file_rss);
+}
+
+static void start_dftlb(struct mm_struct* mm)
+{
+	dftlb_tasks.mm = mm;
+	BUG_ON(dftlb_tasks.vma_index != 0);
+	BUG_ON(dftlb_tasks.vma[0] != NULL);
+}
+
+static void end_dftlb(void)
+{
+	// Strictly, only the CPUs present in dftlb_tasks.mm->cpu_vm_mask need
+	// to be paused, but the current on_each_cpu can't be restricted to a
+	// CPU mask.
+	if (dftlb_tasks.vma_index != 0 || dftlb_tasks.vma[0] != NULL) {
+		on_each_cpu(flush_tlb_tasks, NULL, 0, 1);
+
+		if (dftlb_tasks.vma_index != DFTLB_CAPACITY)
+			dftlb_tasks.vma_index++;
+		// Convert PresentPTEs to UnmappedPTEs in one batch -- dftlb.
+		__pps_stage2();
+		dftlb_tasks.vma_index = 0;
+		memset(dftlb_tasks.vma, 0, sizeof(dftlb_tasks.vma));
+	}
+}
+
+static void fill_in_tlb_tasks(struct vm_area_struct* vma, pmd_t* pmd, unsigned
+		long addr, unsigned long end)
+{
+	// If the target CPU doesn't support dftlb, flush and unmap the
+	// PresentPTEs right here:
+	// flush_tlb_range(vma, addr, end); //<-- and unmap PresentPTEs.
+	// return;
+
+	// dftlb: queue the unmapping task in the static region dftlb_tasks;
+	// once it is full, the whole batch is flushed in end_dftlb().
+	if (dftlb_tasks.vma[dftlb_tasks.vma_index] != NULL &&
+			dftlb_tasks.vma[dftlb_tasks.vma_index] == vma &&
+			dftlb_tasks.pmd[dftlb_tasks.vma_index] == pmd &&
+			dftlb_tasks.end[dftlb_tasks.vma_index] == addr) {
+		dftlb_tasks.end[dftlb_tasks.vma_index] = end;
+	} else {
+		if (dftlb_tasks.vma[dftlb_tasks.vma_index] != NULL)
+			dftlb_tasks.vma_index++;
+		if (dftlb_tasks.vma_index == DFTLB_CAPACITY)
+			end_dftlb();
+		dftlb_tasks.vma[dftlb_tasks.vma_index] = vma;
+		dftlb_tasks.pmd[dftlb_tasks.vma_index] = pmd;
+		dftlb_tasks.start[dftlb_tasks.vma_index] = addr;
+		dftlb_tasks.end[dftlb_tasks.vma_index] = end;
+	}
+}
+
+static void pps_stage1(spinlock_t* ptl, struct vm_area_struct* vma, unsigned
+		long addr, struct series_t* series)
+{
+	int i;
+	spin_lock(ptl);
+	for (i = 0; i < series->series_length; i++)
+		ptep_clear_flush_young(vma, addr + i * PAGE_SIZE,
+				series->ptes[i]);
+	spin_unlock(ptl);
+}
+
+static void pps_stage2(struct vm_area_struct* vma, pmd_t* pmd, struct series_t*
+		series)
+{
+	fill_in_tlb_tasks(vma, pmd, series->addrs[0],
+			series->addrs[series->series_length - 1] + PAGE_SIZE);
+}
+
+// Which of realloc_pages still need a swap entry re-allocated around nail_swp?
+static int calc_realloc(struct series_t* series, swp_entry_t nail_swp, int
+		realloc_pages[SERIES_BOUND], int remain_pages[SERIES_BOUND])
+{
+	int i, count = 0;
+	int swap_type = swp_type(nail_swp);
+	int swap_offset = swp_offset(nail_swp);
+	swp_entry_t temp;
+	for (i = 0; realloc_pages[i] != -1; i++) {
+		temp = series->swps[realloc_pages[i]];
+		if (temp.val != 0
+				// The swap entry is already close to the nail.
+				// 'Close' means close on disk, so the swap
+				// file layer should really provide its own
+				// notion of closeness.
+				&& swp_type(temp) == swap_type &&
+				abs(swp_offset(temp) - swap_offset) < 32)
+			continue;
+		remain_pages[count++] = realloc_pages[i];
+	}
+	remain_pages[count] = -1;
+	return count;
+}
+
+static int realloc_around_nails(struct series_t* series, swp_entry_t nail_swp,
+		int realloc_pages[SERIES_BOUND],
+		int remain_pages[SERIES_BOUND],
+		swp_entry_t** thrash_cursor, int* boost, int tryit)
+{
+	int i, need_count;
+	swp_entry_t avail_swps[SERIES_BOUND];
+
+	need_count = calc_realloc(series, nail_swp, realloc_pages,
+			remain_pages);
+	if (!need_count)
+		return 0;
+	*boost = 0;
+	if (tryit) {
+		if (!swap_try_alloc_batchly(nail_swp, need_count, avail_swps))
+			return need_count;
+	} else
+		swap_alloc_around_nail(nail_swp, need_count, avail_swps);
+	for (i = 0; avail_swps[i].val != 0; i++) {
+		if (!pps_relink_swp(series->pages[remain_pages[(*boost)++]],
+					avail_swps[i], thrash_cursor)) {
+			for (++i; avail_swps[i].val != 0; i++) {
+				**thrash_cursor = avail_swps[i];
+				(*thrash_cursor)++;
+			}
+			return -1;
+		}
+	}
+	return need_count - *boost;
+}
+
+static void pps_stage3(struct series_t* series,
+		swp_entry_t nail_swps[SERIES_BOUND + 1],
+		int realloc_pages[SERIES_BOUND])
+{
+	int i, j, remain, boost = 0;
+	swp_entry_t thrash[SERIES_BOUND * 2];
+	swp_entry_t* thrash_cursor = &thrash[0];
+	int rotate_buffers[SERIES_BOUND * 2];
+	int *realloc_cursor = realloc_pages, *rotate_cursor;
+	swp_entry_t avail_swps[SERIES_BOUND];
+
+	// 1) Re-allocate swap entries around the nail swap entries.
+	for (i = 0; nail_swps[i].val != 0; i++) {
+		rotate_cursor = i % 2 == 0 ? &rotate_buffers[0] :
+			&rotate_buffers[SERIES_BOUND];
+		remain = realloc_around_nails(series, nail_swps[i],
+				realloc_cursor, rotate_cursor, &thrash_cursor,
+				&boost, 0);
+		realloc_cursor = rotate_cursor + boost;
+		if (remain == 0 || remain == -1)
+			goto done;
+	}
+
+	// 2) allocate swap entries for remaining realloc_pages.
+	rotate_cursor = i % 2 == 0 ? &rotate_buffers[0] :
+		&rotate_buffers[SERIES_BOUND];
+	for (i = 0; *(realloc_cursor + i) != -1; i++) {
+		swp_entry_t entry = series->swps[*(realloc_cursor + i)];
+		if (entry.val == 0)
+			continue;
+		remain = realloc_around_nails(series, entry, realloc_cursor,
+				rotate_cursor, &thrash_cursor, &boost, 1);
+		if (remain == 0 || remain == -1)
+			goto done;
+	}
+	// Currently a priority of (int) 0xf0000000 is low enough to safely
+	// try every SwapDevice.
+	swap_alloc_batchly(i, avail_swps, (int) 0xf0000000);
+	for (i = 0, j = 0; avail_swps[i].val != 0; i++, j++) {
+		if (!pps_relink_swp(series->pages[*(realloc_cursor + j)],
+					avail_swps[i], &thrash_cursor)) {
+			for (++i; avail_swps[i].val != 0; i++) {
+				*thrash_cursor = avail_swps[i];
+				thrash_cursor++;
+			}
+			break;
+		}
+	}
+
+done:
+	(*thrash_cursor).val = 0;
+	swap_free_batchly(thrash);
+}
+
+/*
+ * A mini version of pageout().
+ *
+ * The current swap code can't commit multiple pages together :(
+ */
+static void pps_stage4(struct page* page)
+{
+	int res;
+	struct address_space* mapping = &swapper_space;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!may_write_to_queue(mapping->backing_dev_info))
+		goto unlock_page;
+	if (!PageSwapCache(page))
+		goto unlock_page;
+	if (!clear_page_dirty_for_io(page))
+		goto unlock_page;
+	page_cache_get(page);
+	SetPageReclaim(page);
+	res = swap_writepage(page, &wbc); // << page is unlocked here.
+	if (res < 0) {
+		handle_write_error(mapping, page, res);
+		ClearPageReclaim(page);
+		page_cache_release(page);
+		return;
+	}
+	inc_zone_page_state(page, NR_VMSCAN_WRITE);
+	if (!PageWriteback(page))
+		ClearPageReclaim(page);
+	page_cache_release(page);
+	return;
+
+unlock_page:
+	unlock_page(page);
+}
+
+static int pps_stage5(spinlock_t* ptl, struct vm_area_struct* vma, struct
+		mm_struct* mm, struct series_t* series, int index, struct
+		pagevec* freed_pvec)
+{
+	swp_entry_t entry;
+	pte_t pte_swp;
+	struct page* page = series->pages[index];
+	struct pglist_data* node_data = NODE_DATA(page_to_nid(page));
+
+	if (TestSetPageLocked(page))
+		goto failed;
+	if (!PageSwapCache(page))
+		goto unlock_page;
+	BUG_ON(PageWriteback(page));
+	/* We're racing with get_user_pages. Copy from remove_mapping(). */
+	if (page_count(page) > 2)
+		goto unlock_page;
+	smp_rmb();
+	if (unlikely(PageDirty(page)))
+		goto unlock_page;
+	/* We're racing with get_user_pages. END */
+	spin_lock(ptl);
+	if (!pte_same(*series->ptes[index], series->orig_ptes[index])) {
+		spin_unlock(ptl);
+		goto unlock_page;
+	}
+	entry.val = page_private(page);
+	pte_swp = swp_entry_to_pte(entry);
+	set_pte_at(mm, series->addrs[index], series->ptes[index], pte_swp);
+	add_mm_counter(mm, anon_rss, -1);
+	if (PagePPS(page)) {
+		swap_duplicate(entry);
+		pps_page_destruction(page, vma, series->addrs[index], 0);
+		atomic_dec(&node_data->nr_unmapped_pte);
+		atomic_inc(&node_data->nr_swapped_pte);
+	} else
+		page_cache_get(page);
+	delete_from_swap_cache(page);
+	spin_unlock(ptl);
+	unlock_page(page);
+
+	if (!pagevec_add(freed_pvec, page))
+		__pagevec_release_nonlru(freed_pvec);
+	return 1;
+
+unlock_page:
+	unlock_page(page);
+failed:
+	return 0;
+}
+
+static void find_series_pgdata(struct series_t* series, pte_t** start, unsigned
+		long* addr, unsigned long end)
+{
+	int i;
+
+	for (i = 0; i < SERIES_LENGTH && *addr < end; i++, (*start)++, *addr +=
+			PAGE_SIZE)
+		series->stages[i] = get_series_stage(series, *start, *addr, i);
+	series->series_length = i;
+}
+
+// pps_stage 3 -- 4.
+static unsigned long pps_shrink_pgdata(struct scan_control* sc, struct
+		series_t* series, struct mm_struct* mm, struct vm_area_struct*
+		vma, struct pagevec* freed_pvec, spinlock_t* ptl)
+{
+	int i, nr_nail = 0, nr_realloc = 0;
+	unsigned long nr_reclaimed = 0;
+	struct pglist_data* node_data = NODE_DATA(sc->reclaim_node);
+	int realloc_pages[SERIES_BOUND];
+	swp_entry_t nail_swps[SERIES_BOUND + 1], prev, next;
+
+	// 1) Work out which entries are nail swap entries and which are not.
+	for (i = 0; i < series->series_length; i++) {
+		switch (series->stages[i]) {
+			case -1 ... 2:
+				break;
+			case 5:
+				nr_reclaimed += pps_stage5(ptl, vma, mm,
+						series, i, freed_pvec);
+				// Fall through!
+			case 4:
+			case 6:
+				nail_swps[nr_nail++] = series->swps[i];
+				break;
+			case 3:
+				// NOTE: we lock all realloc pages here, which
+				// simplifies the code. Be aware there is no
+				// lock order making an earlier page of the
+				// series take priority over a later one; it
+				// merely happens to be safe for pps today.
+				if (!TestSetPageLocked(series->pages[i]))
+					realloc_pages[nr_realloc++] = i;
+				break;
+		}
+	}
+	realloc_pages[nr_realloc] = -1;
+
+	/* 2) Series continuity rules.
+	 * In most cases the first allocation from a SwapDevice has the best
+	 * continuity, so the principles are
+	 * A) don't destroy the continuity of the remaining series;
+	 * B) don't propagate a broken series to the others!
+	 */
+	prev = series->swps[0];
+	if (prev.val != 0) {
+		for (i = 1; i < series->series_length; i++, prev = next) {
+			next = series->swps[i];
+			if (next.val == 0)
+				break;
+			if (swp_type(prev) != swp_type(next))
+				break;
+			if (abs(swp_offset(prev) - swp_offset(next)) > 2)
+				break;
+		}
+		if (i == series->series_length)
+			// The series has the best continuity, flush it
+			// directly.
+			goto flush_it;
+	}
+	/*
+	 * last_nail_swp carries the continuity of the previous series, which
+	 * may have been re-positioned elsewhere because of SwapDevice
+	 * shortage. According to the rules above, last_nail_swp must
+	 * therefore be placed at the tail of nail_swps, not at the head!
+	 */
+	if (node_data->last_nail_addr != 0) {
+		// Reset nail if it's too far from us.
+		if (series->addrs[0] - node_data->last_nail_addr > 8 *
+				PAGE_SIZE)
+			node_data->last_nail_addr = 0;
+	}
+	if (node_data->last_nail_addr != 0)
+		nail_swps[nr_nail++] = swp_entry(node_data->last_nail_swp_type,
+				node_data->last_nail_swp_offset);
+	nail_swps[nr_nail].val = 0;
+
+	// 3) nail arithmetic and flush them.
+	if (sc->may_swap && nr_realloc != 0)
+		pps_stage3(series, nail_swps, realloc_pages);
+flush_it:
+	if (sc->may_writepage && (sc->gfp_mask & (__GFP_FS | __GFP_IO))) {
+		for (i = 0; i < nr_realloc; i++)
+			// pages are unlocked in pps_stage4 >> swap_writepage.
+			pps_stage4(series->pages[realloc_pages[i]]);
+	} else {
+		for (i = 0; i < nr_realloc; i++)
+			unlock_page(series->pages[realloc_pages[i]]);
+	}
+
+	// 4) boost last_nail_swp.
+	for (i = series->series_length - 1; i >= 0; i--) {
+		pte_t pte = *series->ptes[i];
+		if (pte_none(pte))
+			continue;
+		else if ((!pte_present(pte) && pte_unmapped(pte)) ||
+				pte_present(pte)) {
+			struct page* page = pfn_to_page(pte_pfn(pte));
+			nail_swps[0].val = page_private(page);
+			if (nail_swps[0].val == 0)
+				continue;
+			node_data->last_nail_swp_type = swp_type(nail_swps[0]);
+			node_data->last_nail_swp_offset =
+				swp_offset(nail_swps[0]);
+		} else if (pte_swapped(pte)) {
+			nail_swps[0] = pte_to_swp_entry(pte);
+			node_data->last_nail_swp_type = swp_type(nail_swps[0]);
+			node_data->last_nail_swp_offset =
+				swp_offset(nail_swps[0]);
+		}
+		node_data->last_nail_addr = series->addrs[i];
+		break;
+	}
+
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_scan_ptes(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pmd_t* pmd, unsigned
+		long addr, unsigned long end)
+{
+	spinlock_t* ptl = pte_lockptr(mm, pmd);
+	pte_t* pte = pte_offset_map(pmd, addr);
+	struct series_t series;
+	unsigned long nr_reclaimed = 0;
+	struct pagevec freed_pvec;
+	pagevec_init(&freed_pvec, 1);
+
+	do {
+		memset(&series, 0, sizeof(struct series_t));
+		if (sc->is_kppsd) {
+			find_series(&series, &pte, &addr, end);
+			BUG_ON(series.series_length == 0);
+			switch (series.series_stage) {
+				case 1: // PresentPTE -- untouched PTE.
+					pps_stage1(ptl, vma, addr, &series);
+					break;
+				case 2: // untouched PTE -- UnmappedPTE.
+					pps_stage2(vma, pmd, &series);
+					break;
+				case 3 ... 5:
+	/* We can collect unmapped_age defined in <stage definition> here by
+	 * the scanning count of global kppsd.
+	spin_lock(ptl);
+	for (i = 0; i < series.series_length; i++) {
+		if (pte_unmapped(series.ptes[i]))
+			((struct pps_page*) series.pages[i])->unmapped_age++;
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	*/
+					break;
+			}
+		} else {
+			find_series_pgdata(&series, &pte, &addr, end);
+			BUG_ON(series.series_length == 0);
+			nr_reclaimed += pps_shrink_pgdata(sc, &series, mm, vma,
+					&freed_pvec, ptl);
+		}
+	} while (addr < end);
+	pte_unmap(pte);
+	if (pagevec_count(&freed_pvec))
+		__pagevec_release_nonlru(&freed_pvec);
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_pmd_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pud_t* pud, unsigned
+		long addr, unsigned long end)
+{
+	unsigned long next;
+	unsigned long nr_reclaimed = 0;
+	pmd_t* pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		nr_reclaimed += shrink_pvma_scan_ptes(sc, mm, vma, pmd, addr, next);
+	} while (pmd++, addr = next, addr != end);
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_pud_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma, pgd_t* pgd, unsigned
+		long addr, unsigned long end)
+{
+	unsigned long next;
+	unsigned long nr_reclaimed = 0;
+	pud_t* pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		nr_reclaimed += shrink_pvma_pmd_range(sc, mm, vma, pud, addr, next);
+	} while (pud++, addr = next, addr != end);
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_pvma_pgd_range(struct scan_control* sc, struct
+		mm_struct* mm, struct vm_area_struct* vma)
+{
+	unsigned long addr, end, next;
+	unsigned long nr_reclaimed = 0;
+	pgd_t* pgd;
+#define sppr(from, to) \
+	pgd = pgd_offset(mm, from); \
+	do { \
+		next = pgd_addr_end(addr, to); \
+		if (pgd_none_or_clear_bad(pgd)) \
+			continue; \
+		nr_reclaimed+=shrink_pvma_pud_range(sc,mm,vma,pgd,from,next); \
+	} while (pgd++, from = next, from != to);
+
+	if (sc->is_kppsd) {
+		addr = vma->vm_start;
+		end = vma->vm_end;
+		sppr(addr, end)
+	} else {
+#ifdef CONFIG_NUMA
+		unsigned long start = end = -1;
+		// Enumerate all ptes of the memory-inode according to start
+		// and end, call sppr(start, end).
+#else
+		addr = vma->vm_start;
+		end = vma->vm_end;
+		sppr(addr, end)
+#endif
+	}
+#undef sppr
+	return nr_reclaimed;
+}
+
+static unsigned long shrink_private_vma(struct scan_control* sc)
+{
+	struct vm_area_struct* vma;
+	struct list_head *pos;
+	struct mm_struct *prev, *mm;
+	unsigned long nr_reclaimed = 0;
+
+	prev = mm = &init_mm;
+	pos = &init_mm.mmlist;
+	atomic_inc(&prev->mm_users);
+	spin_lock(&mmlist_lock);
+	while ((pos = pos->next) != &init_mm.mmlist) {
+		mm = list_entry(pos, struct mm_struct, mmlist);
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			continue;
+		spin_unlock(&mmlist_lock);
+		mmput(prev);
+		prev = mm;
+		if (down_read_trylock(&mm->mmap_sem)) {
+			if (sc->is_kppsd) {
+				start_dftlb(mm);
+			} else {
+				struct pglist_data* node_data =
+					NODE_DATA(sc->reclaim_node);
+				node_data->last_nail_addr = 0;
+			}
+			for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+				// More tasks can be done by kppsd; see the
+				// <New core daemon -- kppsd> section.
+				if (!(vma->vm_flags & VM_PURE_PRIVATE))
+					continue;
+				if (vma->vm_flags & VM_LOCKED)
+					continue;
+				nr_reclaimed+=shrink_pvma_pgd_range(sc,mm,vma);
+			}
+			if (sc->is_kppsd)
+				end_dftlb();
+			up_read(&mm->mmap_sem);
+		}
+		spin_lock(&mmlist_lock);
+	}
+	spin_unlock(&mmlist_lock);
+	mmput(prev);
+	return nr_reclaimed;
+}
+
 /*
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
@@ -1131,6 +1875,8 @@
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
 		.swappiness = vm_swappiness,
+		.reclaim_node = pgdat->node_id,
+		.is_kppsd = 0,
 	};
 	/*
 	 * temp_priority is used to remember the scanning priority at which
@@ -1144,6 +1890,11 @@
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);

+	if (pgdat->nr_present_pte.counter > pgdat->nr_unmapped_pte.counter)
+		wake_up(&kppsd_wait);
+	accelerate_kppsd++;
+	nr_reclaimed += shrink_private_vma(&sc);
+
 	for (i = 0; i < pgdat->nr_zones; i++)
 		temp_priority[i] = DEF_PRIORITY;

@@ -1729,3 +2480,33 @@
 	return __zone_reclaim(zone, gfp_mask, order);
 }
 #endif
+
+static int kppsd(void* p)
+{
+	struct task_struct *tsk = current;
+	struct scan_control default_sc;
+	DEFINE_WAIT(wait);
+	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+	default_sc.gfp_mask = GFP_KERNEL;
+	default_sc.may_swap = 1;
+	default_sc.reclaim_node = -1;
+	default_sc.is_kppsd = 1;
+
+	while (1) {
+		try_to_freeze();
+		accelerate_kppsd >>= 1;
+		wait_event_timeout(kppsd_wait, accelerate_kppsd != 0,
+				msecs_to_jiffies(16000));
+		shrink_private_vma(&default_sc);
+	}
+	return 0;
+}
+
+static int __init kppsd_init(void)
+{
+	init_waitqueue_head(&kppsd_wait);
+	kthread_run(kppsd, NULL, "kppsd");
+	return 0;
+}
+
+module_init(kppsd_init)
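
The accelerate_kppsd counter above works as a simple decaying throttle: every
balance_pgdat() run bumps it (and wakes the daemon when enough PresentPTEs
remain), while each kppsd pass halves it, so a burst of memory pressure causes
a few back-to-back scans that die off once kswapd goes quiet again. A
user-space sketch of just the counter behaviour (illustrative only; these
names are invented for the example):

#include <stdio.h>

static unsigned int accelerate;	/* stands in for accelerate_kppsd */

static void memory_pressure_event(void)
{
	accelerate++;	/* balance_pgdat() side: bump and wake the daemon */
}

static void daemon_pass(int pass)
{
	accelerate >>= 1;	/* kppsd side: decay once per pass */
	printf("pass %d: accelerate=%u -> %s\n", pass, accelerate,
	       accelerate ? "rescan soon" : "sleep up to 16s");
}

int main(void)
{
	int i;

	for (i = 0; i < 4; i++)
		memory_pressure_event();	/* burst of pressure */
	for (i = 1; i <= 4; i++)
		daemon_pass(i);
	return 0;
}
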
Index: linux-2.6.22/mm/vmstat.c
===================================================================
--- linux-2.6.22.orig/mm/vmstat.c	2007-08-23 15:26:44.854410322 +0800
+++ linux-2.6.22/mm/vmstat.c	2007-08-23 15:30:09.575204572 +0800
@@ -609,6 +609,17 @@
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
+	seq_printf(m,
+			"\n------------------------"
+			"\n  nr_pps_total:       %i"
+			"\n  nr_present_pte:     %i"
+			"\n  nr_unmapped_pte:    %i"
+			"\n  nr_swapped_pte:     %i",
+			pgdat->nr_pps_total.counter,
+			pgdat->nr_present_pte.counter,
+			pgdat->nr_unmapped_pte.counter,
+			pgdat->nr_swapped_pte.counter);
+	seq_putc(m, '\n');
 	return 0;
 }
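
Read together, the pps path moves a private page through five stages:
pps_stage1 clears the accessed bit, pps_stage2 unmaps it through the batched
dftlb flush, pps_stage3 (re-)allocates a nearby swap entry, pps_stage4 writes
the page out, and pps_stage5 installs the SwappedPTE and frees the page. A
schematic sketch of that progression (the enum and names below are invented
for illustration, not kernel code):

#include <stdio.h>

enum pps_stage_state {
	STAGE1_PRESENT_YOUNG,	/* pps_stage1: clear the accessed bit */
	STAGE2_PRESENT_OLD,	/* pps_stage2: unmap via the dftlb batch */
	STAGE3_UNMAPPED,	/* pps_stage3: (re-)allocate a nearby swap entry */
	STAGE4_WRITING,		/* pps_stage4: write the page to its swap slot */
	STAGE5_SWAPPED		/* pps_stage5: install SwappedPTE, free the page */
};

static const char *names[] = {
	"PresentPTE (young)", "PresentPTE (old)", "UnmappedPTE (swap assigned)",
	"UnmappedPTE (writing back)", "SwappedPTE (page freed)"
};

int main(void)
{
	enum pps_stage_state s;

	for (s = STAGE1_PRESENT_YOUNG; s < STAGE5_SWAPPED; s++)
		printf("%s -> ", names[s]);
	printf("%s\n", names[STAGE5_SWAPPED]);
	return 0;
}
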



* Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
  2007-08-23  9:47                             ` yunfeng zhang
@ 2007-08-23 15:56                               ` Randy Dunlap
  -1 siblings, 0 replies; 27+ messages in thread
From: Randy Dunlap @ 2007-08-23 15:56 UTC (permalink / raw)
  To: yunfeng zhang; +Cc: linux-kernel, linux-mm, hugh, riel

On Thu, 23 Aug 2007 17:47:44 +0800 yunfeng zhang wrote:

> Signed-off-by: Yunfeng Zhang <zyf.zeroos@gmail.com>
> 
> The major change is
> 1) Using nail arithmetic to maximize SwapDevice performance.
> 2) Add a PG_pps bit to mark every pps page.
> 3) Some discussion about NUMA.
> See vm_pps.txt
> 
> Index: linux-2.6.22/Documentation/vm_pps.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.22/Documentation/vm_pps.txt	2007-08-23 17:04:12.051837322 +0800
> @@ -0,0 +1,365 @@
> +
> +                         Pure Private Page System (pps)
> +                              zyf.zeroos@gmail.com
> +                              December 24-26, 2006
> +                            Revised on Aug 23, 2007
> +
> +// Purpose <([{
> +The file is used to document the idea which is published firstly at
> +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as a part of my
> +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
> +patch of the document is for enchancing the performance of Linux swap
> +subsystem. You can find the overview of the idea in section <How to Reclaim
> +Pages more Efficiently> and how I patch it into Linux 2.6.21 in section
> +<Pure Private Page System -- pps>.
> +// }])>

Hi,
What (text) format/markup language is the vm_pps.txt file in?

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***



