linux-kernel.vger.kernel.org archive mirror
* Increase page fault rate by prezeroing V1 [0/3]: Overview
       [not found] ` <41C20E3E.3070209@yahoo.com.au>
@ 2004-12-21 19:55   ` Christoph Lameter
  2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
                       ` (4 more replies)
  0 siblings, 5 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel


The patches that increase the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running on
SMP systems with a high number of cpus. Single threaded performance shows
only minor increases; only the performance of multi-threaded applications
increases significantly.

The most expensive operation in the page fault handler (apart from the
SMP locking overhead) is the zeroing of the page. Others have seen this
too and have tried to provide zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

The problem so far has been that simple zeroing of pages merely shifts
the time spent somewhere else. Moreover, one would not want to zero hot
pages.

This patch addresses those issues and makes page zeroing more effective
by:

1. Aggregating zeroing operations so that they mainly apply to larger
order pages, which allows many later order-0 pages to be zeroed in one
go. For that purpose a new architecture specific function
zero_page(page, order) is introduced (a sketch of a generic fallback
follows below).

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches that extensive zeroing operations
would otherwise cause.
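
For architectures without an optimized zero_page() implementation, the
fallback mentioned in part [1/3] can simply iterate clear_page() over
the block. A minimal sketch of that idea (illustrative only, not taken
verbatim from the patches):

	static inline void zero_page(void *addr, int order)
	{
		int i;

		/* an order-N block is 2^N contiguous pages */
		for (i = 0; i < (1 << order); i++)
			clear_page((char *)addr + i * PAGE_SIZE);
	}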

The result is a significant increase in page fault performance, even
for single threaded applications:

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852

w/patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   1   1    1    0.014s      0.110s   0.012s 524292.194 517665.538

This is a performance increase by a factor of 8!

The performance can only be sustained if enough zeroed pages are
available. In a heavily memory intensive benchmark the system will run
out of these very quickly, but the efficient algorithm for page zeroing
still makes this a winner (8-way system with 6 GB RAM, no hardware
zeroing support):

w/o patch:

Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852
 4   3    2    0.170s     14.909s   7.097s 52150.369  98643.687
 4   3    4    0.181s     16.597s   5.079s 46869.167 135642.420
 4   3    8    0.166s     23.239s   4.037s 33599.215 179791.120

w/patch:
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.183s      2.750s   2.093s 268077.996 267952.890
 4   3    2    0.185s      4.876s   2.097s 155344.562 263967.292
 4   3    4    0.150s      6.617s   2.097s 116205.793 264774.080
 4   3    8    0.186s     13.693s   3.054s  56659.819 221701.073

The patch is composed of 3 parts:

[1/3] Introduce __GFP_ZERO
	Modifies the page allocator to accept the __GFP_ZERO flag and
	return zeroed memory on request. Modifies locations throughout
	the linux sources that retrieve a page and then zero it so that
	they request a zeroed page instead.
	Adds new low level zero_page functions for i386, ia64 and x86_64.
	(x86_64 untested)

[2/3] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background
	daemon called scrubd. scrubd is disabled by default but can be
	enabled by writing an order number to /proc/sys/vm/scrub_start
	(a usage sketch follows after this list). If a page of that order
	(or higher) is coalesced then the scrub daemon will start zeroing
	until all pages of order /proc/sys/vm/scrub_stop and higher are
	zeroed.

[3/3]	SGI Altix Block Transfer Engine Support
	Implements a driver that shifts zeroing off the cpu and into
	hardware. With hardware support, zeroing has minimal impact on
	the performance of the system.
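
As a usage illustration for [2/3], here is a hypothetical userspace
snippet (not part of the patches) equivalent to
"echo 5 > /proc/sys/vm/scrub_start":

	#include <fcntl.h>
	#include <unistd.h>

	static int enable_scrubd(void)
	{
		/* start zeroing whenever an order-5 block coalesces */
		int fd = open("/proc/sys/vm/scrub_start", O_WRONLY);

		if (fd < 0)
			return -1;
		if (write(fd, "5\n", 2) != 2) {
			close(fd);
			return -1;
		}
		return close(fd);
	}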




* Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
@ 2004-12-21 19:56     ` Christoph Lameter
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

This patch introduces __GFP_ZERO as an additional gfp_mask flag to allow
requesting zeroed pages from the page allocator.

- Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

- Replaces page zeroing after allocating pages with requests for
  zeroed pages.

- Adds an arch specific call zero_page to clear pages of order greater
  than 0, with a fallback to repeated calls to clear_page if an
  architecture does not support zero_page(address, order) yet.

- Adds an ia64 zero_page function
- Adds an i386 zero_page function
- Adds an x86_64 zero_page function (untested, unverified)
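
The typical call-site conversion looks like this (a sketch for
illustration; the hunks below contain the real instances):

	/* before: allocate, then zero by hand */
	page = alloc_page(GFP_KERNEL);
	if (page)
		clear_page(page_address(page));

	/* after: the allocator returns zeroed (ideally prezeroed) memory */
	page = alloc_page(GFP_KERNEL | __GFP_ZERO);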

Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-21 10:19:37.000000000 -0800
@@ -575,6 +575,18 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+			if (PageHighMem(page)) {
+				int n = 1 << order;
+
+				while (n-- >0)
+					clear_highpage(page + n);
+			} else
+#endif
+			zero_page(page_address(page), order);
+		}
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -767,12 +779,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.9/include/linux/gfp.h
===================================================================
--- linux-2.6.9.orig/include/linux/gfp.h	2004-10-18 14:53:44.000000000 -0700
+++ linux-2.6.9/include/linux/gfp.h	2004-12-21 10:19:37.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-21 10:19:37.000000000 -0800
@@ -1445,10 +1445,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.9/kernel/profile.c
===================================================================
--- linux-2.6.9.orig/kernel/profile.c	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/kernel/profile.c	2004-12-21 10:19:37.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.9/mm/shmem.c
===================================================================
--- linux-2.6.9.orig/mm/shmem.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/shmem.c	2004-12-21 10:19:37.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c	2004-12-21 10:19:37.000000000 -0800
@@ -77,7 +77,6 @@
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -88,8 +87,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	zero_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }

Index: linux-2.6.9/arch/ia64/lib/Makefile
===================================================================
--- linux-2.6.9.orig/arch/ia64/lib/Makefile	2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/arch/ia64/lib/Makefile	2004-12-21 10:19:37.000000000 -0800
@@ -6,7 +6,7 @@

 lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o			\
 	__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o			\
-	bitop.o checksum.o clear_page.o csum_partial_copy.o copy_page.o	\
+	bitop.o checksum.o clear_page.o zero_page.o csum_partial_copy.o copy_page.o	\
 	clear_user.o strncpy_from_user.o strlen_user.o strnlen_user.o	\
 	flush.o ip_fast_csum.o do_csum.o				\
 	memset.o strlen.o swiotlb.o
Index: linux-2.6.9/include/asm-ia64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/page.h	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -57,6 +57,8 @@
 #  define STRICT_MM_TYPECHECKS

 extern void clear_page (void *page);
+extern void zero_page (void *page, int order);
+
 extern void copy_page (void *to, void *from);

 /*
Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h	2004-12-21 10:19:37.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd == NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -107,10 +105,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -141,20 +137,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/ia64/lib/zero_page.S
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/arch/ia64/lib/zero_page.S	2004-12-21 10:19:37.000000000 -0800
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 1999-2002 Hewlett-Packard Co
+ *	Stephane Eranian <eranian@hpl.hp.com>
+ *	David Mosberger-Tang <davidm@hpl.hp.com>
+ * Copyright (C) 2002 Ken Chen <kenneth.w.chen@intel.com>
+ *
+ * 1/06/01 davidm	Tuned for Itanium.
+ * 2/12/02 kchen	Tuned for both Itanium and McKinley
+ * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on higher order pages
+ */
+#include <linux/config.h>
+
+#include <asm/asmmacro.h>
+#include <asm/page.h>
+
+#ifdef CONFIG_ITANIUM
+# define L3_LINE_SIZE	64	// Itanium L3 line size
+# define PREFETCH_LINES	9	// magic number
+#else
+# define L3_LINE_SIZE	128	// McKinley L3 line size
+# define PREFETCH_LINES	12	// magic number
+#endif
+
+#define saved_lc	r2
+#define dst_fetch	r3
+#define dst1		r8
+#define dst2		r9
+#define dst3		r10
+#define dst4		r11
+
+#define dst_last	r31
+#define totsize		r14
+
+GLOBAL_ENTRY(zero_page)
+	.prologue
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
+	.save ar.lc, saved_lc
+	mov saved_lc = ar.lc
+	;;
+	.body
+	adds dst1 = 16, in0
+	mov ar.lc = (PREFETCH_LINES - 1)
+	mov dst_fetch = in0
+	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
+	;;
+.fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+	adds dst3 = 48, in0		// executing this multiple times is harmless
+	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
+	;;
+	mov ar.lc = r16			// one L3 line per iteration
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
+	;;
+#ifdef CONFIG_ITANIUM
+	// Optimized for Itanium
+1:	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+	cmp.lt p8,p0=dst_fetch, dst_last
+	;;
+#else
+	// Optimized for McKinley
+1:	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+	stf.spill.nta [dst3] = f0, 64
+	stf.spill.nta [dst4] = f0, 128
+	cmp.lt p8,p0=dst_fetch, dst_last
+	;;
+	stf.spill.nta [dst1] = f0, 64
+	stf.spill.nta [dst2] = f0, 64
+#endif
+	stf.spill.nta [dst3] = f0, 64
+(p8)	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+	br.cloop.sptk.few 1b
+	;;
+	mov ar.lc = saved_lc		// restore lc
+	br.ret.sptk.many rp
+END(zero_page)
Index: linux-2.6.9/include/asm-i386/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/page.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-i386/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -20,6 +20,7 @@

 #define clear_page(page)	mmx_clear_page((void *)(page))
 #define copy_page(to,from)	mmx_copy_page(to,from)
+#define zero_page(page, order)	mmx_zero_page(page, order)

 #else

@@ -29,6 +30,7 @@
  */

 #define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define zero_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif
Index: linux-2.6.9/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/page.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -33,6 +33,7 @@
 #ifndef __ASSEMBLY__

 void clear_page(void *);
+void zero_page(void *, int);
 void copy_page(void *, void *);

 #define clear_user_page(page, vaddr, pg)	clear_page(page)
Index: linux-2.6.9/include/asm-sparc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-sparc/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -29,6 +29,7 @@
 #ifndef __ASSEMBLY__

 #define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define zero_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
 	do { 	clear_page(addr);		\
Index: linux-2.6.9/include/asm-s390/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/page.h	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/include/asm-s390/page.h	2004-12-21 10:19:37.000000000 -0800
@@ -33,6 +33,17 @@
 		      : "+&a" (rp) : : "memory", "cc", "1" );
 }

+static inline void zero_page(void *page, int order)
+{
+	register_pair rp;
+
+	rp.subreg.even = (unsigned long) page;
+	rp.subreg.odd = (unsigned long) 4096 << order;
+        asm volatile ("   slr  1,1\n"
+		      "   mvcl %0,0"
+		      : "+&a" (rp) : : "memory", "cc", "1" );
+}
+
 static inline void copy_page(void *to, void *from)
 {
         if (MACHINE_HAS_MVPG)
Index: linux-2.6.9/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.9.orig/arch/i386/lib/mmx.c	2004-10-18 14:54:23.000000000 -0700
+++ linux-2.6.9/arch/i386/lib/mmx.c	2004-12-21 10:55:00.000000000 -0800
@@ -161,6 +161,39 @@
 	kernel_fpu_end();
 }

+static void fast_zero_page(void *page, int order)
+{
+	int i;
+
+	kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"  pxor %%mm0, %%mm0\n" : :
+	);
+
+	for(i=0;i<((4096/64) << order);i++)
+	{
+		__asm__ __volatile__ (
+		"  movntq %%mm0, (%0)\n"
+		"  movntq %%mm0, 8(%0)\n"
+		"  movntq %%mm0, 16(%0)\n"
+		"  movntq %%mm0, 24(%0)\n"
+		"  movntq %%mm0, 32(%0)\n"
+		"  movntq %%mm0, 40(%0)\n"
+		"  movntq %%mm0, 48(%0)\n"
+		"  movntq %%mm0, 56(%0)\n"
+		: : "r" (page) : "memory");
+		page+=64;
+	}
+	/* since movntq is weakly-ordered, a "sfence" is needed to become
+	 * ordered again.
+	 */
+	__asm__ __volatile__ (
+		"  sfence \n" : :
+	);
+	kernel_fpu_end();
+}
+
 static void fast_copy_page(void *to, void *from)
 {
 	int i;
@@ -293,6 +326,42 @@
 	kernel_fpu_end();
 }

+static void fast_zero_page(void *page, int order)
+{
+	int i;
+
+	kernel_fpu_begin();
+
+	__asm__ __volatile__ (
+		"  pxor %%mm0, %%mm0\n" : :
+	);
+
+	for(i=0;i<((4096/128) << order);i++)
+	{
+		__asm__ __volatile__ (
+		"  movq %%mm0, (%0)\n"
+		"  movq %%mm0, 8(%0)\n"
+		"  movq %%mm0, 16(%0)\n"
+		"  movq %%mm0, 24(%0)\n"
+		"  movq %%mm0, 32(%0)\n"
+		"  movq %%mm0, 40(%0)\n"
+		"  movq %%mm0, 48(%0)\n"
+		"  movq %%mm0, 56(%0)\n"
+		"  movq %%mm0, 64(%0)\n"
+		"  movq %%mm0, 72(%0)\n"
+		"  movq %%mm0, 80(%0)\n"
+		"  movq %%mm0, 88(%0)\n"
+		"  movq %%mm0, 96(%0)\n"
+		"  movq %%mm0, 104(%0)\n"
+		"  movq %%mm0, 112(%0)\n"
+		"  movq %%mm0, 120(%0)\n"
+		: : "r" (page) : "memory");
+		page+=128;
+	}
+
+	kernel_fpu_end();
+}
+
 static void fast_copy_page(void *to, void *from)
 {
 	int i;
@@ -359,7 +428,7 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
@@ -369,15 +438,34 @@
 		:"a" (0),"1" (page),"0" (1024)
 		:"memory");
 }
+
+static void slow_zero_page(void * page, int order)
+{
+	int d0, d1;
+	__asm__ __volatile__( \
+		"cld\n\t" \
+		"rep ; stosl" \
+		: "=&c" (d0), "=&D" (d1)
+		:"a" (0),"1" (page),"0" (1024 << order)
+		:"memory");
+}

 void mmx_clear_page(void * page)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page);
 	else
 		fast_clear_page(page);
 }

+void mmx_zero_page(void * page, int order)
+{
+	if(unlikely(in_interrupt()))
+		slow_zero_page(page, order);
+	else
+		fast_zero_page(page, order);
+}
+
 static void slow_copy_page(void *to, void *from)
 {
 	int d0, d1, d2;
Index: linux-2.6.9/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.9.orig/arch/i386/mm/pgtable.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/i386/mm/pgtable.c	2004-12-21 10:19:37.000000000 -0800
@@ -132,10 +132,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -143,12 +140,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/i386/kernel/i386_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/i386_ksyms.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/i386_ksyms.c	2004-12-21 10:19:37.000000000 -0800
@@ -126,6 +126,7 @@
 #ifdef CONFIG_X86_USE_3DNOW
 EXPORT_SYMBOL(_mmx_memcpy);
 EXPORT_SYMBOL(mmx_clear_page);
+EXPORT_SYMBOL(mmx_zero_page);
 EXPORT_SYMBOL(mmx_copy_page);
 #endif

Index: linux-2.6.9/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.9.orig/drivers/block/pktcdvd.c	2004-12-17 14:40:12.000000000 -0800
+++ linux-2.6.9/drivers/block/pktcdvd.c	2004-12-21 10:19:37.000000000 -0800
@@ -125,22 +125,19 @@
 	int i;
 	struct packet_data *pkt;

-	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
+	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
 	if (!pkt)
 		goto no_pkt;
-	memset(pkt, 0, sizeof(struct packet_data));

 	pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
 	if (!pkt->w_bio)
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);

Index: linux-2.6.9/arch/x86_64/lib/zero_page.S
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/arch/x86_64/lib/zero_page.S	2004-12-21 10:19:37.000000000 -0800
@@ -0,0 +1,57 @@
+/*
+ * Zero a block of pages of the given order.
+ * rdi	page
+ * rsi	order
+ */
+	.globl zero_page
+	.p2align 4
+zero_page:
+	movl   %esi,%ecx	/* order */
+	movl   $4096/64,%eax
+	shll   %cl,%eax		/* loop count = (4096/64) << order */
+	movl   %eax,%ecx
+	xorl   %eax,%eax
+	.p2align 4
+.Lloop:
+	decl	%ecx
+#define PUT(x) movq %rax,x*8(%rdi)
+	movq %rax,(%rdi)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+	leaq	64(%rdi),%rdi
+	jnz	.Lloop
+	nop
+	ret
+zero_page_end:
+
+	/* C stepping K8 run faster using the string instructions.
+	   It is also a lot simpler. Use this when possible */
+
+#include <asm/cpufeature.h>
+
+	.section .altinstructions,"a"
+	.align 8
+	.quad  zero_page
+	.quad  zero_page_c
+	.byte  X86_FEATURE_K8_C
+	.byte  zero_page_end-zero_page
+	.byte  zero_page_c_end-zero_page_c
+	.previous
+
+	.section .altinstr_replacement,"ax"
+zero_page_c:
+	movl %esi,%ecx
+	movl $4096/8,%eax
+	shll %cl,%eax		/* qword count = (4096/8) << order */
+	movl %eax,%ecx
+	xorl %eax,%eax
+	rep
+	stosq
+	ret
+zero_page_c_end:
+	.previous
Index: linux-2.6.9/arch/x86_64/lib/Makefile
===================================================================
--- linux-2.6.9.orig/arch/x86_64/lib/Makefile	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/arch/x86_64/lib/Makefile	2004-12-21 10:19:37.000000000 -0800
@@ -7,7 +7,7 @@
 obj-y := io.o

 lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
-	usercopy.o getuser.o putuser.o  \
+	usercopy.o getuser.o putuser.o zero_page.o \
 	thunk.o clear_page.o copy_page.o bitstr.o bitops.o
 lib-y += memcpy.o memmove.o memset.o copy_user.o

Index: linux-2.6.9/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/mmx.h	2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/mmx.h	2004-12-21 10:19:37.000000000 -0800
@@ -9,6 +9,7 @@

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
 extern void mmx_clear_page(void *page);
+extern void mmx_zero_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.9/arch/x86_64/kernel/x8664_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/x86_64/kernel/x8664_ksyms.c	2004-12-17 14:40:11.000000000 -0800
+++ linux-2.6.9/arch/x86_64/kernel/x8664_ksyms.c	2004-12-21 10:19:37.000000000 -0800
@@ -110,6 +110,7 @@

 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(zero_page);

 EXPORT_SYMBOL(cpu_pda);
 #ifdef CONFIG_SMP



* Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
  2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
@ 2004-12-21 19:57     ` Christoph Lameter
  2005-01-01  2:22       ` Nick Piggin
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

o Add page zeroing
o Add scrub daemon
o Add the ability to view the amount of zeroed memory in /proc/meminfo
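
To show how the zero_driver interface added below (include/linux/scrub.h)
is meant to be used, here is a hypothetical synchronous driver. It is
purely illustrative and not part of the patch; a real driver, like the
BTE driver in [3/3], would start a DMA transfer in start() and poll for
its completion in check():

	#include <linux/string.h>
	#include <linux/scrub.h>

	static int dummy_start_bzero(void *addr, unsigned length)
	{
		memset(addr, 0, length);	/* stand-in for offload hardware */
		return 0;			/* 0 = request accepted */
	}

	static int dummy_check_bzero(void)
	{
		return 1;			/* synchronous, so always complete */
	}

	static struct zero_driver dummy_bzero = {
		.start	= dummy_start_bzero,
		.check	= dummy_check_bzero,
		.rate	= 500000000,	/* bytes/sec; kscrubd sizes its wait from this */
	};

	/* in some init path: register_zero_driver(&dummy_bzero); */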

Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-21 10:19:37.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-21 11:01:40.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -32,6 +33,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>

@@ -179,7 +181,7 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
 		struct zone *zone, struct free_area *area, unsigned int order)
 {
 	unsigned long page_idx, index, mask;
@@ -192,11 +194,10 @@
 		BUG();
 	index = page_idx >> (1 + order);

-	zone->free_pages += 1 << order;
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

-		BUG_ON(area >= zone->free_area + MAX_ORDER);
+		BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
 		if (!__test_and_change_bit(index, area->map))
 			/*
 			 * the buddy page is still allocated.
@@ -216,6 +217,7 @@
 		page_idx &= mask;
 	}
 	list_add(&(base + page_idx)->lru, &area->free_list);
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -258,7 +260,7 @@
 	int ret = 0;

 	base = zone->zone_mem_map;
-	area = zone->free_area + order;
+	area = zone->free_area[NOT_ZEROED] + order;
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
@@ -266,7 +268,10 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, area, order);
+		zone->free_pages += 1 << order;
+		if (__free_pages_bulk(page, base, zone, area, order)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -288,6 +293,21 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page)
+{
+	unsigned long flags;
+	int order = page->index;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	zone->zero_pages += 1 << order;
+	__free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
 #define MARK_USED(index, order, area) \
 	__change_bit((index) >> (1+(order)), (area)->map)

@@ -366,25 +386,46 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static inline void rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+	list_del(&page->lru);
+	if (order != MAX_ORDER-1)
+		MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+
+		rmpage(page, zone, area, order);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
-	unsigned int index;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		index = page - zone->zone_mem_map;
-		if (current_order != MAX_ORDER-1)
-			MARK_USED(index, current_order, area);
+		rmpage(page, zone, area, current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, index, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
 	}

 	return NULL;
@@ -396,7 +437,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -405,7 +446,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -546,7 +587,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order == 0) {
 		struct per_cpu_pages *pcp;
@@ -555,7 +598,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -567,19 +610,30 @@

 	if (page == NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+
+		page = __rmqueue(zone, order, zero);
+
+		/*
+		 * If we failed to obtain a zero and/or unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
-		prep_new_page(page, order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);

-		if (gfp_flags & __GFP_ZERO) {
+		if ((gfp_flags & __GFP_ZERO) && !zero) {
 #ifdef CONFIG_HIGHMEM
 			if (PageHighMem(page)) {
-				int n = 1 << order;
+				int n = nr_pages;

 				while (n-- >0)
 					clear_highpage(page + n);
@@ -587,6 +641,7 @@
 #endif
 			zero_page(page_address(page), order);
 		}
+		prep_new_page(page, order);
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -974,7 +1029,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -982,27 +1037,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n, o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1039,6 +1098,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1051,6 +1111,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1071,10 +1132,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1082,20 +1143,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1146,7 +1208,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
 			nr = 0;
-			list_for_each(elem, &zone->free_area[order].free_list)
+			list_for_each(elem, &zone->free_area[NOT_ZEROED][order].free_list)
 				++nr;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
@@ -1470,14 +1532,18 @@
 	for (order = 0; ; order++) {
 		unsigned long bitmap_size;

-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
 		if (order == MAX_ORDER-1) {
-			zone->free_area[order].map = NULL;
+			zone->free_area[NOT_ZEROED][order].map = NULL;
+			zone->free_area[ZEROED][order].map = NULL;
 			break;
 		}

 		bitmap_size = pages_to_bitmap_size(order, size);
-		zone->free_area[order].map =
+		zone->free_area[NOT_ZEROED][order].map =
+		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+		zone->free_area[ZEROED][order].map =
 		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
 	}
 }
@@ -1503,6 +1569,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);

 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -1525,6 +1592,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1558,6 +1626,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1687,7 +1762,7 @@
 			unsigned long nr_bufs = 0;
 			struct list_head *elem;

-			list_for_each(elem, &(zone->free_area[order].free_list))
+			list_for_each(elem, &(zone->free_area[NOT_ZEROED][order].free_list))
 				++nr_bufs;
 			seq_printf(m, "%6lu ", nr_bufs);
 		}
Index: linux-2.6.9/include/linux/mmzone.h
===================================================================
--- linux-2.6.9.orig/include/linux/mmzone.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/mmzone.h	2004-12-21 11:01:15.000000000 -0800
@@ -51,7 +51,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -107,10 +107,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -265,6 +269,9 @@
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
 	struct task_struct *kswapd;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone);

Index: linux-2.6.9/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.9.orig/fs/proc/proc_misc.c	2004-12-17 14:40:15.000000000 -0800
+++ linux-2.6.9/fs/proc/proc_misc.c	2004-12-21 11:01:15.000000000 -0800
@@ -158,13 +158,14 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long vmtot;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -187,6 +188,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -210,6 +212,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.9/mm/readahead.c
===================================================================
--- linux-2.6.9.orig/mm/readahead.c	2004-10-18 14:53:11.000000000 -0700
+++ linux-2.6.9/mm/readahead.c	2004-12-21 11:01:15.000000000 -0800
@@ -570,7 +570,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.9/drivers/base/node.c
===================================================================
--- linux-2.6.9.orig/drivers/base/node.c	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/drivers/base/node.c	2004-12-21 11:01:15.000000000 -0800
@@ -41,13 +41,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -57,6 +59,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-21 11:01:15.000000000 -0800
@@ -715,6 +715,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.9/mm/Makefile
===================================================================
--- linux-2.6.9.orig/mm/Makefile	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/Makefile	2004-12-21 11:01:15.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.9/mm/scrubd.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/mm/scrubd.c	2004-12-21 11:01:15.000000000 -0800
@@ -0,0 +1,148 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = MAX_ORDER;		/* Off */
+unsigned int sysctl_scrub_stop = 2;	/* Minimum order of page to zero */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area, order);
+			struct list_head *l;
+
+			if (!page)
+				continue;
+
+			page->index = order;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+				unsigned long size = PAGE_SIZE << order;
+
+				if (driver->start(page_address(page), size) == 0) {
+
+					unsigned ticks = (size*HZ)/driver->rate;
+					if (ticks) {
+						/* Wait the minimum time of the transfer */
+						current->state = TASK_INTERRUPTIBLE;
+						schedule_timeout(ticks);
+					}
+					/* Then keep on checking until transfer is complete */
+					while (!driver->check())
+						schedule();
+					goto out;
+				}
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			zero_page(page_address(page), order);
+out:
+			end_zero_page(page);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.9/include/linux/scrub.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/scrub.h	2004-12-21 11:01:15.000000000 -0800
@@ -0,0 +1,48 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory, including an interface
+ * for drivers that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start)(void *, unsigned length);		/* Start bzero transfer */
+	int (*check)(void);				/* Check if bzero is complete */
+	int rate;					/* bzero rate in bytes/sec */
+        struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+        if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+                return;
+        wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.9/kernel/sysctl.c
===================================================================
--- linux-2.6.9.orig/kernel/sysctl.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c	2004-12-21 11:01:15.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -816,6 +817,24 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.9/include/linux/sysctl.h
===================================================================
--- linux-2.6.9.orig/include/linux/sysctl.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h	2004-12-21 11:01:15.000000000 -0800
@@ -168,6 +168,8 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_SCRUB_START=30,	/* order at which kscrubd starts zeroing */
+	VM_SCRUB_STOP=31,	/* order at which kscrubd stops zeroing */
 };





* Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
  2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
@ 2004-12-21 19:57     ` Christoph Lameter
  2004-12-22 12:46       ` Robin Holt
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
  4 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

o Use the Block Transfer Engine in the Altix SN2 SHub for background zeroing
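
The central BTE change is the new BTE_NOTIFY_AND_GET_POINTER mode:
rather than supplying its own notification word, the caller receives a
pointer to the BTE's notify word and can poll it later. A rough usage
sketch (assuming bte_zero() returns 0 on success and BTE_WORD_BUSY as
used in the patch below):

	static u64 *notify;

	if (bte_zero(ia64_tpa(addr), len, BTE_NOTIFY_AND_GET_POINTER,
		     &notify) == 0) {
		/* transfer started; completion can be checked later */
		while (*notify == BTE_WORD_BUSY)
			schedule();
	}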

Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-21 11:03:49.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,11 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+DEFINE_PER_CPU(u64 *, bte_zero_notify);
+
+#define bte_zero_notify __get_cpu_var(bte_zero_notify)
+
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +140,6 @@
 			if (bte == NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +164,7 @@
 		}
 	} while (1);

-	if (notification == NULL) {
+	if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +199,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +458,31 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+static int bte_check_bzero(void)
+{
+	return *bte_zero_notify != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	/* Check limitations.
+		1. System must be running (weird things happen during bootup)
+		2. Size must be at least 60000 bytes; smaller requests cause too much bte traffic
+	 */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, &bte_zero_notify);
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+	.check = bte_check_bzero,
+	.rate = 500000000		/* 500 MB/sec, expressed in bytes/sec */
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/setup.c	2004-12-21 11:02:35.000000000 -0800
@@ -243,6 +243,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.9/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/sn/bte.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/sn/bte.h	2004-12-21 11:02:35.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */



* Re: Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
@ 2004-12-22 12:46       ` Robin Holt
  2004-12-22 19:56         ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: Robin Holt @ 2004-12-22 12:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
	torvalds, linux-mm, linux-kernel

We still need to talk.  This is a much smaller patch, which I like.  The
problem I see in my 30 second review is that you are doing things per-cpu
when they really need to be done per-node.  It is very likely that
there will be M-Bricks in the system (cranberry2 has one if you want
to test your code out there, or you can take any altix and disable the
cpus on a C-Brick).  With M-Bricks, you will essentially limit
yourself to one zero operation per controlling node instead of one
per node.

I think the easy answer is to not have the structure allocated
within bte_copy(), but rather within bte_start_zero and passed
in as the notification address.
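
A sketch of that per-node alternative (hypothetical code; the array and the
lookup are assumptions for illustration, not from either patch version):

	/* One notification pointer per node instead of per cpu, so a
	   cpu that also controls a headless M-Brick does not funnel
	   all of its nodes' zeroing through one per-cpu variable. */
	static u64 *bte_zero_notify[MAX_NUMNODES];

	static int bte_check_bzero(int nid)
	{
		return *bte_zero_notify[nid] != BTE_WORD_BUSY;
	}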

Give me a call sometime today (Wed. I am in the office from about
10:00 CDT until around 4:00 CDT).  Maybe we can get this straightened
out quickly.  If you are not calling from the office, email me with
other arrangements.

Thanks,
Robin


* Re: Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing
  2004-12-22 12:46       ` Robin Holt
@ 2004-12-22 19:56         ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-22 19:56 UTC (permalink / raw)
  To: Robin Holt
  Cc: Nick Piggin, Luck, Tony, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

I have done some additional tests with a 128 cpu SMP machine and they show
that the bte slows things down by about 10-20% during memory benchmarking,
although it causes less cpu load when the system is not under high stress.

So it's not always a win, and I may drop bte support in a future version. Can
we talk off list about this since this is mostly an SGI thing?


* Prezeroing V2 [0/3]: Why and When it works
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
                       ` (2 preceding siblings ...)
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
@ 2004-12-23 19:29     ` Christoph Lameter
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
                         ` (4 more replies)
  2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
  4 siblings, 5 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:29 UTC (permalink / raw)
  Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Changes from V1 to V2:
o Add an explanation (and some bench results) as to why and when this optimization
  works and why other approaches have not worked.
o Instead of zero_page(p, order), extend clear_page to take a second argument
o Update all architectures to accept the second argument to clear_page
o Extensive removal of page alloc/clear_page combinations from all archs
o Blank / typo fixups
o SGI BTE zero driver update: use node specific variables instead of cpu specific,
  since a cpu may be responsible for multiple nodes.

The patches increasing the page fault rate (introduction of atomic pte operations
and anticipatory prefaulting) do so by reducing the locking overhead and are
therefore mainly of interest for applications running on SMP systems with a high
number of cpus. Single thread performance shows only minor increases;
only the performance of multi-threaded applications increases significantly.

The most expensive operation in the page fault handler is (apart from SMP
locking overhead) the zeroing of the page. This zeroing means that all
cachelines of the faulted page (on Altix that means all 128 cachelines of
128 bytes each) must be loaded and later written back. This patch makes it
possible to avoid loading all of those cachelines if only a part of the
page is needed immediately after the fault.

Thus the patch will only be effective for sparsely accessed memory, which
is typical for anonymous memory and pte maps. Prezeroed pages will be used
for those purposes. Unzeroed pages will be used as usual for the other
purposes.

Others have also thought that prezeroing could be a benefit and have tried
to provide zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

However, these attempts zero pages that are soon to be accessed (and that
may already have been accessed recently). Elements of these pages are thus
already in the cache. Approaches like that only shift the processing around
and yield no performance benefit.
Prezeroing only makes sense for pages that are not currently needed and
that are not in the cpu caches. Pages that have recently been touched and
that will soon be touched again are better zeroed hot, since the zeroing
will largely be done to cachelines already in the cpu caches.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many pages that will later become order 0 being
zeroed in one go. For that purpose the existing clear_page function is
extended and made to take an additional argument specifying the order of
the page to be cleared (a minimal sketch follows below).

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.
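
In its simplest form the extended interface is just the generic memset
fallback from [2/4]; optimized per-arch versions shift a loop count instead:

	#define clear_page(page, order) \
		memset((void *)(page), 0, PAGE_SIZE << (order))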

The result is a significant increase of the page fault performance even for
single threaded applications:

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852

w/patch
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
   1   1    1    0.014s      0.110s   0.012s 524292.194 517665.538

The performance can only be upheld if enough zeroed pages are available.
In a heavy memory intensive benchmark the system could potentially
run out of zeroed pages, but the efficient algorithm for page zeroing still
shows this to be a winner:

(8 way system with 6 GB RAM, no hardware zeroing support)

w/o patch:
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.146s     11.155s  11.030s 69584.896  69566.852
 4   3    2    0.170s     14.909s   7.097s 52150.369  98643.687
 4   3    4    0.181s     16.597s   5.079s 46869.167 135642.420
 4   3    8    0.166s     23.239s   4.037s 33599.215 179791.120

w/patch
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 4   3    1    0.183s      2.750s   2.093s 268077.996 267952.890
 4   3    2    0.185s      4.876s   2.097s 155344.562 263967.292
 4   3    4    0.150s      6.617s   2.097s 116205.793 264774.080
 4   3    8    0.186s     13.693s   3.054s  56659.819 221701.073

Note that zeroing of pages makes no sense if the application touches all
cache lines of an allocated page (for that reason prezeroing has no
influence on benchmarks like lmbench): the extensive caching of modern
cpus means that the zeroes written to a hot zeroed page will be
overwritten by the application while still in the cpu cache, and thus
the zeros never make it to memory. The test program used above only
touches one 128 byte cache line of a 16k page (ia64).
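
As a sketch of such a sparse access pattern (hypothetical test code; the
constants follow the description above, this is not the actual benchmark
source):

	#include <stddef.h>

	#define PGSZ	16384	/* ia64 page size assumed above */

	/* Touch one cache line per page: only 1 of the 128 cachelines
	   of each prezeroed page is ever loaded into the cpu cache. */
	static void touch_sparse(char *buf, size_t len)
	{
		size_t off;

		for (off = 0; off < len; off += PGSZ)
			buf[off] = 1;
	}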

Here is another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  1    1   1    0.01s      0.12s   0.01s 500813.853 497925.891
  1  1    1   2    0.01s      0.11s   0.01s 493453.103 472877.725
  1  1    1   4    0.02s      0.10s   0.01s 479351.658 471507.415
  1  1    1   8    0.01s      0.13s   0.01s 424742.054 416725.013
  1  1    1  16    0.05s      0.12s   0.01s 347715.359 336983.834
  1  1    1  32    0.12s      0.13s   0.02s 258112.286 256246.731
  1  1    1  64    0.24s      0.14s   0.03s 169896.381 168189.283
  1  1    1 128    0.49s      0.14s   0.06s 102300.257 101674.435

The benefits of prezeroing become smaller the more cache lines of
a page are touched. Prezeroing can only be effective if memory is not
immediately touched after the anonymous page fault.

The patch is composed of 4 parts:

[1/4] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO flag
	and returns zeroed memory on request. Modifies locations throughout
	the linux sources that retrieve a page and then zero it to request
	a zeroed page.

[2/4] Architecture specific clear_page updates
	Adds a second argument (the order) to clear_page and updates all arches.

Note: The first two patches may be used alone if no zeroing engine is wanted.

[3/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background daemon
	called scrubd. scrubd is disabled by default but can be enabled
	by writing an order number to /proc/sys/vm/scrub_start. If a page
	of that order or higher is coalesced then the scrub daemon will
	start zeroing until all pages of order /proc/sys/vm/scrub_stop and
	higher are zeroed, and then go back to sleep (a schematic sketch of
	the wakeup logic follows this list).

	In an SMP environment the scrub daemon typically runs on the most
	idle cpu. Thus a single threaded application running on one cpu may
	have another cpu zeroing pages for it. The scrub daemon is hardly
	noticeable and usually finishes zeroing quickly since most
	processors are optimized for linear memory filling.

[4/4]	SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into hardware.
	With hardware support there will be minimal impact of zeroing
	on the performance of the system.
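
As a schematic sketch of the scrubd wakeup described in [3/4] (identifier
names here are assumptions for illustration, not taken from the patch):

	#include <linux/wait.h>

	static int sysctl_scrub_start;	/* 0 = disabled; /proc/sys/vm/scrub_start */

	static DECLARE_WAIT_QUEUE_HEAD(scrubd_wait);

	/* Called when buddy coalescing has produced a page of 'order':
	   wake the scrub daemon once free pages become large enough. */
	static inline void scrub_trigger(int order)
	{
		if (sysctl_scrub_start && order >= sysctl_scrub_start)
			wake_up_interruptible(&scrubd_wait);
	}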


* Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-23 19:33       ` Christoph Lameter
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
                           ` (3 more replies)
  2004-12-23 19:49       ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
                         ` (3 subsequent siblings)
  4 siblings, 4 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:33 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
requesting zeroed pages from the page allocator.

o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

o Replaces all page zeroing after allocating pages with requests for
  zeroed pages.

o Requires the arch updates to clear_page() in [2/4] in order to function properly.
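
As a sketch of the conversion pattern (hypothetical helpers, not hunks from
this patch):

	/* Before: allocate, then zero by hand. */
	static unsigned long get_zeroed_old(void)
	{
		unsigned long addr = __get_free_page(GFP_KERNEL);

		if (addr)
			clear_page((void *)addr, 0);
		return addr;
	}

	/* After: the allocator zeroes on success. */
	static unsigned long get_zeroed_new(void)
	{
		return __get_free_page(GFP_KERNEL | __GFP_ZERO);
	}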

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-22 17:23:43.000000000 -0800
@@ -575,6 +575,18 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+			if (PageHighMem(page)) {
+				int n = 1 << order;
+
+				while (n-- > 0)
+					clear_highpage(page + n);
+			} else
+#endif
+			clear_page(page_address(page), order);
+		}
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -767,12 +779,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.9/include/linux/gfp.h
===================================================================
--- linux-2.6.9.orig/include/linux/gfp.h	2004-10-18 14:53:44.000000000 -0700
+++ linux-2.6.9/include/linux/gfp.h	2004-12-22 17:23:43.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-12-22 17:23:43.000000000 -0800
@@ -1445,10 +1445,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.9/kernel/profile.c
===================================================================
--- linux-2.6.9.orig/kernel/profile.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/kernel/profile.c	2004-12-22 17:23:43.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.9/mm/shmem.c
===================================================================
--- linux-2.6.9.orig/mm/shmem.c	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/shmem.c	2004-12-22 17:23:43.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c	2004-12-22 17:23:43.000000000 -0800
@@ -77,7 +77,6 @@
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -88,8 +87,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	clear_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }

Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd == NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -107,10 +105,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -141,20 +137,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.9.orig/arch/i386/mm/pgtable.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/i386/mm/pgtable.c	2004-12-22 17:23:43.000000000 -0800
@@ -132,10 +132,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -143,12 +140,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.9/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.9.orig/drivers/block/pktcdvd.c	2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/drivers/block/pktcdvd.c	2004-12-22 17:23:43.000000000 -0800
@@ -125,22 +125,19 @@
 	int i;
 	struct packet_data *pkt;

-	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
+	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
 	if (!pkt)
 		goto no_pkt;
-	memset(pkt, 0, sizeof(struct packet_data));

 	pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
 	if (!pkt->w_bio)
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);

Index: linux-2.6.9/arch/m68k/mm/motorola.c
===================================================================
--- linux-2.6.9.orig/arch/m68k/mm/motorola.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/m68k/mm/motorola.c	2004-12-22 17:23:43.000000000 -0800
@@ -50,7 +50,7 @@

 	ptablep = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-	clear_page(ptablep);
+	clear_page(ptablep, 0);
 	__flush_page_to_ram(ptablep);
 	flush_tlb_kernel_page(ptablep);
 	nocache_page(ptablep);
@@ -90,7 +90,7 @@
 	if (((unsigned long)last_pgtable & ~PAGE_MASK) == 0) {
 		last_pgtable = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-		clear_page(last_pgtable);
+		clear_page(last_pgtable, 0);
 		__flush_page_to_ram(last_pgtable);
 		flush_tlb_kernel_page(last_pgtable);
 		nocache_page(last_pgtable);
Index: linux-2.6.9/include/asm-mips/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-mips/pgalloc.h	2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-mips/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -56,9 +56,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);

 	return pte;
 }
Index: linux-2.6.9/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/alpha/mm/init.c	2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/arch/alpha/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -42,10 +42,9 @@
 {
 	pgd_t *ret, *init;

-	ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+	ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-		clear_page(ret);
 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
 		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
 			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
 pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.9/include/asm-parisc/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-parisc/pgalloc.h	2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/include/asm-parisc/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -120,18 +120,14 @@
 static inline struct page *
 pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(page != NULL))
-		clear_page(page_address(page));
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return page;
 }

 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(pte != NULL))
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.9/arch/sh/mm/pg-sh4.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-sh4.c	2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-sh4.c	2004-12-22 17:23:43.000000000 -0800
@@ -34,7 +34,7 @@
 {
 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0)
-		clear_page(to);
+		clear_page(to, 0);
 	else {
 		pgprot_t pgprot = __pgprot(_PAGE_PRESENT |
 					   _PAGE_RW | _PAGE_CACHABLE |
Index: linux-2.6.9/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc64/pgalloc.h	2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -73,10 +73,9 @@
 		struct page *page;

 		preempt_enable();
-		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (page) {
 			ret = (struct page *)page_address(page);
-			clear_page(ret);
 			page->lru.prev = (void *) 2UL;

 			preempt_disable();
Index: linux-2.6.9/include/asm-sh/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh/pgalloc.h	2004-10-18 14:54:08.000000000 -0700
+++ linux-2.6.9/include/asm-sh/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -44,9 +44,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);

 	return pte;
 }
@@ -56,9 +54,7 @@
 {
 	struct page *pte;

-   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
Index: linux-2.6.9/include/asm-m32r/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-m32r/pgalloc.h	2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/include/asm-m32r/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -23,10 +23,7 @@
  */
 static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
-	if (pgd)
-		clear_page(pgd);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pgd;
 }
@@ -39,10 +36,7 @@
 static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pte;
 }
@@ -50,10 +44,8 @@
 static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
 	unsigned long address)
 {
-	struct page *pte = alloc_page(GFP_KERNEL);
+	struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);

-	if (pte)
-		clear_page(page_address(pte));

 	return pte;
 }
Index: linux-2.6.9/arch/um/kernel/mem.c
===================================================================
--- linux-2.6.9.orig/arch/um/kernel/mem.c	2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/arch/um/kernel/mem.c	2004-12-22 17:23:43.000000000 -0800
@@ -307,9 +307,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

@@ -317,9 +315,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_highpage(pte);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.9/arch/ppc64/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/ppc64/mm/init.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/ppc64/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -761,7 +761,7 @@

 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);

 	if (cur_cpu_spec->cpu_features & CPU_FTR_COHERENT_ICACHE)
 		return;
Index: linux-2.6.9/include/asm-sh64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh64/pgalloc.h	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -112,9 +112,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);

 	return pte;
 }
@@ -123,9 +121,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
@@ -150,9 +146,7 @@
 static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pmd)
-		clear_page(pmd);
+	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pmd;
 }

Index: linux-2.6.9/include/asm-cris/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-cris/pgalloc.h	2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/include/asm-cris/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -24,18 +24,14 @@

 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
  	return pte;
 }

 extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.9/arch/ppc/mm/pgtable.c
===================================================================
--- linux-2.6.9.orig/arch/ppc/mm/pgtable.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/ppc/mm/pgtable.c	2004-12-22 17:23:43.000000000 -0800
@@ -85,8 +85,7 @@
 {
 	pgd_t *ret;

-	if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
-		clear_pages(ret, PGDIR_ORDER);
+	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
 	return ret;
 }

@@ -102,7 +101,7 @@
 	extern void *early_get_page(void);

 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (pte) {
 			struct page *ptepage = virt_to_page(pte);
 			ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
 		}
 	} else
 		pte = (pte_t *)early_get_page();
-	if (pte)
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.9/arch/ppc/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/ppc/mm/init.c	2004-10-18 14:53:43.000000000 -0700
+++ linux-2.6.9/arch/ppc/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -595,7 +595,7 @@
 }
 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);
 	clear_bit(PG_arch_1, &pg->flags);
 }

Index: linux-2.6.9/fs/afs/file.c
===================================================================
--- linux-2.6.9.orig/fs/afs/file.c	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/fs/afs/file.c	2004-12-22 17:23:43.000000000 -0800
@@ -172,7 +172,7 @@
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.9/include/asm-alpha/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-alpha/pgalloc.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-alpha/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -40,9 +40,7 @@
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (ret)
-		clear_page(ret);
+	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return ret;
 }

Index: linux-2.6.9/include/linux/highmem.h
===================================================================
--- linux-2.6.9.orig/include/linux/highmem.h	2004-10-18 14:54:54.000000000 -0700
+++ linux-2.6.9/include/linux/highmem.h	2004-12-22 17:23:43.000000000 -0800
@@ -47,7 +47,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.9/arch/sh64/mm/ioremap.c
===================================================================
--- linux-2.6.9.orig/arch/sh64/mm/ioremap.c	2004-10-18 14:54:32.000000000 -0700
+++ linux-2.6.9/arch/sh64/mm/ioremap.c	2004-12-22 17:23:43.000000000 -0800
@@ -399,7 +399,7 @@
 	if (pte_none(*ptep) || !pte_present(*ptep))
 		return;

-	clear_page((void *)ptep);
+	clear_page((void *)ptep, 0);
 	pte_clear(ptep);
 }

Index: linux-2.6.9/include/asm-m68k/motorola_pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-m68k/motorola_pgalloc.h	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/include/asm-m68k/motorola_pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -12,9 +12,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
@@ -31,7 +30,7 @@

 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	pte_t *pte;

 	if(!page)
@@ -39,7 +38,6 @@

 	pte = kmap(page);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
Index: linux-2.6.9/arch/sh/mm/pg-sh7705.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-sh7705.c	2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/arch/sh/mm/pg-sh7705.c	2004-12-22 17:23:43.000000000 -0800
@@ -78,13 +78,13 @@

 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0) {
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	} else {
 		__flush_purge_virtual_region(to,
 					     (void *)(address & 0xfffff000),
 					     PAGE_SIZE);
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	}
 }
Index: linux-2.6.9/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/sparc64/mm/init.c	2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/arch/sparc64/mm/init.c	2004-12-22 17:23:43.000000000 -0800
@@ -1687,13 +1687,12 @@
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
 	 */
-	mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+	mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (mem_map_zero == NULL) {
 		prom_printf("paging_init: Cannot alloc zero page.\n");
 		prom_halt();
 	}
 	SetPageReserved(mem_map_zero);
-	clear_page(page_address(mem_map_zero));

 	codepages = (((unsigned long) _etext) - ((unsigned long) _start));
 	codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.9/include/asm-arm/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-arm/pgalloc.h	2004-10-18 14:55:27.000000000 -0700
+++ linux-2.6.9/include/asm-arm/pgalloc.h	2004-12-22 17:23:43.000000000 -0800
@@ -50,9 +50,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
 		pte += PTRS_PER_PTE;
 	}
@@ -65,10 +64,9 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	if (pte) {
 		void *page = page_address(pte);
-		clear_page(page);
 		clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
 	}





* Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
@ 2004-12-23 19:33         ` Christoph Lameter
  2004-12-24  8:33           ` Pavel Machek
                             ` (2 more replies)
  2004-12-23 19:34         ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
                           ` (2 subsequent siblings)
  3 siblings, 3 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:33 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

o Extend clear_page to take an order parameter for all architectures (the
  resulting calling convention is sketched after the arch lists below).

Known to work:

ia64
i386

Trivial modification expected to simply work:

arm
cris
h8300
m68k
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

x86_64
s390
alpha
sparc64
sh
mips
m32r
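
For reference, the resulting calling convention (zero_pages is a
hypothetical helper, not part of the patch):

	/* Zero 2^order contiguous pages in one call instead of looping
	   over order-0 clears of each page. */
	static inline void zero_pages(struct page *page, int order)
	{
		clear_page(page_address(page), order);
	}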

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/include/asm-ia64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/page.h	2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.9/include/asm-i386/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/page.h	2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-i386/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/page.h	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-sparc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-sparc/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.9/include/asm-s390/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/page.h	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/include/asm-s390/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.9/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.9.orig/arch/i386/lib/mmx.c	2004-10-18 14:54:23.000000000 -0700
+++ linux-2.6.9/arch/i386/lib/mmx.c	2004-12-23 07:44:14.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.9/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/mmx.h	2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/mmx.h	2004-12-23 07:44:14.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.9/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/ia64/lib/clear_page.S	2004-10-18 14:53:10.000000000 -0700
+++ linux-2.6.9/arch/ia64/lib/clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.9/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/x86_64/lib/clear_page.S	2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/arch/x86_64/lib/clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -7,6 +7,9 @@
 clear_page:
+	movl	%esi,%ecx	/* second argument: order */
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	movl   $4096/64,%edx
+	shll   %cl,%edx		/* number of 64-byte chunks to clear */
+	movl   %edx,%ecx
 	.p2align 4
.Lloop:
 	decl	%ecx
@@ -42,6 +45,9 @@
 	.section .altinstr_replacement,"ax"
 clear_page_c:
-	movl $4096/8,%ecx
+	movl %esi,%ecx		/* second argument: order */
+	movl $4096/8,%eax
+	shll %cl,%eax		/* number of quadwords to store */
+	movl %eax,%ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.9/include/asm-sh/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh/page.h	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-sh/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.9/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/mmx.h	2004-10-18 14:54:27.000000000 -0700
+++ linux-2.6.9/include/asm-i386/mmx.h	2004-12-23 07:44:14.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.9/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/alpha/lib/clear_page.S	2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/arch/alpha/lib/clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.9/include/asm-sh64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh64/page.h	2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -50,12 +50,20 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+		sh64_page_clear(page + nr * PAGE_SIZE);
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.9/include/asm-h8300/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-h8300/page.h	2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/include/asm-h8300/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-arm/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-arm/page.h	2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-arm/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.9/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ppc64/page.h	2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-ppc64/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
 {
 	unsigned long lines, line_size;

 	line_size = systemcfg->dCacheL1LineSize;
-	lines = naca->dCacheL1LinesPerPage;
+	lines = naca->dCacheL1LinesPerPage << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.9/include/asm-m32r/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-m32r/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-m32r/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-alpha/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-alpha/page.h	2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-alpha/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.9/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.9.orig/arch/mips/mm/pg-sb1.c	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/arch/mips/mm/pg-sb1.c	2004-12-23 07:44:14.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.9/include/asm-m68k/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-m68k/page.h	2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/include/asm-m68k/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.9/include/asm-mips/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-mips/page.h	2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-mips/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.9/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-m68knommu/page.h	2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/include/asm-m68knommu/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-cris/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-cris/page.h	2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/include/asm-cris/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.9/include/asm-v850/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-v850/page.h	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/include/asm-v850/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.9/include/asm-parisc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-parisc/page.h	2004-10-18 14:53:43.000000000 -0700
+++ linux-2.6.9/include/asm-parisc/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.9/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.9.orig/arch/arm/mm/copypage-v6.c	2004-12-23 07:44:04.000000000 -0800
+++ linux-2.6.9/arch/arm/mm/copypage-v6.c	2004-12-23 07:44:14.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	clear_page((void *)to, 0);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.9/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.9.orig/arch/m32r/mm/page.S	2004-10-18 14:54:31.000000000 -0700
+++ linux-2.6.9/arch/m32r/mm/page.S	2004-12-23 07:44:14.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.9/include/asm-ppc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ppc/page.h	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-ppc/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.9/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-22 16:48:13.000000000 -0800
+++ linux-2.6.9/arch/alpha/kernel/alpha_ksyms.c	2004-12-23 07:44:14.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.9/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/alpha/lib/ev6-clear_page.S	2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/arch/alpha/lib/ev6-clear_page.S	2004-12-23 07:44:14.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.9/arch/sh/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/init.c	2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/init.c	2004-12-23 07:44:14.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.9/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-dma.c	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-dma.c	2004-12-23 07:44:14.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.9/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-nommu.c	2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-nommu.c	2004-12-23 07:44:14.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.9/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.9.orig/arch/mips/mm/pg-r4k.c	2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/mips/mm/pg-r4k.c	2004-12-23 07:44:14.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.9/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/m32r/kernel/m32r_ksyms.c	2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/arch/m32r/kernel/m32r_ksyms.c	2004-12-23 07:44:14.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.9/include/asm-arm26/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-arm26/page.h	2004-10-18 14:54:39.000000000 -0700
+++ linux-2.6.9/include/asm-arm26/page.h	2004-12-23 07:44:14.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
@ 2004-12-23 19:34         ` Christoph Lameter
  2004-12-23 19:35         ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
  2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
  3 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:34 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

o Add page zeroing
o Add scrub daemon
o Add the ability to view the amount of zeroed memory in /proc/meminfo

Signed-off-by: Christoph Lameter <clameter@sgi.com>
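
The allocator behavior this builds on can be seen from the caller side.
A minimal sketch, assuming only the __GFP_ZERO semantics from patch 1 of
this series (the function and caller names here are illustrative):

	#include <linux/mm.h>
	#include <linux/gfp.h>

	/*
	 * Request a zeroed order-0 page.  With this patch the request is
	 * served from the ZEROED free lists (or the new pcp[2] list of
	 * zeroed pages) whenever kscrubd has pages ready; otherwise the
	 * allocator falls back to clearing the page at allocation time.
	 */
	static void *alloc_zeroed_buffer(void)
	{
		struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);

		if (!page)
			return NULL;
		return page_address(page);
	}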

Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c	2004-12-22 13:31:02.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c	2004-12-22 14:24:56.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -32,6 +33,7 @@
 #include <linux/sysctl.h>
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>

@@ -179,7 +181,7 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
 		struct zone *zone, struct free_area *area, unsigned int order)
 {
 	unsigned long page_idx, index, mask;
@@ -192,11 +194,10 @@
 		BUG();
 	index = page_idx >> (1 + order);

-	zone->free_pages += 1 << order;
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

-		BUG_ON(area >= zone->free_area + MAX_ORDER);
+		BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
 		if (!__test_and_change_bit(index, area->map))
 			/*
 			 * the buddy page is still allocated.
@@ -216,6 +217,7 @@
 		page_idx &= mask;
 	}
 	list_add(&(base + page_idx)->lru, &area->free_list);
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -258,7 +260,7 @@
 	int ret = 0;

 	base = zone->zone_mem_map;
-	area = zone->free_area + order;
+	area = zone->free_area[NOT_ZEROED] + order;
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
@@ -266,7 +268,10 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, area, order);
+		zone->free_pages += 1 << order;
+		if (__free_pages_bulk(page, base, zone, area, order)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -288,6 +293,21 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page)
+{
+	unsigned long flags;
+	int order = page->index;
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	zone->zero_pages += 1 << order;
+	__free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
 #define MARK_USED(index, order, area) \
 	__change_bit((index) >> (1+(order)), (area)->map)

@@ -366,25 +386,46 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static inline void rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+	list_del(&page->lru);
+	if (order != MAX_ORDER-1)
+		MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+
+		rmpage(page, zone, area, order);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
-	unsigned int index;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		index = page - zone->zone_mem_map;
-		if (current_order != MAX_ORDER-1)
-			MARK_USED(index, current_order, area);
+		rmpage(page, zone, area, current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, index, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
 	}

 	return NULL;
@@ -396,7 +437,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -405,7 +446,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -546,7 +587,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order == 0) {
 		struct per_cpu_pages *pcp;
@@ -555,7 +598,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -567,19 +610,30 @@

 	if (page == NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+
+		page = __rmqueue(zone, order, zero);
+
+		/*
+		 * If we failed to obtain a zero and/or unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
-		prep_new_page(page, order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);

-		if (gfp_flags & __GFP_ZERO) {
+		if ((gfp_flags & __GFP_ZERO) && !zero) {
 #ifdef CONFIG_HIGHMEM
 			if (PageHighMem(page)) {
-				int n = 1 << order;
+				int n = nr_pages;

 				while (n-- >0)
 					clear_highpage(page + n);
@@ -587,6 +641,7 @@
 #endif
 			clear_page(page_address(page), order);
 		}
+		prep_new_page(page, order);
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -974,7 +1029,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -982,27 +1037,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n, o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1039,6 +1098,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1051,6 +1111,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1071,10 +1132,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1082,20 +1143,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1146,7 +1208,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
 			nr = 0;
-			list_for_each(elem, &zone->free_area[order].free_list)
+			list_for_each(elem, &zone->free_area[NOT_ZEROED][order].free_list)
 				++nr;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
@@ -1470,14 +1532,18 @@
 	for (order = 0; ; order++) {
 		unsigned long bitmap_size;

-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
 		if (order == MAX_ORDER-1) {
-			zone->free_area[order].map = NULL;
+			zone->free_area[NOT_ZEROED][order].map = NULL;
+			zone->free_area[ZEROED][order].map = NULL;
 			break;
 		}

 		bitmap_size = pages_to_bitmap_size(order, size);
-		zone->free_area[order].map =
+		zone->free_area[NOT_ZEROED][order].map =
+		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+		zone->free_area[ZEROED][order].map =
 		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
 	}
 }
@@ -1503,6 +1569,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);

 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
@@ -1525,6 +1592,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1558,6 +1626,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1687,7 +1762,7 @@
 			unsigned long nr_bufs = 0;
 			struct list_head *elem;

-			list_for_each(elem, &(zone->free_area[order].free_list))
+			list_for_each(elem, &(zone->free_area[NOT_ZEROED][order].free_list))
 				++nr_bufs;
 			seq_printf(m, "%6lu ", nr_bufs);
 		}
Index: linux-2.6.9/include/linux/mmzone.h
===================================================================
--- linux-2.6.9.orig/include/linux/mmzone.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/mmzone.h	2004-12-22 14:24:56.000000000 -0800
@@ -51,7 +51,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -107,10 +107,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -265,6 +269,9 @@
 	struct pglist_data *pgdat_next;
 	wait_queue_head_t       kswapd_wait;
 	struct task_struct *kswapd;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone);

Index: linux-2.6.9/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.9.orig/fs/proc/proc_misc.c	2004-12-17 14:40:15.000000000 -0800
+++ linux-2.6.9/fs/proc/proc_misc.c	2004-12-22 14:24:56.000000000 -0800
@@ -158,13 +158,14 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long vmtot;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -187,6 +188,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -210,6 +212,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.9/mm/readahead.c
===================================================================
--- linux-2.6.9.orig/mm/readahead.c	2004-10-18 14:53:11.000000000 -0700
+++ linux-2.6.9/mm/readahead.c	2004-12-22 14:24:56.000000000 -0800
@@ -570,7 +570,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.9/drivers/base/node.c
===================================================================
--- linux-2.6.9.orig/drivers/base/node.c	2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/drivers/base/node.c	2004-12-22 14:24:56.000000000 -0800
@@ -41,13 +41,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -57,6 +59,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-22 14:24:56.000000000 -0800
@@ -715,6 +715,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.9/mm/Makefile
===================================================================
--- linux-2.6.9.orig/mm/Makefile	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/Makefile	2004-12-22 14:24:56.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.9/mm/scrubd.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/mm/scrubd.c	2004-12-22 14:26:35.000000000 -0800
@@ -0,0 +1,146 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = MAX_ORDER;		/* Off */
+unsigned int sysctl_scrub_stop = 2;	/* Minimum order of page to zero */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for (order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area, order);
+			struct list_head *l;
+
+			if (!page)
+				continue;
+
+			page->index = order;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+				unsigned long size = PAGE_SIZE << order;
+
+				if (driver->start(page_address(page), size) == 0) {
+
+					unsigned ticks = (size*HZ)/driver->rate;
+					if (ticks) {
+						/* Wait the minimum time of the transfer */
+						current->state = TASK_INTERRUPTIBLE;
+						schedule_timeout(ticks);
+					}
+					/* Then keep on checking until transfer is complete */
+					while (!driver->check())
+						schedule();
+					goto out;
+				}
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			clear_page(page_address(page), order);
+out:
+			end_zero_page(page);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.9/include/linux/scrub.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/scrub.h	2004-12-22 14:24:56.000000000 -0800
@@ -0,0 +1,48 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start)(void *, unsigned long);		/* Start bzero transfer */
+	int (*check)(void);				/* Check if bzero is complete */
+	unsigned long rate;				/* zeroing rate in bytes/sec */
+        struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+        if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+                return;
+        wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.9/kernel/sysctl.c
===================================================================
--- linux-2.6.9.orig/kernel/sysctl.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c	2004-12-22 14:24:56.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -816,6 +817,24 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.9/include/linux/sysctl.h
===================================================================
--- linux-2.6.9.orig/include/linux/sysctl.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h	2004-12-22 14:24:56.000000000 -0800
@@ -168,6 +168,8 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_SCRUB_START=30,	/* page order at which kscrubd starts zeroing */
+	VM_SCRUB_STOP=31,	/* minimum page order that kscrubd will zero */
 };




^ permalink raw reply	[flat|nested] 87+ messages in thread

* Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
  2004-12-23 19:34         ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
@ 2004-12-23 19:35         ` Christoph Lameter
  2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
  3 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:35 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

o Zeroing driver implemented with the Block Transfer Engine in the Altix SN2 SHub

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/bte.c	2004-12-22 12:48:23.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,7 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +136,6 @@
 			if (bte == NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
 		}
 	} while (1);

-	if (notification == NULL) {
+	if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +195,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +454,37 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+static int bte_check_bzero(void)
+{
+	int node = get_nasid();
+
+	return *(bte_zero_notify[node]) != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	int node = get_nasid();
+
+	/* Check limitations:
+		1. The system must be running (weird things happen during bootup).
+		2. Size must be at least ~64KB (the check uses 60000 bytes);
+		   smaller requests cause too much BTE traffic. */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+	.check = bte_check_bzero,
+	.rate = 500000000,		/* 500 MB/sec */
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c	2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/setup.c	2004-12-22 12:28:00.000000000 -0800
@@ -243,6 +243,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.9/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/sn/bte.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/sn/bte.h	2004-12-22 12:28:00.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
@ 2004-12-23 19:49       ` Arjan van de Ven
  2004-12-23 20:57       ` Matt Mackall
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 87+ messages in thread
From: Arjan van de Ven @ 2004-12-23 19:49 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel


> The most expensive operation in the page fault handler is (apart from SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 bytes each) must be loaded and later written back. This patch makes it
> possible to avoid loading all cachelines if only a part of the cachelines
> of that page is needed immediately after the fault.

eh why will all cachelines be loaded? Surely you can avoid the write-
allocate behavior for this case.....



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
                           ` (2 preceding siblings ...)
  2004-12-23 19:35         ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
@ 2004-12-23 20:08         ` Brian Gerst
  2004-12-24 16:24           ` Christoph Lameter
  3 siblings, 1 reply; 87+ messages in thread
From: Brian Gerst @ 2004-12-23 20:08 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter wrote:
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.
> 
> o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set
> 
> o Replace all page zeroing after allocating pages by request for
>   zeroed pages.
> 
> o requires arch updates to clear_page in order to function properly.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 

> @@ -125,22 +125,19 @@
>  	int i;
>  	struct packet_data *pkt;
> 
> -	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
> +	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
>  	if (!pkt)
>  		goto no_pkt;
> -	memset(pkt, 0, sizeof(struct packet_data));
> 
>  	pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
>  	if (!pkt->w_bio)

This part is wrong.  kmalloc() uses the slab allocator instead of 
getting a full page.
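
That is, a kmalloc() object is usually recycled from the cache's own
free lists without ever passing back through the page allocator, so the
__GFP_ZERO cannot be relied on to zero it; the memset needs to stay.
Sketch of the unchanged, safe form:

	pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
	if (!pkt)
		goto no_pkt;
	memset(pkt, 0, sizeof(struct packet_data));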

--
				Brian Gerst

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
  2004-12-23 19:49       ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
@ 2004-12-23 20:57       ` Matt Mackall
  2004-12-23 21:01       ` Paul Mackerras
  2004-12-23 21:11       ` Paul Mackerras
  4 siblings, 0 replies; 87+ messages in thread
From: Matt Mackall @ 2004-12-23 20:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Thu, Dec 23, 2004 at 11:29:10AM -0800, Christoph Lameter wrote:
> 2. Hardware support for offloading zeroing from the cpu. This avoids
> the invalidation of the cpu caches by extensive zeroing operations.

I'm wondering if it would be possible to use typical video cards for
hardware zeroing. We could set aside a page's worth of zeros in video
memory and then use the card's DMA engines to clear pages on the host.

This could be done in fbdev drivers, which would register a zeroer
with the core.
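
A rough sketch against the zero_driver interface from patch 3/4; every
fb_dma_* helper below is a hypothetical stand-in for whatever the fbdev
driver actually exposes:

	#include <linux/scrub.h>
	#include <asm/io.h>

	static int fb_start_bzero(void *p, unsigned long len)
	{
		/* Ask the card to DMA its resident page of zeros over
		 * [p, p + len); fb_dma_fill_zero() is hypothetical. */
		return fb_dma_fill_zero(virt_to_phys(p), len);
	}

	static int fb_check_bzero(void)
	{
		return fb_dma_idle();	/* hypothetical completion poll */
	}

	static struct zero_driver fb_bzero = {
		.start	= fb_start_bzero,
		.check	= fb_check_bzero,
		.rate	= 100000000,	/* guess at sustained fill rate */
	};

	static int __init fb_bzero_init(void)
	{
		register_zero_driver(&fb_bzero);
		return 0;
	}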

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
                         ` (2 preceding siblings ...)
  2004-12-23 20:57       ` Matt Mackall
@ 2004-12-23 21:01       ` Paul Mackerras
  2004-12-23 21:11       ` Paul Mackerras
  4 siblings, 0 replies; 87+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:01 UTC (permalink / raw)
  To: Christoph Lameter

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart from SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 bytes each) must be loaded and later written back. This patch makes it
> possible to avoid loading all cachelines if only a part of the cachelines
> of that page is needed immediately after the fault.

On ppc64 we avoid having to zero newly-allocated page table pages by
using a slab cache for them, with a constructor function that zeroes
them.  Page table pages naturally end up being full of zeroes when
they are freed, since ptep_get_and_clear, pmd_clear or pgd_clear has
been used on every non-zero entry by that stage.  Thus there is no
extra work required either when allocating them or freeing them.
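
A minimal sketch of that scheme with the 2.6-era slab API (the cache
name and init hook here are illustrative, not the actual ppc64 code):

	#include <linux/slab.h>

	static kmem_cache_t *pte_cache;

	/* The constructor zeroes an object once, when its slab page is
	 * first populated.  Page-table pages freed back to the cache are
	 * already all-zero, so they can be handed out again without any
	 * further clearing. */
	static void pte_ctor(void *addr, kmem_cache_t *cache, unsigned long flags)
	{
		memset(addr, 0, PAGE_SIZE);
	}

	void pgtable_cache_init(void)
	{
		pte_cache = kmem_cache_create("pte_cache", PAGE_SIZE, PAGE_SIZE,
					      SLAB_HWCACHE_ALIGN, pte_ctor, NULL);
	}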

I don't see any point in your patches for systems which don't have
some magic hardware for zeroing pages.  Your patch seems like a lot of
extra code that only benefits a very small number of machines.

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
                         ` (3 preceding siblings ...)
  2004-12-23 21:01       ` Paul Mackerras
@ 2004-12-23 21:11       ` Paul Mackerras
  2004-12-23 21:37         ` Andrew Morton
  2004-12-23 21:48         ` Linus Torvalds
  4 siblings, 2 replies; 87+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:11 UTC (permalink / raw)
  To: Christoph Lameter

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart from SMP
> locking overhead) the zeroing of the page.

Re-reading this I see that you mean the zeroing of the page that is
mapped into the process address space, not the page table pages.  So
ignore my previous reply.

Do you have any statistics on how often a page fault needs to supply a
page of zeroes versus supplying a copy of an existing page, for real
applications?

In any case, unless you have magic page-zeroing hardware, I am still
inclined to think that zeroing the page at the time of the fault is
the most efficient, since that means the page will be hot in the cache
for the process to use.  If you zero it earlier using CPU stores, it
can only cause more overall memory traffic, as far as I can see.

I did some measurements once on my G5 powermac (running a ppc64 linux
kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
page.  This is real-life elapsed time in the kernel, not just some
cache-hot benchmark measurement.  Thus I don't think your patch will
gain us anything on ppc64.

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11       ` Paul Mackerras
@ 2004-12-23 21:37         ` Andrew Morton
  2004-12-23 23:00           ` Paul Mackerras
  2004-12-23 21:48         ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2004-12-23 21:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> Christoph Lameter writes:
> 
> > The most expensive operation in the page fault handler is (apart from SMP
> > locking overhead) the zeroing of the page.
> 
> Re-reading this I see that you mean the zeroing of the page that is
> mapped into the process address space, not the page table pages.  So
> ignore my previous reply.
> 
> Do you have any statistics on how often a page fault needs to supply a
> page of zeroes versus supplying a copy of an existing page, for real
> applications?

When the workload is a gcc run, the pagefault handler dominates the system
time.  That's the page zeroing.

> In any case, unless you have magic page-zeroing hardware, I am still
> inclined to think that zeroing the page at the time of the fault is
> the most efficient, since that means the page will be hot in the cache
> for the process to use.  If you zero it earlier using CPU stores, it
> can only cause more overall memory traffic, as far as I can see.

x86's movnta instructions provide a way of initialising memory without
trashing the caches, and they have pretty good bandwidth, I believe.  We should
wire that up to these patches and see if it speeds things up.
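
A cache-bypassing clear loop along those lines might look roughly like this
(sketch only, not part of the patches; assumes SSE2 and 4-byte stores):

static void clear_page_nocache(void *page)
{
	char *p = page;
	int i;

	/* non-temporal stores: the zeroes go to memory without
	 * allocating cache lines; the sfence orders them at the end */
	for (i = 0; i < PAGE_SIZE; i += 16)
		asm volatile("movnti %1, 0(%0)\n\t"
			     "movnti %1, 4(%0)\n\t"
			     "movnti %1, 8(%0)\n\t"
			     "movnti %1, 12(%0)"
			     : : "r" (p + i), "r" (0) : "memory");
	asm volatile("sfence" : : : "memory");
}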

> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.

40GB/s.  Is that straight into L1 or does the measurement include writeback?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11       ` Paul Mackerras
  2004-12-23 21:37         ` Andrew Morton
@ 2004-12-23 21:48         ` Linus Torvalds
  2004-12-23 22:34           ` Zwane Mwaikambo
                             ` (2 more replies)
  1 sibling, 3 replies; 87+ messages in thread
From: Linus Torvalds @ 2004-12-23 21:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Christoph Lameter, Andrew Morton, linux-ia64, torvalds, linux-mm,
	Kernel Mailing List



On Fri, 24 Dec 2004, Paul Mackerras wrote:
> 
> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.  This is real-life elapsed time in the kernel, not just some
> cache-hot benchmark measurement.  Thus I don't think your patch will
> gain us anything on ppc64.

Well, the thing is, if we really _know_ the machine is idle (and not just 
waiting for something like disk IO), it might be a good idea to just 
pre-zero everything we can.

The question to me is whether we can have a good enough heuristic to
notice that it triggers often enough to matter, but seldom enough that it
really won't disturb anybody.

And "disturb" very much includes things like laptop battery life,
scheduling latencies, memory bus traffic _and_ cache contents. 

And I really don't see a very good heuristic. Maybe it might literally be
something like "five-second load average goes down to zero" (we've got
fixed-point arithmetic with eleven fractional bits, so we can tune just
how close to "zero" we want to get). The load average is system-wide and
takes disk load (which tends to imply latency-critical work) into account,
so that might actually work out reasonably well as a "the system really is
quiescent" test.

So if we make the "what load is considered low" tunable, a system 
administrator can use that to make it more aggressive. And indeed, you 
might have a cron-job that says "be more aggressive at clearing pages 
between 2AM and 4AM in the morning" or something - if you have so much 
memory that it actually matters if you clear the memory just occasionally.

And the tunable load-average check has another advantage: if you want to 
benchmark it, you can first set it to true zero (basically never), and run 
the benchmark, and then you can set it to something very aggressive ("clear 
pages every five seconds regardless of load") and re-run.

Does this sound sane? Christoph - can you try making the "scrub daemon" do 
that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to 
them), do a "scrub-load" thing that takes a scaled integer, and compares it 
with "avenrun[0]" in kernel/timer.c: calc_load() when the average is 
updated every five seconds..
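
A minimal sketch of that check (sysctl_scrub_load and wakeup_kscrubd are
made-up names here; avenrun[] and its eleven fractional bits are real):

extern unsigned long avenrun[];		/* kernel/timer.c */
int sysctl_scrub_load;			/* scaled like avenrun; 0 = only when idle */

/* called from calc_load() every five seconds, after avenrun is updated */
static inline void scrubd_check_load(void)
{
	if (avenrun[0] <= sysctl_scrub_load)
		wakeup_kscrubd();	/* hypothetical scrub daemon wakeup */
}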

Personally, at least for a desktop usage, I think that the load average 
would work wonderfully well. I know my machines are often at basically 
zero load, and then having low-latency zero-pages when I sit down sounds 
like a good idea. Whether there is _enough_ free memory around for a 
5-second thing to work out well, I have no idea..

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48         ` Linus Torvalds
@ 2004-12-23 22:34           ` Zwane Mwaikambo
  2004-12-24  9:14           ` Arjan van de Ven
  2004-12-24 16:17           ` Christoph Lameter
  2 siblings, 0 replies; 87+ messages in thread
From: Zwane Mwaikambo @ 2004-12-23 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> Personally, at least for a desktop usage, I think that the load average 
> would work wonderfully well. I know my machines are often at basically 
> zero load, and then having low-latency zero-pages when I sit down sounds 
> like a good idea. Whether there is _enough_ free memory around for a 
> 5-second thing to work out well, I have no idea..

Isn't the basic premise very similar to the following paper:

http://www.usenix.org/publications/library/proceedings/osdi99/full_papers/dougan/dougan_html/dougan.html

In fact I thought ppc32 did something akin to this.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:37         ` Andrew Morton
@ 2004-12-23 23:00           ` Paul Mackerras
  0 siblings, 0 replies; 87+ messages in thread
From: Paul Mackerras @ 2004-12-23 23:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Andrew Morton writes:

> When the workload is a gcc run, the pagefault handler dominates the system
> time.  That's the page zeroing.

For a program which uses a lot of heap and doesn't fork, that sounds
reasonable.

> x86's movnta instructions provide a way of initialising memory without
> trashing the caches, and they have pretty good bandwidth, I believe.  We should
> wire that up to these patches and see if it speeds things up.

Yes.  I don't know the movnta instruction, but surely, whatever scheme
is used, there has to be a snoop for every cache line's worth of
memory that is zeroed.

The other point is that having the page hot in the cache may well be a
benefit to the program.  Using any sort of cache-bypassing zeroing
might not actually make things faster, when the user time as well as
the system time is taken into account.

> > I did some measurements once on my G5 powermac (running a ppc64 linux
> > kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> > page.
> 
> 40GB/s.  Is that straight into L1 or does the measurement include writeback?

It is the average elapsed time in clear_page, so it would include the
writeback of any cache lines displaced by the zeroing, but not the
writeback of the newly-zeroed cache lines (which we hope will be
modified by the program before they get written back anyway).

This is using the dcbz (data cache block zero) instruction, which
establishes a cache line in modified state with zero contents without
any memory traffic other than a cache line kill transaction sent to
the other CPUs and possible writeback of a dirty cache line displaced
by the newly-zeroed cache line.  The new cache line is established in
the L2 cache, because the L1 is write-through on the G5, and all
stores and dcbz instructions have to go to the L2 cache.
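
For reference, the core of a dcbz-based clear loop is roughly this (a sketch;
128-byte cache lines as on the G5):

static void clear_page_dcbz(void *page)
{
	int i;

	/* dcbz establishes a zeroed, modified line in the cache
	 * without reading the old contents from RAM */
	for (i = 0; i < PAGE_SIZE; i += 128)
		asm volatile("dcbz 0,%0" : : "r" ((char *)page + i) : "memory");
}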

Thus, on the G5 (and POWER4, which is similar) I don't think there
will be much if any benefit from having pre-zeroed cache-cold pages.
We can establish the zero lines in cache much faster using dcbz than
we can by reading them in from main memory.  If the program uses only
a few cache lines out of each new page, then reading them from memory
might be faster, but that seems unlikely.

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
@ 2004-12-24  8:33           ` Pavel Machek
  2004-12-24 16:18             ` Christoph Lameter
  2004-12-24 17:05           ` David S. Miller
  2005-01-01 10:24           ` Geert Uytterhoeven
  2 siblings, 1 reply; 87+ messages in thread
From: Pavel Machek @ 2004-12-24  8:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Hi!

> o Extend clear_page to take an order parameter for all architectures.
> 

I believe you should leave clear_page() as is, and introduce
clear_pages() with two arguments.
				Pavel

> -extern void clear_page (void *page);
> +extern void clear_page (void *page, int order);
>  extern void copy_page (void *to, void *from);
> 

-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48         ` Linus Torvalds
  2004-12-23 22:34           ` Zwane Mwaikambo
@ 2004-12-24  9:14           ` Arjan van de Ven
  2004-12-24 18:21             ` Linus Torvalds
  2004-12-24 16:17           ` Christoph Lameter
  2 siblings, 1 reply; 87+ messages in thread
From: Arjan van de Ven @ 2004-12-24  9:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List


> Personally, at least for a desktop usage, I think that the load average 
> would work wonderfully well. I know my machines are often at basically 
> zero load, and then having low-latency zero-pages when I sit down sounds 
> like a good idea. Whether there is _enough_ free memory around for a 
> 5-second thing to work out well, I have no idea..

problem is.. will it buy you anything if you use the page again
anyway... since such pages will be cold cached now. So for sure some of
it is only shifting latency from kernel side to userspace side, but
readprofile doesn't measure the latter so it *looks* better...



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48         ` Linus Torvalds
  2004-12-23 22:34           ` Zwane Mwaikambo
  2004-12-24  9:14           ` Arjan van de Ven
@ 2004-12-24 16:17           ` Christoph Lameter
  2 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Andrew Morton, linux-ia64, linux-mm, Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> So if we make the "what load is considered low" tunable, a system
> administrator can use that to make it more aggressive. And indeed, you
> might have a cron-job that says "be more aggressive at clearing pages
> between 2AM and 4AM in the morning" or something - if you have so much
> memory that it actually matters if you clear the memory just occasionally.
>
> And the tunable load-average check has another advantage: if you want to
> benchmark it, you can first set it to true zero (basically never), and run
> the benchmark, and then you can set it to something very aggressive ("clear
> pages every five seconds regardless of load") and re-run.
>
> Does this sound sane? Christoph - can you try making the "scrub daemon" do
> that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to
> them), do a "scrub-load" thing that takes a scaled integer, and compares it
> with "avenrun[0]" in kernel/timer.c: calc_load() when the average is
> updated every five seconds..

Sure, V3 will have that. So far the impact of zeroing is quite minimal
on IA64 (even without using hardware); the big zeroing happens immediately
after activating it anyway. I have not seen any measurable effect on
benchmarks even with 4G allocations on a 6G machine.

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

Each CPU can do a couple of gigabytes of zeroing per second, and the
zeroing zeroes local RAM. On my 6G machine with 8 CPUs it takes only
a fraction of a second to zero all RAM.

Merry Christmas, I am off until next year. SGI mandatory holiday
shutdown, so all addicts have to go cold turkey ;-)


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-24  8:33           ` Pavel Machek
@ 2004-12-24 16:18             ` Christoph Lameter
  2004-12-24 16:27               ` Pavel Machek
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:18 UTC (permalink / raw)
  To: Pavel Machek; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004, Pavel Machek wrote:

> Hi!
>
> > o Extend clear_page to take an order parameter for all architectures.
> >
>
> I believe you should leave clear_page() as is, and introduce
> clear_pages() with two arguments.

Did that in V1 and Andi Kleen complained about it.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
  2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
@ 2004-12-24 16:24           ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:24 UTC (permalink / raw)
  To: Brian Gerst; +Cc: linux-ia64, linux-mm, linux-kernel

On Thu, 23 Dec 2004, Brian Gerst wrote:

> This part is wrong.  kmalloc() uses the slab allocator instead of
> getting a full page.

Thanks for finding that. V3 will have that fixed.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-24 16:18             ` Christoph Lameter
@ 2004-12-24 16:27               ` Pavel Machek
  2004-12-24 17:02                 ` David S. Miller
  0 siblings, 1 reply; 87+ messages in thread
From: Pavel Machek @ 2004-12-24 16:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Hi!

> > > o Extend clear_page to take an order parameter for all architectures.
> > >
> >
> > I believe you should leave clear_page() as is, and introduce
> > clear_pages() with two arguments.
> 
> Did that in V1 and Andi Kleen complained about it.

I do not know what Andi said, but having clear_page clearing two
page*s* seems wrong to me.
								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-24 16:27               ` Pavel Machek
@ 2004-12-24 17:02                 ` David S. Miller
  0 siblings, 0 replies; 87+ messages in thread
From: David S. Miller @ 2004-12-24 17:02 UTC (permalink / raw)
  To: Pavel Machek; +Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004 17:27:45 +0100
Pavel Machek <pavel@ucw.cz> wrote:

> I do not know what Andi said, but having clear_page clearing two
> page*s* seems wrong to me.

It's represented by a single top-level page struct regardless
of its order, so in that sense it's indeed a single page
no matter its order.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
  2004-12-24  8:33           ` Pavel Machek
@ 2004-12-24 17:05           ` David S. Miller
  2004-12-27 22:48             ` David S. Miller
  2005-01-03 17:52             ` Christoph Lameter
  2005-01-01 10:24           ` Geert Uytterhoeven
  2 siblings, 2 replies; 87+ messages in thread
From: David S. Miller @ 2004-12-24 17:05 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> Modification made but it would be good to have some feedback from the arch maintainers:
> 
 ...
> sparc64

I don't see any sparc64 bits in this patch, else I'd
review them :-)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24  9:14           ` Arjan van de Ven
@ 2004-12-24 18:21             ` Linus Torvalds
  2004-12-24 18:57               ` Arjan van de Ven
  2004-12-27 22:50               ` David S. Miller
  0 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2004-12-24 18:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List



On Fri, 24 Dec 2004, Arjan van de Ven wrote:
> 
> problem is.. will it buy you anything if you use the page again
> anyway... since such pages will be cold cached now. So for sure some of
> it is only shifting latency from kernel side to userspace side, but
> readprofile doesn't measure the latter so it *looks* better...

Absolutely. I would want to see some real benchmarks before we do this.  
Not just some microbenchmark of "how many page faults can we take without
_using_ the page at all".

I agree 100% with you that we shouldn't shift the costs around. Having a
nice hot-spot that we know about is a good thing, and it means that
performance profiles show what the time is really spent on. Often getting
rid of the hotspot just smears out the work over a wider area, making
other optimizations (like trying to make the memory footprint _smaller_
and removing the work entirely that way) totally impossible because now
the performance profile just has a constant background noise and you can't 
tell what the real problem is.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [0/3]: Overview
  2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
                       ` (3 preceding siblings ...)
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-24 18:31     ` Andrea Arcangeli
  2005-01-03 17:54       ` Christoph Lameter
  4 siblings, 1 reply; 87+ messages in thread
From: Andrea Arcangeli @ 2004-12-24 18:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
	torvalds, linux-mm, linux-kernel

Did you notice I already implemented full PG_zero caching here with
prezeroing on top of it?

	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2
	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2-no-zerolist-reserve-1

I was about to push this in SP1, but it was a bit late.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24 18:21             ` Linus Torvalds
@ 2004-12-24 18:57               ` Arjan van de Ven
  2004-12-27 22:50               ` David S. Miller
  1 sibling, 0 replies; 87+ messages in thread
From: Arjan van de Ven @ 2004-12-24 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Fri, 2004-12-24 at 10:21 -0800, Linus Torvalds wrote:
> 
> On Fri, 24 Dec 2004, Arjan van de Ven wrote:
> > 
> > problem is.. will it buy you anything if you use the page again
> > anyway... since such pages will be cold cached now. So for sure some of
> > it is only shifting latency from kernel side to userspace side, but
> > readprofile doesn't measure the latter so it *looks* better...
> 
> Absolutely. I would want to see some real benchmarks before we do this.  
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".
> 
> I agree 100% with you that we shouldn't shift the costs around. Having a
> nice hot-spot that we know about is a good thing, and it means that
> performance profiles show what the time is really spent on. Often getting
> rid of the hotspot just smears out the work over a wider area, making
> other optimizations (like trying to make the memory footprint _smaller_
> and removing the work entirely that way) totally impossible because now
> the performance profile just has a constant background noise and you can't 
> tell what the real problem is.

I suspect it's even worse.
Think about it; you can spew 4k of zeroes into your L1 cache really fast
(assuming your cpu is smart enough to avoid write-allocate for rep
stosl; not sure which cpus are). I suspect you can do that faster than a
cachemiss or two. And at that point the page is cache hot... so reads
don't miss either.

All this makes me wonder if there is any scenario where this thing will
be a gain, other than cpus that aren't smart enough to avoid the
write-allocate.
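
(For reference, the conventional cache-allocating loop in question is
essentially this sketch:)

static void clear_page_rep(void *page)
{
	unsigned long cnt = PAGE_SIZE / 4;

	/* fill the page with zeroes, 4 bytes at a time; whether the cpu
	 * write-allocates cache lines here is exactly the point above */
	asm volatile("rep; stosl"
		     : "+D" (page), "+c" (cnt)
		     : "a" (0)
		     : "memory");
}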



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-24 17:05           ` David S. Miller
@ 2004-12-27 22:48             ` David S. Miller
  2005-01-03 17:52             ` Christoph Lameter
  1 sibling, 0 replies; 87+ messages in thread
From: David S. Miller @ 2004-12-27 22:48 UTC (permalink / raw)
  To: David S. Miller
  Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004 09:05:39 -0800
"David S. Miller" <davem@davemloft.net> wrote:

> On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > Modification made but it would be good to have some feedback from the arch maintainers:
> > 
>  ...
> > sparc64
> 
> I don't see any sparc64 bits in this patch, else I'd
> review them :-)

So I found time to implement the missing sparc64 clear_page()
changes; here they are:

===== arch/sparc64/lib/clear_page.S 1.1 vs edited =====
--- 1.1/arch/sparc64/lib/clear_page.S	2004-08-08 19:54:07 -07:00
+++ edited/arch/sparc64/lib/clear_page.S	2004-12-24 08:53:29 -08:00
@@ -28,9 +28,12 @@
 	.text
 
 	.globl		_clear_page
-_clear_page:		/* %o0=dest */
+_clear_page:		/* %o0=dest, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1
 
 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@ clear_user_page:	/* %o0=dest, %o1=vaddr 
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate
 
+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1
 
 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8
===== include/asm-sparc64/page.h 1.19 vs edited =====
--- 1.19/include/asm-sparc64/page.h	2004-07-27 12:54:49 -07:00
+++ edited/include/asm-sparc64/page.h	2004-12-24 08:52:17 -08:00
@@ -14,8 +14,8 @@
 
 #ifndef __ASSEMBLY__
 
-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24 18:21             ` Linus Torvalds
  2004-12-24 18:57               ` Arjan van de Ven
@ 2004-12-27 22:50               ` David S. Miller
  2004-12-28 11:53                 ` Marcelo Tosatti
  1 sibling, 1 reply; 87+ messages in thread
From: David S. Miller @ 2004-12-27 22:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: arjan, paulus, clameter, akpm, linux-ia64, linux-mm, linux-kernel

On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> Absolutely. I would want to see some real benchmarks before we do this.  
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".

Here's my small contribution.  I did three "make -j3 vmlinux" timed
runs each, one set running a kernel without the pre-zeroing stuff applied,
one set with it applied.  It did shave a few seconds off the build
consistently.  Here is the before:

real	8m35.248s
user	15m54.132s
sys	1m1.098s

real	8m32.202s
user	15m54.329s
sys	1m0.229s

real	8m31.932s
user	15m54.160s
sys	1m0.245s

and here is the after:

real	8m29.375s
user	15m43.296s
sys	0m59.549s

real	8m28.213s
user	15m39.819s
sys	0m58.790s

real	8m26.140s
user	15m44.145s
sys	0m58.872s

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-27 22:50               ` David S. Miller
@ 2004-12-28 11:53                 ` Marcelo Tosatti
  0 siblings, 0 replies; 87+ messages in thread
From: Marcelo Tosatti @ 2004-12-28 11:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, arjan, paulus, clameter, akpm, linux-ia64,
	linux-mm, linux-kernel

On Mon, Dec 27, 2004 at 02:50:57PM -0800, David S. Miller wrote:
> On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
> > Absolutely. I would want to see some real benchmarks before we do this.  
> > Not just some microbenchmark of "how many page faults can we take without
> > _using_ the page at all".
> 
> Here's my small contribution.  I did three "make -j3 vmlinux" timed
> runs each, one set running a kernel without the pre-zeroing stuff applied,
> one set with it applied.  It did shave a few seconds off the build
> consistently.  Here is the before:
> 
> real	8m35.248s
> user	15m54.132s
> sys	1m1.098s
> 
> real	8m32.202s
> user	15m54.329s
> sys	1m0.229s
> 
> real	8m31.932s
> user	15m54.160s
> sys	1m0.245s
> 
> and here is the after:
> 
> real	8m29.375s
> user	15m43.296s
> sys	0m59.549s
> 
> real	8m28.213s
> user	15m39.819s
> sys	0m58.790s
> 
> real	8m26.140s
> user	15m44.145s
> sys	0m58.872s

Christoph and other SGI fellows,

Get your patch into STP; once it's there we can do some wider x86 benchmarking
easily.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
  2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
@ 2005-01-01  2:22       ` Nick Piggin
  2005-01-01  2:55         ` pmarques
  0 siblings, 1 reply; 87+ messages in thread
From: Nick Piggin @ 2005-01-01  2:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

Christoph Lameter wrote:
> o Add page zeroing
> o Add scrub daemon
> o Add ability to view amount of zeroed information in /proc/meminfo
> 

I quite like how you're handling the page zeroing now. It seems
less intrusive and cleaner in its interface to the page allocator.

I think this is pretty close to what I'd be happy with if we decide
to go with zeroing.

Just one small comment - there is a patch in the -mm tree that may
be of use to you; mm-keep-count-of-free-areas.patch is used later
by kswapd to handle and account higher order free areas properly.
You may be able to use it to better implement triggers/watermarks
for the scrub daemon.

Also...

> +
> +/*
> + * zero_highest_order_page takes a page off the freelist
> + * and then hands it off to block zeroing agents.
> + * The cleared pages are added to the back of
> + * the freelist where the page allocator may pick them up.
> + */
> +int zero_highest_order_page(struct zone *z)
> +{
> +	int order;
> +
> +	for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
> +		struct free_area *area = z->free_area[NOT_ZEROED] + order;
> +		if (!list_empty(&area->free_list)) {
> +			struct page *page = scrubd_rmpage(z, area, order);
> +			struct list_head *l;
> +
> +			if (!page)
> +				continue;
> +
> +			page->index = order;
> +
> +			list_for_each(l, &zero_drivers) {
> +				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
> +				unsigned long size = PAGE_SIZE << order;
> +
> +				if (driver->start(page_address(page), size) == 0) {
> +
> +					unsigned ticks = (size*HZ)/driver->rate;
> +					if (ticks) {
> +						/* Wait the minimum time of the transfer */
> +						current->state = TASK_INTERRUPTIBLE;
> +						schedule_timeout(ticks);
> +					}
> +					/* Then keep on checking until transfer is complete */
> +					while (!driver->check())
> +						schedule();
> +					goto out;
> +				}

Would you be better off just having a driver->zero_me(...) call, with this
logic pushed into the drivers (like your BTE) which need it? I'm thinking this
would help flexibility if you had, say, a BTE-thingy that did an interrupt on
completion, or if it was done synchronously by the CPU with cache-bypassing
stores.
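
Something like this, say (a hypothetical sketch of the suggested interface,
not code from the patch):

struct zero_driver {
	struct list_head list;
	/* zero size bytes at addr and return when done; an interrupt-driven
	 * engine can sleep on a completion here, a cpu-based one just runs */
	int (*zero_me)(void *addr, unsigned long size);
};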

Also, would there be any use in passing a batch of pages to the zeroing driver?
That may improve performance on some implementations, but could also cut down
the inefficiency in your timeout mechanism due to timer quantization (I guess
probably not much if you are only zeroing quite large areas).

BTW, that while loop is basically a busy-wait. Not a critical problem, but you
may want to renice scrubd to the lowest scheduling priority to be a bit nicer?
(I think you'd want to do that anyway). And put a cpu_relax() call in there?

Just some suggestions.

Nick

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
  2005-01-01  2:22       ` Nick Piggin
@ 2005-01-01  2:55         ` pmarques
  0 siblings, 0 replies; 87+ messages in thread
From: pmarques @ 2005-01-01  2:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Luck, Tony, Robin Holt, Adam Litke,
	linux-ia64, torvalds, linux-mm, linux-kernel

Quoting Nick Piggin <nickpiggin@yahoo.com.au>:
> [...]
> Would you be better off to just have a driver->zero_me(...) call, with this
> logic pushed into those like your BTE which need it? I'm thinking this would
> help flexibility if you had say a BTE-thingy that did an interrupt on
> completion, or if it was done synchronously by the CPU with cache bypassing
> stores.

It seems that people in this discussion are assuming that PCs don't have
hardware to do this at all.

While there is no _official_ hardware, a bt878 with the brightness setting all
the way down, at 1024 pixels per line, 32 bits per pixel, would be able to zero
a full physical page in under 60 microseconds (one PAL scanline: 1024 pixels
at 4 bytes each is exactly one 4kB page). It could even zero a _list_ of pages
passed to it and generate an interrupt at the end.

This is just an example, and there might be some problems in the implementation
details that make it impossible to work, but there might also be more hardware
out there that could perform similar functions (graphics cards?).

This might not be worth the bother *at all*, but I can imagine some weird
conversation between two sysadmins:
  "My server is wasting a lot of time handling page faults."
  "Why don't you install a video acquisition board with a bt878 chip? It did
wonders on my server."
  "Yes, I've also heard that a radeon graphics card can really accelerate kernel
compiles."

Well, just my 0.02 euro :)

--
Paulo Marques - www.grupopie.com

"A journey of a thousand miles begins with a single step."
Lao-tzu, The Way of Lao-tzu


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
  2004-12-24  8:33           ` Pavel Machek
  2004-12-24 17:05           ` David S. Miller
@ 2005-01-01 10:24           ` Geert Uytterhoeven
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
  2 siblings, 1 reply; 87+ messages in thread
From: Geert Uytterhoeven @ 2005-01-01 10:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

On Thu, 23 Dec 2004, Christoph Lameter wrote:
> o Extend clear_page to take an order parameter for all architectures.

> Index: linux-2.6.9/include/asm-m68k/page.h
> ===================================================================
> --- linux-2.6.9.orig/include/asm-m68k/page.h	2004-10-18 14:55:36.000000000 -0700
> +++ linux-2.6.9/include/asm-m68k/page.h	2004-12-23 07:44:14.000000000 -0800
> @@ -50,7 +50,7 @@
>  		       );
>  }
> 
> -static inline void clear_page(void *page)
> +static inline void clear_page(void *page, int order)
>  {
>  	unsigned long tmp;
>  	unsigned long *sp = page;
> @@ -69,16 +69,16 @@
>  			     "dbra   %1,1b\n\t"
>  			     : "=a" (sp), "=d" (tmp)
>  			     : "a" (page), "0" (sp),
> -			       "1" ((PAGE_SIZE - 16) / 16 - 1));
> +			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
>  }
> 
>  #else
> -#define clear_page(page)	memset((page), 0, PAGE_SIZE)
> +#define clear_page(page, 0)	memset((page), 0, PAGE_SIZE << (order))
                            ^
			    order

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
  2004-12-24 17:05           ` David S. Miller
  2004-12-27 22:48             ` David S. Miller
@ 2005-01-03 17:52             ` Christoph Lameter
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-03 17:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004, David S. Miller wrote:

> On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Modification made but it would be good to have some feedback from the arch maintainers:
> >
>  ...
> > sparc64
>
> I don't see any sparc64 bits in this patch, else I'd
> review them :-)
>

Sorry, here it is:

Index: linux-2.6.9/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc64/page.h 2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/page.h      2005-01-03 09:50:16.000000000 -0800
@@ -15,7 +15,17 @@
 #ifndef __ASSEMBLY__

 extern void _clear_page(void *page);
-#define clear_page(X)  _clear_page((void *)(X))
+
+static inline void clear_page(void *page, int order)
+{
+       unsigned int nr = 1 << order;
+
+       while (nr-- > 0) {
+               _clear_page(page);
+               page += PAGE_SIZE;
+       }
+}
+
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Increase page fault rate by prezeroing V1 [0/3]: Overview
  2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
@ 2005-01-03 17:54       ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-03 17:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
	torvalds, linux-mm, linux-kernel

On Fri, 24 Dec 2004, Andrea Arcangeli wrote:

> Did you notice I already implemented full PG_zero caching here with
> prezeroing on top of it?
>
> 	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2
> 	http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2-no-zerolist-reserve-1
>
> I was about to push this in SP1, but it was a bit late.

Yes, but this did not do the trick, and the interface to get zeroed pages is
a bit difficult to handle.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Prezeroing V3 [0/4]: Discussion and i386 performance tests
  2005-01-01 10:24           ` Geert Uytterhoeven
@ 2005-01-04 23:12             ` Christoph Lameter
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
                                 ` (3 more replies)
  0 siblings, 4 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:12 UTC (permalink / raw)
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

Change from V2 to V3:
o Updates for clear_page on various platforms
o Performance measurements on i386 (2x PIII-450 384M RAM)
o Port patches to 2.6.10-bk7
o Add scrub_load so that a high load prevents scrubd from running
  (So that people may feel better about this approach. Set by
  default to 999 so it's off. The typical result of not running kscrubd
  under high loads is to slow the system down even further, since zeroing
  large consecutive areas of memory is more efficient than zeroing
  page-size chunks. Memory subsystems are typically optimized for linear
  accesses and reach their peak performance if large areas of memory are
  written to.)
o Various fixes

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running on
SMP systems with a high number of cpus. Single-thread performance shows
only minor increases; the performance of multi-threaded applications,
however, increases significantly.

The most expensive operation in the page fault handler is (apart from SMP
locking overhead) the zeroing of the page. This zeroing means that all
cachelines of the faulted page (on Altix that means all 128 cachelines of
128 bytes each) must be loaded and later written back. This patch makes it
possible to avoid loading all cachelines if only a part of the cachelines
of that page is needed immediately after the fault. Doing so will only be
effective for sparsely accessed memory, which is typical for anonymous
memory and pte maps. Prezeroed pages will only be used for those purposes.
Unzeroed pages will be used as usual for file mapping, page caching etc.

Others have also thought that prezeroing could be a benefit and have tried
to provide zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

However, these attempts tried to zero pages that are likely to be used
soon (and that may have recently been accessed). Elements of these pages
are thus already in the cpu caches. Approaches like that only shift
processing somewhere else and do not bring any performance benefits.
Prezeroing only makes sense for pages that are not currently needed and that
are not in the cpu caches. Pages that have recently been touched and that
will soon be touched again are better hot-zeroed, since the zeroing will
largely be done to cachelines already in the cpu caches.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many pages that will later become order-0 pages being
zeroed in one step.
For that purpose the existing clear_page function is extended and made to
take an additional argument specifying the order of the page to be cleared
(the extended signature is shown after this list).

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.
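
For reference, the extended interface from [2/4] is simply:

extern void clear_page(void *page, int order);	/* zeroes PAGE_SIZE << order bytes */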

The scrub daemon is invoked when an unzeroed page of a certain order has
been generated, so that it is worth running it. If no higher order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay off the
processor as much as possible.

The result is a significant increase in page fault performance even for
single-threaded applications (i386 2x PIII-450 384M RAM, allocating 256M in
each run):

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0   1    1    0.006s      0.389s   0.039s157455.320 157070.694
  0   1    2    0.007s      0.607s   0.032s101476.689 190350.885

w/patch
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0   1    1    0.008s      0.083s   0.009s672151.422 664045.899
  0   1    2    0.005s      0.129s   0.008s459629.796 741857.373

The performance can only be upheld if enough zeroed pages are available.
In a heavy memory-intensive benchmark the system may run out of these very
fast, but the efficient algorithm for page zeroing still makes this a winner
(2-way system with 384MB RAM, no hardware zeroing support). In the following
measurement the test is repeated 10 times, allocating 256M each time in rapid
succession, which quickly depletes the pool of zeroed pages:

w/o patch:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0  10    1    0.058s      3.913s   3.097s157335.774 157076.932
  0  10    2    0.063s      6.139s   3.027s100756.788 190572.486

w/patch
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
  0  10    1    0.059s      1.828s   1.089s330913.517 330225.515
  0  10    2    0.082s      1.951s   1.094s307172.100 320680.232

Note that prezeroing of pages makes no sense if the application
touches all cache lines of an allocated page (for that reason prezeroing has
no influence on benchmarks like lmbench), since the extensive
caching of modern cpus means that the zeroes written to a hot-zeroed page
will be overwritten by the application in the cpu cache, and thus
the zeroes will never make it to memory! The test program used above only
touches one 128-byte cache line of a 16k page (ia64). Sparsely
populated and accessed areas are typical for lots of applications.
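
The access pattern in question is roughly the following (an illustrative
sketch, not the actual test program):

static void touch_sparse(char *buf, unsigned long len)
{
	char *p;

	/* fault in every page but touch only its first cache line */
	for (p = buf; p < buf + len; p += PAGE_SIZE)
		*p = 1;
}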

Here is another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine  User      System   Wall  flt/cpu/s fault/wsec
  1  1    1   1    0.01s      0.12s   0.01s500813.853 497925.891
  1  1    1   2    0.01s      0.11s   0.01s493453.103 472877.725
  1  1    1   4    0.02s      0.10s   0.01s479351.658 471507.415
  1  1    1   8    0.01s      0.13s   0.01s424742.054 416725.013
  1  1    1  16    0.05s      0.12s   0.01s347715.359 336983.834
  1  1    1  32    0.12s      0.13s   0.02s258112.286 256246.731
  1  1    1  64    0.24s      0.14s   0.03s169896.381 168189.283
  1  1    1 128    0.49s      0.14s   0.06s102300.257 101674.435

The benefits of prezeroing dwindle to almost nothing if all
cachelines of a page are touched. Prezeroing can only be effective
if the whole page is not immediately used after the page fault.

The patch is composed of 4 parts:

[1/4] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO flag
	and returns zeroed memory on request. Modifies locations throughout
	the linux sources that retrieve a page and then zero it to request
	a zeroed page.

[2/4] Architecture specific clear_page updates
	Adds second order argument to clear_page and updates all arches.

Note: The first two patches may be used alone if no zeroing engine is wanted.

[3/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background daemon
	called scrubd. scrubd is disabled by default but can be enabled
	by writing an order number to /proc/sys/vm/scrub_start. If a page
	of that order or higher is coalesced then the scrub daemon will
	start zeroing until all pages of order /proc/sys/vm/scrub_stop and
	higher are zeroed, and then go back to sleep.

	In an SMP environment the scrub daemon typically runs
	on the most idle cpu. Thus a single-threaded application running
	on one cpu may have the other cpu zeroing pages for it, etc. The scrub
	daemon is hardly noticeable and usually finishes zeroing quickly since
	most processors are optimized for linear memory filling.

[4/4]	SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into hardware.
	With hardware support there will be minimal impact of zeroing
	on the performance of the system.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
@ 2005-01-04 23:13               ` Christoph Lameter
  2005-01-04 23:45                 ` Dave Hansen
                                   ` (2 more replies)
  2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
                                 ` (2 subsequent siblings)
  3 siblings, 3 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:13 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
requests for zeroed pages from the page allocator.

o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

o Replace all page zeroing after allocating pages by request for
  zeroed pages.

o Requires arch updates to clear_page in order to function properly.
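
As the page_alloc.c hunks below show, callers change from allocate-then-clear
to a single request for zeroed memory (simplified):

	/* before */
	page = alloc_pages(gfp_mask, 0);
	if (page)
		clear_page(page_address(page));

	/* after */
	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);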

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-04 12:16:49.000000000 -0800
@@ -584,6 +584,18 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+			if (PageHighMem(page)) {
+				int n = 1 << order;
+
+				while (n-- > 0)
+					clear_highpage(page + n);
+			} else
+#endif
+			clear_page(page_address(page), order);
+		}
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -796,12 +808,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h	2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h	2005-01-04 12:16:49.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-04 12:16:49.000000000 -0800
@@ -1650,10 +1650,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/kernel/profile.c
===================================================================
--- linux-2.6.10.orig/kernel/profile.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/kernel/profile.c	2005-01-04 12:16:49.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.10/mm/shmem.c
===================================================================
--- linux-2.6.10.orig/mm/shmem.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/mm/shmem.c	2005-01-04 12:16:49.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.10/mm/hugetlb.c
===================================================================
--- linux-2.6.10.orig/mm/hugetlb.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c	2005-01-04 12:16:49.000000000 -0800
@@ -77,7 +77,6 @@
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -88,8 +87,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	clear_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }

Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd == NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -106,10 +104,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -140,20 +136,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/i386/mm/pgtable.c	2005-01-04 12:16:39.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/pgtable.c	2005-01-04 12:16:49.000000000 -0800
@@ -140,10 +140,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -151,12 +148,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/m68k/mm/motorola.c
===================================================================
--- linux-2.6.10.orig/arch/m68k/mm/motorola.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/m68k/mm/motorola.c	2005-01-04 12:16:49.000000000 -0800
@@ -50,7 +50,7 @@

 	ptablep = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-	clear_page(ptablep);
+	clear_page(ptablep, 0);
 	__flush_page_to_ram(ptablep);
 	flush_tlb_kernel_page(ptablep);
 	nocache_page(ptablep);
@@ -90,7 +90,7 @@
 	if (((unsigned long)last_pgtable & ~PAGE_MASK) == 0) {
 		last_pgtable = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE);

-		clear_page(last_pgtable);
+		clear_page(last_pgtable, 0);
 		__flush_page_to_ram(last_pgtable);
 		flush_tlb_kernel_page(last_pgtable);
 		nocache_page(last_pgtable);
Index: linux-2.6.10/include/asm-mips/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/pgalloc.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-mips/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -56,9 +56,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);

 	return pte;
 }
Index: linux-2.6.10/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/mm/init.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/arch/alpha/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -42,10 +42,9 @@
 {
 	pgd_t *ret, *init;

-	ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+	ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-		clear_page(ret);
 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
 		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
 			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
 pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/include/asm-parisc/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/pgalloc.h	2004-12-24 13:35:39.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -120,18 +120,14 @@
 static inline struct page *
 pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(page != NULL))
-		clear_page(page_address(page));
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return page;
 }

 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(pte != NULL))
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/arch/sh/mm/pg-sh4.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-sh4.c	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-sh4.c	2005-01-04 12:16:49.000000000 -0800
@@ -34,7 +34,7 @@
 {
 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0)
-		clear_page(to);
+		clear_page(to, 0);
 	else {
 		pgprot_t pgprot = __pgprot(_PAGE_PRESENT |
 					   _PAGE_RW | _PAGE_CACHABLE |
Index: linux-2.6.10/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -73,10 +73,9 @@
 		struct page *page;

 		preempt_enable();
-		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (page) {
 			ret = (struct page *)page_address(page);
-			clear_page(ret);
 			page->lru.prev = (void *) 2UL;

 			preempt_disable();
Index: linux-2.6.10/include/asm-sh/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/pgalloc.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-sh/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -44,9 +44,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);

 	return pte;
 }
@@ -56,9 +54,7 @@
 {
 	struct page *pte;

-   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
Index: linux-2.6.10/include/asm-m32r/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/pgalloc.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -23,10 +23,7 @@
  */
 static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
-	if (pgd)
-		clear_page(pgd);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pgd;
 }
@@ -39,10 +36,7 @@
 static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pte;
 }
@@ -50,10 +44,8 @@
 static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
 	unsigned long address)
 {
-	struct page *pte = alloc_page(GFP_KERNEL);
+	struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);

-	if (pte)
-		clear_page(page_address(pte));

 	return pte;
 }
Index: linux-2.6.10/arch/um/kernel/mem.c
===================================================================
--- linux-2.6.10.orig/arch/um/kernel/mem.c	2005-01-04 12:16:40.000000000 -0800
+++ linux-2.6.10/arch/um/kernel/mem.c	2005-01-04 12:16:49.000000000 -0800
@@ -327,9 +327,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

@@ -337,9 +335,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_highpage(pte);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc64/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/ppc64/mm/init.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ppc64/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -761,7 +761,7 @@

 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);

 	if (cur_cpu_spec->cpu_features & CPU_FTR_COHERENT_ICACHE)
 		return;
Index: linux-2.6.10/include/asm-sh64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/pgalloc.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -112,9 +112,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);

 	return pte;
 }
@@ -123,9 +121,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
@@ -150,9 +146,7 @@
 static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pmd)
-		clear_page(pmd);
+	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pmd;
 }

Index: linux-2.6.10/include/asm-cris/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/pgalloc.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-cris/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -24,18 +24,14 @@

 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
  	return pte;
 }

 extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/ppc/mm/pgtable.c	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/pgtable.c	2005-01-04 12:16:49.000000000 -0800
@@ -85,8 +85,7 @@
 {
 	pgd_t *ret;

-	if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
-		clear_pages(ret, PGDIR_ORDER);
+	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
 	return ret;
 }

@@ -102,7 +101,7 @@
 	extern void *early_get_page(void);

 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (pte) {
 			struct page *ptepage = virt_to_page(pte);
 			ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
 		}
 	} else
 		pte = (pte_t *)early_get_page();
-	if (pte)
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/ppc/mm/init.c	2005-01-04 12:16:40.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -594,7 +594,7 @@
 }
 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-	clear_page(page);
+	clear_page(page, 0);
 	clear_bit(PG_arch_1, &pg->flags);
 }

Index: linux-2.6.10/fs/afs/file.c
===================================================================
--- linux-2.6.10.orig/fs/afs/file.c	2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c	2005-01-04 12:16:49.000000000 -0800
@@ -172,7 +172,7 @@
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.10/include/asm-alpha/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -40,9 +40,7 @@
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (ret)
-		clear_page(ret);
+	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return ret;
 }

Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-04 12:16:49.000000000 -0800
@@ -45,7 +45,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.10/arch/sh64/mm/ioremap.c
===================================================================
--- linux-2.6.10.orig/arch/sh64/mm/ioremap.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh64/mm/ioremap.c	2005-01-04 12:16:49.000000000 -0800
@@ -399,7 +399,7 @@
 	if (pte_none(*ptep) || !pte_present(*ptep))
 		return;

-	clear_page((void *)ptep);
+	clear_page((void *)ptep, 0);
 	pte_clear(ptep);
 }

Index: linux-2.6.10/include/asm-m68k/motorola_pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/motorola_pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/motorola_pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -12,9 +12,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
@@ -31,7 +30,7 @@

 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	pte_t *pte;

 	if(!page)
@@ -39,7 +38,6 @@

 	pte = kmap(page);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
Index: linux-2.6.10/arch/sh/mm/pg-sh7705.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-sh7705.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-sh7705.c	2005-01-04 12:16:49.000000000 -0800
@@ -78,13 +78,13 @@

 	__set_bit(PG_mapped, &page->flags);
 	if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0) {
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	} else {
 		__flush_purge_virtual_region(to,
 					     (void *)(address & 0xfffff000),
 					     PAGE_SIZE);
-		clear_page(to);
+		clear_page(to, 0);
 		__flush_wback_region(to, PAGE_SIZE);
 	}
 }
Index: linux-2.6.10/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/mm/init.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/init.c	2005-01-04 12:16:49.000000000 -0800
@@ -1687,13 +1687,12 @@
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
 	 */
-	mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+	mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (mem_map_zero == NULL) {
 		prom_printf("paging_init: Cannot alloc zero page.\n");
 		prom_halt();
 	}
 	SetPageReserved(mem_map_zero);
-	clear_page(page_address(mem_map_zero));

 	codepages = (((unsigned long) _etext) - ((unsigned long) _start));
 	codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.10/include/asm-arm/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-arm/pgalloc.h	2005-01-04 12:16:49.000000000 -0800
@@ -50,9 +50,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
 		pte += PTRS_PER_PTE;
 	}
@@ -65,10 +64,9 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	if (pte) {
 		void *page = page_address(pte);
-		clear_page(page);
 		clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
 	}

Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-04 12:16:49.000000000 -0800
@@ -657,7 +657,7 @@
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.10.orig/drivers/block/pktcdvd.c	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/drivers/block/pktcdvd.c	2005-01-04 12:16:49.000000000 -0800
@@ -135,12 +135,10 @@
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);




* Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-04 23:14               ` Christoph Lameter
  2005-01-05 23:25                 ` Christoph Lameter
  2005-01-04 23:15               ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
  2005-01-04 23:16               ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
  3 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:14 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

o Extend clear_page to take an order parameter for all architectures
  (see the conversion sketch after the architecture lists).

Architecture support:
---------------------

Known to work:

ia64
i386
sparc64
m68k

Trivial modifications made; expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modifications made, but feedback from the arch maintainers would be welcome:

x86_64
s390
alpha
sh
mips
m32r
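
The conversions follow one of two patterns. Architectures whose clear_page
is a plain memset simply scale the length by the order; architectures with
an optimized single-page routine rename it to _clear_page and loop over
the compound page. A minimal sketch of both patterns (illustrative only;
_clear_page stands for whatever per-page primitive the architecture
provides):

	/* Pattern 1: memset based architectures scale the length. */
	#define clear_page(page, order) \
		memset((void *)(page), 0, PAGE_SIZE << (order))

	/* Pattern 2: loop an optimized per-page clear over the block. */
	static inline void clear_page(void *page, int order)
	{
		unsigned int nr = 1 << order;

		while (nr-- > 0) {
			_clear_page(page);
			page += PAGE_SIZE;	/* gcc void pointer arithmetic */
		}
	}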

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-04 12:34:03.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-04 12:34:03.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -7,6 +7,7 @@
 clear_page:
 	xorl   %eax,%eax
 	movl   $4096/64,%ecx
+	shl	%esi, %ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -42,6 +43,7 @@
 	.section .altinstr_replacement,"ax"
 clear_page_c:
 	movl $4096/8,%ecx
+	shl	%esi, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-04 12:34:03.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -50,12 +50,20 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+	while (nr-- > 0) {
+		sh64_page_clear(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
 {
 	unsigned long lines, line_size;

 	line_size = systemcfg->dCacheL1LineSize;
-	lines = naca->dCacheL1LinesPerPage;
+	lines = naca->dCacheL1LinesPerPage << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-04 12:34:03.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-04 12:34:03.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-04 12:34:03.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-04 12:34:03.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-04 12:34:03.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-04 12:34:03.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-04 12:34:03.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-04 12:34:03.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-04 12:34:03.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-04 12:34:03.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-04 12:34:03.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0=dest */
+_clear_page:		/* %o0=dest, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8



* Prezeroing V3 [3/4]: Page zeroing through kscrubd
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
@ 2005-01-04 23:15               ` Christoph Lameter
  2005-01-04 23:16               ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
  3 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:15 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

o Add page zeroing
o Add scrub daemon (a sketch of the scrub pass follows below)
o Add ability to view the amount of zeroed memory in /proc/meminfo

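The scrub pass itself is conceptually simple: free_pages_bulk() wakes the
per-node daemon whenever it coalesces a block of order sysctl_scrub_start
or higher; the daemon then pulls blocks off the NOT_ZEROED free lists,
zeroes them and files them under ZEROED through end_zero_page(). A minimal
sketch of such a pass, using only the helpers added by this patch
(illustrative; highmem handling and the wait queue plumbing are omitted):

	static void scrub_pgdat(pg_data_t *pgdat)
	{
		struct zone *zone;
		struct page *page;
		int order;

		for (zone = pgdat->node_zones;
		     zone < pgdat->node_zones + MAX_NR_ZONES; zone++)
			for (order = MAX_ORDER - 1;
			     order >= sysctl_scrub_start; order--)
				while ((page = scrubd_rmpage(zone,
						zone->free_area[NOT_ZEROED] + order,
						order)) != NULL) {
					/* end_zero_page() reads the order here */
					page->index = order;
					clear_page(page_address(page), order);
					end_zero_page(page);
				}
	}
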
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-04 14:17:02.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -33,6 +34,7 @@
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>

@@ -180,7 +182,7 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
 		struct zone *zone, struct free_area *area, unsigned int order)
 {
 	unsigned long page_idx, index, mask;
@@ -193,11 +195,10 @@
 		BUG();
 	index = page_idx >> (1 + order);

-	zone->free_pages += 1 << order;
 	while (order < MAX_ORDER-1) {
 		struct page *buddy1, *buddy2;

-		BUG_ON(area >= zone->free_area + MAX_ORDER);
+		BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
 		if (!__test_and_change_bit(index, area->map))
 			/*
 			 * the buddy page is still allocated.
@@ -219,6 +220,7 @@
 	}
 	list_add(&(base + page_idx)->lru, &area->free_list);
 	area->nr_free++;
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -261,7 +263,7 @@
 	int ret = 0;

 	base = zone->zone_mem_map;
-	area = zone->free_area + order;
+	area = zone->free_area[NOT_ZEROED] + order;
 	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
@@ -269,7 +271,10 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, area, order);
+		zone->free_pages += 1 << order;
+		if (__free_pages_bulk(page, base, zone, area, order)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -291,6 +296,21 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page)
+{
+	unsigned long flags;
+	int order = page->index;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	zone->zero_pages += 1 << order;
+	__free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
 #define MARK_USED(index, order, area) \
 	__change_bit((index) >> (1+(order)), (area)->map)

@@ -370,26 +390,47 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+	list_del(&page->lru);
+	area->nr_free--;
+	if (order != MAX_ORDER-1)
+		MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+
+		rmpage(page, zone, area, order);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
-	unsigned int index;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		area->nr_free--;
-		index = page - zone->zone_mem_map;
-		if (current_order != MAX_ORDER-1)
-			MARK_USED(index, current_order, area);
+		rmpage(page, zone, area, current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, index, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
 	}

 	return NULL;
@@ -401,7 +442,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -410,7 +451,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -457,7 +498,7 @@
 		ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

 	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+		list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
 			unsigned long start_pfn, i;

 			start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -555,7 +596,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order == 0) {
 		struct per_cpu_pages *pcp;
@@ -564,7 +607,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -576,19 +619,30 @@

 	if (page == NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+
+		page = __rmqueue(zone, order, zero);
+
+		/*
+		 * If we failed to obtain a zero and/or unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
+
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
-		prep_new_page(page, order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);

-		if (gfp_flags & __GFP_ZERO) {
+		if ((gfp_flags & __GFP_ZERO) && !zero) {
 #ifdef CONFIG_HIGHMEM
 			if (PageHighMem(page)) {
-				int n = 1 << order;
+				int n = nr_pages;

 				while (n-- >0)
 					clear_highpage(page + n);
@@ -596,6 +650,7 @@
 #endif
 			clear_page(page_address(page), order);
 		}
+		prep_new_page(page, order);
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -622,7 +677,7 @@
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
+		free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free) << o;

 		/* Require fewer higher order pages to be free */
 		min >>= 1;
@@ -1000,7 +1055,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -1008,27 +1063,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n, o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1065,6 +1124,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1077,6 +1137,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1097,10 +1158,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1108,20 +1169,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1170,7 +1232,7 @@

 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
-			nr = zone->free_area[order].nr_free;
+			nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
 		}
@@ -1493,16 +1555,21 @@
 	for (order = 0; ; order++) {
 		unsigned long bitmap_size;

-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
 		if (order == MAX_ORDER-1) {
-			zone->free_area[order].map = NULL;
+			zone->free_area[NOT_ZEROED][order].map = NULL;
+			zone->free_area[ZEROED][order].map = NULL;
 			break;
 		}

 		bitmap_size = pages_to_bitmap_size(order, size);
-		zone->free_area[order].map =
+		zone->free_area[NOT_ZEROED][order].map =
+		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+		zone->free_area[ZEROED][order].map =
 		  (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
-		zone->free_area[order].nr_free = 0;
+		zone->free_area[NOT_ZEROED][order].nr_free = 0;
+		zone->free_area[ZEROED][order].nr_free = 0;
 	}
 }

@@ -1527,6 +1594,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);
 	pgdat->kswapd_max_order = 0;

 	for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1550,6 +1618,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1583,6 +1652,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1708,7 +1784,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
 		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+			seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
Index: linux-2.6.10/include/linux/mmzone.h
===================================================================
--- linux-2.6.10.orig/include/linux/mmzone.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h	2005-01-04 14:17:02.000000000 -0800
@@ -52,7 +52,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -108,10 +108,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -132,7 +136,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -267,6 +271,9 @@
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -276,9 +283,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c	2005-01-04 14:17:02.000000000 -0800
@@ -158,13 +158,14 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long vmtot;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -187,6 +188,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -210,6 +212,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
===================================================================
--- linux-2.6.10.orig/mm/readahead.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/readahead.c	2005-01-04 14:17:02.000000000 -0800
@@ -573,7 +573,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.10/drivers/base/node.c
===================================================================
--- linux-2.6.10.orig/drivers/base/node.c	2005-01-04 14:17:00.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c	2005-01-04 14:17:02.000000000 -0800
@@ -41,13 +41,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -57,6 +59,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-01-04 14:17:02.000000000 -0800
@@ -715,6 +715,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
===================================================================
--- linux-2.6.10.orig/mm/Makefile	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/mm/Makefile	2005-01-04 14:17:02.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.10/mm/scrubd.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c	2005-01-04 14:58:46.000000000 -0800
@@ -0,0 +1,147 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 7;	/* if a page of this order is coalesced then run kscrubd */
+unsigned int sysctl_scrub_stop = 2;	/* Minimum order of page to zero */
+unsigned int sysctl_scrub_load = 999;	/* Do not run scrubd if load exceeds this value */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for (order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area, order);
+			struct list_head *l;
+
+			if (!page)
+				continue;
+
+			page->index = order;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+				unsigned long size = PAGE_SIZE << order;
+
+				if (driver->start(page_address(page), size) == 0) {
+
+					unsigned ticks = (size*HZ)/driver->rate;
+					if (ticks) {
+						/* Wait the minimum time of the transfer */
+						current->state = TASK_INTERRUPTIBLE;
+						schedule_timeout(ticks);
+					}
+					/* Then keep on checking until transfer is complete */
+					while (!driver->check())
+						schedule();
+					goto out;
+				}
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			clear_page(page_address(page), order);
+out:
+			end_zero_page(page);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h	2005-01-04 14:17:02.000000000 -0800
@@ -0,0 +1,51 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+	int (*start)(void *, unsigned long);		/* Start bzero transfer */
+	int (*check)(void);				/* Check if bzero is complete */
+	unsigned long rate;				/* zeroing rate in bytes/sec */
+	struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+	if (avenrun[0] >= (unsigned long)sysctl_scrub_load << FSHIFT)
+		return;
+	if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+		return;
+	wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-01-04 14:17:02.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -826,6 +827,33 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_LOAD,
+		.procname	= "scrub_load",
+		.data		= &sysctl_scrub_load,
+		.maxlen		= sizeof(sysctl_scrub_load),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-01-04 14:17:02.000000000 -0800
@@ -169,6 +169,9 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_SCRUB_START=30,	/* page order at which kscrubd starts zeroing */
+	VM_SCRUB_STOP=31,	/* minimum page order that kscrubd zeroes */
+	VM_SCRUB_LOAD=32,	/* Load factor at which not to scrub anymore */
 };
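
The net effect for allocator users: a request like (sketch)

	struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);

can then be satisfied straight from the ZEROED free lists whenever
kscrubd has been able to keep them populated, so the page fault
handler no longer runs clear_page() on its hot path.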




^ permalink raw reply	[flat|nested] 87+ messages in thread

* Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix
  2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
                                 ` (2 preceding siblings ...)
  2005-01-04 23:15               ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
@ 2005-01-04 23:16               ` Christoph Lameter
  2005-01-05  2:16                 ` Andi Kleen
  3 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:16 UTC (permalink / raw)
  To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

o Zeroing driver implemented with the Block Transfer Engine in the Altix
  SN2 SHub.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/bte.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/bte.c	2005-01-03 13:36:07.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,7 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +136,6 @@
 			if (bte == NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
 		}
 	} while (1);

-	if (notification == NULL) {
+	if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +195,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +454,37 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+static int bte_check_bzero(void)
+{
+	int node = get_nasid();
+
+	return *(bte_zero_notify[node]) != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	int node = get_nasid();
+
+	/* Check limitations.
+		1. System must be running (weird things happen during bootup)
+		2. Size >64KB. Smaller requests cause too much bte traffic
+	 */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+	.check = bte_check_bzero,
+	.rate = 500000000		/* 500 MB/sec */
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.10/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/setup.c	2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/setup.c	2005-01-03 13:36:07.000000000 -0800
@@ -243,6 +243,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.10/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/sn/bte.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/sn/bte.h	2005-01-03 13:36:07.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-04 23:45                 ` Dave Hansen
  2005-01-05  1:16                   ` Christoph Lameter
  2005-01-05  0:34                 ` Linus Torvalds
  2005-01-08 21:12                 ` Hugh Dickins
  2 siblings, 1 reply; 87+ messages in thread
From: Dave Hansen @ 2005-01-04 23:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

On Tue, 2005-01-04 at 15:13 -0800, Christoph Lameter wrote:
> +		if (gfp_flags & __GFP_ZERO) {
> +#ifdef CONFIG_HIGHMEM
> +			if (PageHighMem(page)) {
> +				int n = 1 << order;
> +
> +				while (n-- >0)
> +					clear_highpage(page + n);
> +			} else
> +#endif
> +			clear_page(page_address(page), order);
> +		}
>  		if (order && (gfp_flags & __GFP_COMP))
>  			prep_compound_page(page, order);

That #ifdef can probably die.  The compiler should get that all by
itself:

> #ifdef CONFIG_HIGHMEM
> #define PageHighMem(page)       test_bit(PG_highmem, &(page)->flags)
> #else
> #define PageHighMem(page)       0 /* needed to optimize away at compile time */
> #endif
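
IOW, without CONFIG_HIGHMEM the test is a compile-time constant zero,
the highmem branch is dead code, and the ifdef-free version should
compile to exactly the same thing (sketch):

	if (gfp_flags & __GFP_ZERO) {
		if (PageHighMem(page)) {
			int n = 1 << order;

			while (n-- > 0)
				clear_highpage(page + n);
		} else
			clear_page(page_address(page), order);
	}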

-- Dave


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2005-01-04 23:45                 ` Dave Hansen
@ 2005-01-05  0:34                 ` Linus Torvalds
  2005-01-05  0:47                   ` Andrew Morton
  2005-01-08 21:12                 ` Hugh Dickins
  2 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2005-01-05  0:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-ia64, linux-mm, Linux Kernel Development



On Tue, 4 Jan 2005, Christoph Lameter wrote:
>
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.

Ok, let's start merging this slowly, and in particular, this 1/4 one looks 
pretty much like a cleanup regardless of whatever else happens, so let's 
just do it. However, for it to really be a cleanup, how about making 
_this_ part:

> +
> +		if (gfp_flags & __GFP_ZERO) {
> +#ifdef CONFIG_HIGHMEM
> +			if (PageHighMem(page)) {
> +				int n = 1 << order;
> +
> +				while (n-- >0)
> +					clear_highpage(page + n);
> +			} else
> +#endif
> +			clear_page(page_address(page), order);
> +		}

Match the existing previous part:

>  		if (order && (gfp_flags & __GFP_COMP))
>  			prep_compound_page(page, order);


and just split it up into a "prep_zero_page(page, order)"? I dislike 
#ifdef's in the middle of deep functions. In the middle of a _trivial_ 
function it's much more palatable.
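
Something like this, maybe (untested, modulo the highmem details):

	static inline void prep_zero_page(struct page *page, int order)
	{
		int i;

		for (i = 0; i < (1 << order); i++)
			clear_highpage(page + i);
	}

with the call site then just doing

	if (gfp_flags & __GFP_ZERO)
		prep_zero_page(page, order);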

At that point at least part 1 ends up being a nice clean patch on its own, 
and should even shrink the code-size a bit. IOW, it not only is a cleanup, 
there is even a technical argument for it (even without worrying about the 
next stages).

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  0:34                 ` Linus Torvalds
@ 2005-01-05  0:47                   ` Andrew Morton
  2005-01-05  1:15                     ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2005-01-05  0:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: clameter, linux-ia64, linux-mm, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:
>
> On Tue, 4 Jan 2005, Christoph Lameter wrote:
> >
> > This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> > to request zeroed pages from the page allocator.
> 
> Ok, let's start merging this slowly

One week hence, please.  Things like the no-bitmaps-for-the-buddy-allocator
have been well tested and should go in first.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  0:47                   ` Andrew Morton
@ 2005-01-05  1:15                     ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-05  1:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, linux-ia64, linux-mm, linux-kernel

On Tue, 4 Jan 2005, Andrew Morton wrote:

> > Ok, let's start merging this slowly
>
> One week hence, please.  Things like the no-bitmaps-for-the-buddy-allocator
> have been well tested and should go in first.

The first two patches are basically cleanup type stuff and will not affect
the page allocator in a significant way. On the other hand they touch many
files and are thus difficult to maintain.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:45                 ` Dave Hansen
@ 2005-01-05  1:16                   ` Christoph Lameter
  2005-01-05  1:26                     ` Linus Torvalds
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-05  1:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
	Linux Kernel Development

On Tue, 4 Jan 2005, Dave Hansen wrote:

> That #ifdef can probably die.  The compiler should get that all by
> itself:
>
> > #ifdef CONFIG_HIGHMEM
> > #define PageHighMem(page)       test_bit(PG_highmem, &(page)->flags)
> > #else
> > #define PageHighMem(page)       0 /* needed to optimize away at compile time */
> > #endif

Ahh. Great. Do I need to submit a corrected patch that removes those two
lines or is it fine as is?

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  1:16                   ` Christoph Lameter
@ 2005-01-05  1:26                     ` Linus Torvalds
  2005-01-05 23:11                       ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: Linus Torvalds @ 2005-01-05  1:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Hansen, Andrew Morton, linux-ia64, linux-mm,
	Linux Kernel Development



On Tue, 4 Jan 2005, Christoph Lameter wrote:
> 
> Ahh. Great. Do I need to submit a corrected patch that removes those two
> lines or is it fine as is?

Please do split it up into a function of its own. It's going to look a lot 
prettier as an intermediate phase. I realize that that touches #3 in the 
series, but I suspect that one will also just be prettier as a result.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix
  2005-01-04 23:16               ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
@ 2005-01-05  2:16                 ` Andi Kleen
  2005-01-05 16:24                   ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: Andi Kleen @ 2005-01-05  2:16 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> +	/* Check limitations.
> +		1. System must be running (weird things happen during bootup)
> +		2. Size >64KB. Smaller requests cause too much bte traffic
> +	 */
> +	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
> +		return EINVAL;

surely return -EINVAL; 

Also have you thought about doing a similar driver for x86/x86-64 using
cache bypassing stores? 

-Andi

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix
  2005-01-05  2:16                 ` Andi Kleen
@ 2005-01-05 16:24                   ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-05 16:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 5 Jan 2005, Andi Kleen wrote:

> Christoph Lameter <clameter@sgi.com> writes:
>
> > +	/* Check limitations.
> > +		1. System must be running (weird things happen during bootup)
> > +		2. Size >64KB. Smaller requests cause too much bte traffic
> > +	 */
> > +	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
> > +		return EINVAL;
>
> surely return -EINVAL;

Anything will do as long as it's != 0. But yeah, that would more closely
follow convention.

> Also have you thought about doing a similar driver for x86/x86-64 using
> cache bypassing stores?

As you know we do ia64 and I am no expert on x86_64. But the interface for
hardware zeroing is designed for purposes like that.
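
Such a driver would only have to fill in start/check. Untested sketch
(movnti needs SSE2, and the rate is a made-up number that would have
to be measured):

	static int nt_start_bzero(void *p, unsigned long len)
	{
		unsigned long *q = p;
		unsigned long i;

		for (i = 0; i < len / sizeof(unsigned long); i++)
			/* store without allocating a cache line */
			asm volatile("movnti %1, %0" : "=m" (q[i]) : "r" (0UL));
		asm volatile("sfence" : : : "memory");
		return 0;
	}

	static int nt_check_bzero(void)
	{
		return 1;	/* start() completed synchronously */
	}

	static struct zero_driver nt_bzero = {
		.start	= nt_start_bzero,
		.check	= nt_check_bzero,
		.rate	= 2000000000,	/* made up, measure it */
	};

plus a register_zero_driver(&nt_bzero) at init time.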

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-05  1:26                     ` Linus Torvalds
@ 2005-01-05 23:11                       ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-05 23:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Andrew Morton, linux-ia64, linux-mm,
	Linux Kernel Development

On Tue, 4 Jan 2005, Linus Torvalds wrote:

> Please do split it up into a function of its own. It's going to look a lot
> prettier as an intermediate phase. I realize that that touches #3 in the
> series, but I suspect that one will also just be prettier as a result.

Here is the first patch redone as you wanted. I also removed all
dependencies on the second patch. This should be able to get in
on its own.
I will send the revised second patch dealing with updating clear_page
later and keep back the last two patches until the bitmap thing has been
changed in the buddy allocator.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

This patch introduces __GFP_ZERO as an additional gfp_mask element that
allows requesting zeroed pages from the page allocator.

- Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set

- Replaces explicit page zeroing after allocation with allocations
  using __GFP_ZERO

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-05 09:32:52.000000000 -0800
@@ -549,6 +549,12 @@
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
+static inline void prep_zero_page(struct page *page, int order) {
+	int i;
+
+	for (i = 0; i < (1 << order); i++)
+		clear_highpage(page + i);
+}

 static struct page *
 buffered_rmqueue(struct zone *zone, int order, int gfp_flags)
@@ -584,6 +590,10 @@
 		BUG_ON(bad_range(zone, page));
 		mod_page_state_zone(zone, pgalloc, 1 << order);
 		prep_new_page(page, order);
+
+		if (gfp_flags & __GFP_ZERO)
+			prep_zero_page(page, order);
+
 		if (order && (gfp_flags & __GFP_COMP))
 			prep_compound_page(page, order);
 	}
@@ -796,12 +806,9 @@
 	 */
 	BUG_ON(gfp_mask & __GFP_HIGHMEM);

-	page = alloc_pages(gfp_mask, 0);
-	if (page) {
-		void *address = page_address(page);
-		clear_page(address);
-		return (unsigned long) address;
-	}
+	page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+	if (page)
+		return (unsigned long) page_address(page);
 	return 0;
 }

Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h	2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h	2005-01-05 09:30:39.000000000 -0800
@@ -37,6 +37,7 @@
 #define __GFP_NORETRY	0x1000	/* Do not retry.  Might fail */
 #define __GFP_NO_GROW	0x2000	/* Slab internal usage */
 #define __GFP_COMP	0x4000	/* Add compound page metadata */
+#define __GFP_ZERO	0x8000	/* Return zeroed page on success */

 #define __GFP_BITS_SHIFT 16	/* Room for 16 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)

 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-05 09:30:39.000000000 -0800
@@ -1650,10 +1650,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/kernel/profile.c
===================================================================
--- linux-2.6.10.orig/kernel/profile.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/kernel/profile.c	2005-01-05 09:30:39.000000000 -0800
@@ -326,17 +326,15 @@
 		node = cpu_to_node(cpu);
 		per_cpu(cpu_profile_flip, cpu) = 0;
 		if (!per_cpu(cpu_profile_hits, cpu)[1]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				return NOTIFY_BAD;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
 		}
 		if (!per_cpu(cpu_profile_hits, cpu)[0]) {
-			page = alloc_pages_node(node, GFP_KERNEL, 0);
+			page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 			if (!page)
 				goto out_free;
-			clear_highpage(page);
 			per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
 		}
 		break;
@@ -510,16 +508,14 @@
 		int node = cpu_to_node(cpu);
 		struct page *page;

-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
-		page = alloc_pages_node(node, GFP_KERNEL, 0);
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
 		if (!page)
 			goto out_cleanup;
-		clear_highpage(page);
 		per_cpu(cpu_profile_hits, cpu)[0]
 				= (struct profile_hit *)page_address(page);
 	}
Index: linux-2.6.10/mm/shmem.c
===================================================================
--- linux-2.6.10.orig/mm/shmem.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/mm/shmem.c	2005-01-05 09:30:39.000000000 -0800
@@ -369,9 +369,8 @@
 		}

 		spin_unlock(&info->lock);
-		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+		page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
 		if (page) {
-			clear_highpage(page);
 			page->nr_swapped = 0;
 		}
 		spin_lock(&info->lock);
@@ -910,7 +909,7 @@
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -926,7 +925,7 @@
 shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
 				 unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(gfp | __GFP_ZERO);
 }
 #endif

@@ -1135,7 +1134,6 @@

 		info->alloced++;
 		spin_unlock(&info->lock);
-		clear_highpage(filepage);
 		flush_dcache_page(filepage);
 		SetPageUptodate(filepage);
 	}
Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -61,9 +61,7 @@
 	pgd_t *pgd = pgd_alloc_one_fast(mm);

 	if (unlikely(pgd == NULL)) {
-		pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-		if (likely(pgd != NULL))
-			clear_page(pgd);
+		pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
 	}
 	return pgd;
 }
@@ -106,10 +104,8 @@
 static inline pmd_t*
 pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pmd != NULL))
-		clear_page(pmd);
 	return pmd;
 }

@@ -140,20 +136,16 @@
 static inline struct page *
 pte_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

-	if (likely(pte != NULL))
-		clear_page(page_address(pte));
 	return pte;
 }

 static inline pte_t *
 pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);

-	if (likely(pte != NULL))
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/i386/mm/pgtable.c	2005-01-04 14:16:59.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/pgtable.c	2005-01-05 09:30:39.000000000 -0800
@@ -140,10 +140,7 @@

 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
-	return pte;
+	return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 }

 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -151,12 +148,10 @@
 	struct page *pte;

 #ifdef CONFIG_HIGHPTE
-	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
 #else
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 #endif
-	if (pte)
-		clear_highpage(pte);
 	return pte;
 }

Index: linux-2.6.10/include/asm-mips/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/pgalloc.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-mips/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -56,9 +56,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);

 	return pte;
 }
Index: linux-2.6.10/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/mm/init.c	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/arch/alpha/mm/init.c	2005-01-05 09:30:39.000000000 -0800
@@ -42,10 +42,9 @@
 {
 	pgd_t *ret, *init;

-	ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+	ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
 	init = pgd_offset(&init_mm, 0UL);
 	if (ret) {
-		clear_page(ret);
 #ifdef CONFIG_ALPHA_LARGE_VMALLOC
 		memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
 			(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
 pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/include/asm-parisc/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/pgalloc.h	2004-12-24 13:35:39.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -120,18 +120,14 @@
 static inline struct page *
 pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(page != NULL))
-		clear_page(page_address(page));
+	struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return page;
 }

 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (likely(pte != NULL))
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

Index: linux-2.6.10/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -73,10 +73,9 @@
 		struct page *page;

 		preempt_enable();
-		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (page) {
 			ret = (struct page *)page_address(page);
-			clear_page(ret);
 			page->lru.prev = (void *) 2UL;

 			preempt_disable();
Index: linux-2.6.10/include/asm-sh/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/pgalloc.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-sh/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -44,9 +44,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);

 	return pte;
 }
@@ -56,9 +54,7 @@
 {
 	struct page *pte;

-   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+   	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
Index: linux-2.6.10/include/asm-m32r/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/pgalloc.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -23,10 +23,7 @@
  */
 static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
-	if (pgd)
-		clear_page(pgd);
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pgd;
 }
@@ -39,10 +36,7 @@
 static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 	unsigned long address)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
-	if (pte)
-		clear_page(pte);
+	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);

 	return pte;
 }
@@ -50,10 +44,8 @@
 static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
 	unsigned long address)
 {
-	struct page *pte = alloc_page(GFP_KERNEL);
+	struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);

-	if (pte)
-		clear_page(page_address(pte));

 	return pte;
 }
Index: linux-2.6.10/arch/um/kernel/mem.c
===================================================================
--- linux-2.6.10.orig/arch/um/kernel/mem.c	2005-01-04 14:17:00.000000000 -0800
+++ linux-2.6.10/arch/um/kernel/mem.c	2005-01-05 09:30:39.000000000 -0800
@@ -327,9 +327,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pte;
 }

@@ -337,9 +335,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_highpage(pte);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/include/asm-sh64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/pgalloc.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -112,9 +112,7 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);

 	return pte;
 }
@@ -123,9 +121,7 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);

 	return pte;
 }
@@ -150,9 +146,7 @@
 static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pmd)
-		clear_page(pmd);
+	pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return pmd;
 }

Index: linux-2.6.10/include/asm-cris/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/pgalloc.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-cris/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -24,18 +24,14 @@

 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (pte)
-		clear_page(pte);
+  	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
  	return pte;
 }

 extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
-	if (pte)
-		clear_page(page_address(pte));
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	return pte;
 }

Index: linux-2.6.10/arch/ppc/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/ppc/mm/pgtable.c	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/pgtable.c	2005-01-05 09:30:39.000000000 -0800
@@ -85,8 +85,7 @@
 {
 	pgd_t *ret;

-	if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
-		clear_pages(ret, PGDIR_ORDER);
+	ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
 	return ret;
 }

@@ -102,7 +101,7 @@
 	extern void *early_get_page(void);

 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+		pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 		if (pte) {
 			struct page *ptepage = virt_to_page(pte);
 			ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
 		}
 	} else
 		pte = (pte_t *)early_get_page();
-	if (pte)
-		clear_page(pte);
 	return pte;
 }

Index: linux-2.6.10/include/asm-alpha/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -40,9 +40,7 @@
 static inline pmd_t *
 pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-	if (ret)
-		clear_page(ret);
+	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	return ret;
 }

Index: linux-2.6.10/include/asm-m68k/motorola_pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/motorola_pgalloc.h	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/motorola_pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -12,9 +12,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
@@ -31,7 +30,7 @@

 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	pte_t *pte;

 	if(!page)
@@ -39,7 +38,6 @@

 	pte = kmap(page);
 	if (pte) {
-		clear_page(pte);
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
 		nocache_page(pte);
Index: linux-2.6.10/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/mm/init.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/init.c	2005-01-05 09:30:39.000000000 -0800
@@ -1687,13 +1687,12 @@
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
 	 */
-	mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+	mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (mem_map_zero == NULL) {
 		prom_printf("paging_init: Cannot alloc zero page.\n");
 		prom_halt();
 	}
 	SetPageReserved(mem_map_zero);
-	clear_page(page_address(mem_map_zero));

 	codepages = (((unsigned long) _etext) - ((unsigned long) _start));
 	codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.10/include/asm-arm/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/pgalloc.h	2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-arm/pgalloc.h	2005-01-05 09:30:39.000000000 -0800
@@ -50,9 +50,8 @@
 {
 	pte_t *pte;

-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
 	if (pte) {
-		clear_page(pte);
 		clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
 		pte += PTRS_PER_PTE;
 	}
@@ -65,10 +64,9 @@
 {
 	struct page *pte;

-	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
 	if (pte) {
 		void *page = page_address(pte);
-		clear_page(page);
 		clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
 	}

Index: linux-2.6.10/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.10.orig/drivers/block/pktcdvd.c	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/drivers/block/pktcdvd.c	2005-01-05 09:30:39.000000000 -0800
@@ -135,12 +135,10 @@
 		goto no_bio;

 	for (i = 0; i < PAGES_PER_PACKET; i++) {
-		pkt->pages[i] = alloc_page(GFP_KERNEL);
+		pkt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
 		if (!pkt->pages[i])
 			goto no_page;
 	}
-	for (i = 0; i < PAGES_PER_PACKET; i++)
-		clear_page(page_address(pkt->pages[i]));

 	spin_lock_init(&pkt->lock);


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
  2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
@ 2005-01-05 23:25                 ` Christoph Lameter
  2005-01-06 13:52                   ` Andi Kleen
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-05 23:25 UTC (permalink / raw)
  To: Linus Torvalds, linux-ia64, Andrew Morton, linux-mm,
	Linux Kernel Development

Here is an updated version that is independent of the first patch and
contains all the necessary modifications to make clear_page take a second
parameter.

Architecture support:
---------------------

Known to work:

ia64
i386
sparc64
m68k

Trivial modification expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

x86_64
s390
alpha
sh
mips
m32r
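
For callers the conversion is mechanical, e.g. (illustrative):

	clear_page(page_address(page), 0);	/* a single page */
	clear_page(addr, order);		/* 2^order pages */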

Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page), order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- >0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-05 10:09:51.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-05 10:09:51.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -7,6 +7,9 @@
 clear_page:
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	movl	%esi,%ecx
+	movl	$4096/64,%edx
+	shll	%cl,%edx
+	movl	%edx,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -42,6 +45,9 @@
 	.section .altinstr_replacement,"ax"
 clear_page_c:
-	movl $4096/8,%ecx
+	movl %esi,%ecx
+	movl $4096/8,%eax
+	shll %cl,%eax
+	movl %eax,%ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-05 10:09:51.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -50,12 +50,21 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+		sh64_page_clear(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
 {
 	unsigned long lines, line_size;

 	line_size = systemcfg->dCacheL1LineSize;
-	lines = naca->dCacheL1LinesPerPage;
+	lines = naca->dCacheL1LinesPerPage << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-05 10:09:51.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-05 10:09:51.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-05 10:09:51.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-05 10:09:51.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-05 10:09:51.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-05 10:09:51.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-05 10:09:51.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-05 10:09:51.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-05 10:09:51.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-05 10:09:51.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-05 10:09:51.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0=dest */
+_clear_page:		/* %o0=dest, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8
Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c	2005-01-05 09:43:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-05 10:09:51.000000000 -0800
@@ -657,7 +657,7 @@
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-05 09:32:52.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-05 10:09:51.000000000 -0800
@@ -550,10 +550,14 @@
  * or two.
  */
 static inline void prep_zero_page(struct page *page, int order) {
-	int i;

-	for(i = 0; i < 1 << order; i++)
-		clear_highpage(page + i);
+	if (PageHighMem(page)) {
+		int i;
+
+		for(i = 0; i < 1 << order; i++)
+			clear_highpage(page + i);
+	} else
+		clear_page(page_address(page), order);
 }

 static struct page *
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-05 10:09:44.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-05 10:10:08.000000000 -0800
@@ -45,7 +45,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
  2005-01-05 23:25                 ` Christoph Lameter
@ 2005-01-06 13:52                   ` Andi Kleen
  2005-01-06 17:47                     ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: Andi Kleen @ 2005-01-06 13:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> Here is an updated version that is independent of the first patch and
> contains all the necessary modifications to make clear_page take a second
> parameter.

I still think the clear_page order addition is completely pointless,
because for > order 0 you probably want a cache bypassing store
in a separate function.
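
To be concrete, what I mean is something along these lines (a
hypothetical sketch with made-up naming, not code from your patch):

static void clear_pages_nocache(void *page, int order)
{
	long *p = page;
	long i, words = (PAGE_SIZE << order) / sizeof(long);

	/* non-temporal stores zero memory without pulling it into the cache */
	for (i = 0; i < words; i++)
		asm volatile("movnti %1,%0" : "=m" (p[i]) : "r" (0L));
	asm volatile("sfence" ::: "memory");
}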

Removing it would also make the patch much less intrusive.

-Andi

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
  2005-01-06 13:52                   ` Andi Kleen
@ 2005-01-06 17:47                     ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-06 17:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thu, 6 Jan 2005, Andi Kleen wrote:

> Christoph Lameter <clameter@sgi.com> writes:
>
> > Here is an updated version that is independent of the first patch and
> > contains all the necessary modifications to make clear_page take a second
> > parameter.
>
> I still think the clear_page order addition is completely pointless,
> because for > order 0 you probably want a cache bypassing store
> in a separate function.

I would think that having clear_page avoid loading cache
lines from memory should be a general improvement.

Bypassing the cache may be beneficial for clear_page in general but I
would like to test that first.

If this is not a win then it may be better to implement cache bypassing
through a zero driver.

> Removing it would also make the patch much less intrusive.

Right. I also thought about that. I will likely offer the clear_page patch
as an optional component in V4. Being able to specify an order with
clear_page also helps in other situations like clearing huge pages.
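
For instance, clearing a huge page then becomes a single call instead of
a loop over base pages (sketch, assuming the order form of clear_page
from this patch):

	clear_page(page_address(page), HPAGE_SHIFT - PAGE_SHIFT);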


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2005-01-04 23:45                 ` Dave Hansen
  2005-01-05  0:34                 ` Linus Torvalds
@ 2005-01-08 21:12                 ` Hugh Dickins
  2005-01-08 21:56                   ` David S. Miller
  2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  2 siblings, 2 replies; 87+ messages in thread
From: Hugh Dickins @ 2005-01-08 21:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, David S. Miller, linux-ia64, Linus Torvalds,
	linux-mm, Linux Kernel Development

On Tue, 4 Jan 2005, Christoph Lameter wrote:
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.
> ...
> --- linux-2.6.10.orig/mm/memory.c	2005-01-04 12:16:41.000000000 -0800
> +++ linux-2.6.10/mm/memory.c	2005-01-04 12:16:49.000000000 -0800
> @@ -1650,10 +1650,9 @@
> 
>  		if (unlikely(anon_vma_prepare(vma)))
>  			goto no_mem;
> -		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> +		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
>  		if (!page)
>  			goto no_mem;
> -		clear_user_highpage(page, addr);
> 
>  		spin_lock(&mm->page_table_lock);
>  		page_table = pte_offset_map(pmd, addr);

Christoph, a late comment: doesn't this effectively replace
do_anonymous_page's clear_user_highpage by clear_highpage, which would
be a bad idea (inefficient? or corrupting?) on those few architectures
which actually do something with that user addr?

Hugh


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-08 21:12                 ` Hugh Dickins
@ 2005-01-08 21:56                   ` David S. Miller
  2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
                                       ` (2 more replies)
  2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
  1 sibling, 3 replies; 87+ messages in thread
From: David S. Miller @ 2005-01-08 21:56 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Sat, 8 Jan 2005 21:12:10 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> Christoph, a late comment: doesn't this effectively replace
> do_anonymous_page's clear_user_highpage by clear_highpage, which would
> be a bad idea (inefficient? or corrupting?) on those few architectures
> which actually do something with that user addr?

Good catch, it probably does.  We really do need to use
the page clearing routines that pass in the user virtual
address when preparing new anonymous pages or else we'll
get cache aliasing problems on sparc, sparc64, and mips
at the very least.  That is what the virtual address argument
was added for in the first place.
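
For reference, the mips clear_user_page in the patch shows the kind of
vaddr-keyed flushing I mean:

	clear_page(addr, 0);
	if (pages_do_alias((unsigned long) addr, vaddr))
		flush_data_cache_page((unsigned long) addr);

Dropping the vaddr argument loses exactly that aliasing check.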

The other way to deal with this is to make whatever routine
the kscrubd thing invokes do all the cache flushing et al.
magic so that the above works when taking pages from the
pre-zero'd pool (only, if no pre-zero'd pages are available
we still need to invoke clear_user_highpage() with the proper
virtual address).

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-08 21:12                 ` Hugh Dickins
  2005-01-08 21:56                   ` David S. Miller
@ 2005-01-10 17:16                   ` Christoph Lameter
  2005-01-10 18:13                     ` Linus Torvalds
  1 sibling, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-10 17:16 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David S. Miller, linux-ia64, Linus Torvalds,
	linux-mm, Linux Kernel Development

On Sat, 8 Jan 2005, Hugh Dickins wrote:

> Christoph, a late comment: doesn't this effectively replace
> do_anonymous_page's clear_user_highpage by clear_highpage, which would
> be a bad idea (inefficient? or corrupting?) on those few architectures
> which actually do something with that user addr?

Yes, you're right: my ia64-centric vision got me again. Thanks for all
the other patches that were posted. I hope this is now all cleared up?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-10 18:13                     ` Linus Torvalds
  2005-01-10 20:17                       ` Christoph Lameter
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  0 siblings, 2 replies; 87+ messages in thread
From: Linus Torvalds @ 2005-01-10 18:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development



On Mon, 10 Jan 2005, Christoph Lameter wrote:
>
> Yes, you're right: my ia64-centric vision got me again. Thanks for all
> the other patches that were posted. I hope this is now all cleared up?

Hmm.. I fixed things up, but I didn't exactly do it like the posted 
patches. 

Currently the BK tree
 - doesn't use __GFP_ZERO with anonymous user-mapped pages (which is what 
   you wrote this whole thing for ;)

   Potential fix: declare a per-architecture "alloc_user_highpage(vaddr)"
   that does the proper magic on virtually indexed machines, and on others 
   it just does a "alloc_page(GFP_HIGHUSER | __GFP_ZERO)".

 - verifies that nobody ever asks for a HIGHMEM allocation together with 
   __GFP_ZERO (nobody does - a quick grep shows that 99% of all uses are
   statically clearly fine (there's a few HIGHMEM zero-page users, but 
   they are all GFP_KERNEL or similar), with just two special cases:

	- get_zeroed_page() - which can't use HIGHMEM anyway
	- shm.c does "mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO"
	  and that's fine because while the mapping gfp masks may lack
	  GFP_FS and GFP_IO, they are always supposed to be ok with 
	  waiting.

 - moves "kernel_map_pages()" into "prep_new_page()" to fix the 
   DEBUG_PAGEALLOC issue (Chris Wright).

So that should take care of the known problems.

		Linus

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
  2005-01-10 18:13                     ` Linus Torvalds
@ 2005-01-10 20:17                       ` Christoph Lameter
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  1 sibling, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-10 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

On Mon, 10 Jan 2005, Linus Torvalds wrote:

> Currently the BK tree
>  - doesn't use __GFP_ZERO with anonymous user-mapped pages (which is what
>    you wrote this whole thing for ;)
>
>    Potential fix: declare a per-architecture "alloc_user_highpage(vaddr)"
>    that does the proper magic on virtually indexed machines, and on others
>    it just does a "alloc_page(GFP_HIGHUSER | __GFP_ZERO)".

The following patch adds an alloc_zeroed_user_highpage(vma, vaddr). It
also uses zeroed pages on COW. clear_user_highpage is now only used by
that function. Fold it into alloc_zeroed_user_highpage?

This is against the bitkeeper tree from an hour ago. mm/memory.o compiles
fine but I was not able to build an ia64 kernel due to some pieces that
seem to be missing in that tree.

Index: linus/include/asm-ia64/page.h
===================================================================
--- linus.orig/include/asm-ia64/page.h	2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-ia64/page.h	2005-01-10 12:05:55.000000000 -0800
@@ -75,6 +75,17 @@
 	flush_dcache_page(page);		\
 } while (0)

+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({						\
+	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+	if (page)				\
+		flush_dcache_page(page);	\
+	page;					\
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

 #ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linus/include/asm-h8300/page.h
===================================================================
--- linus.orig/include/asm-h8300/page.h	2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-h8300/page.h	2005-01-10 11:53:17.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/mm/memory.c
===================================================================
--- linus.orig/mm/memory.c	2005-01-10 11:44:39.000000000 -0800
+++ linus/mm/memory.c	2005-01-10 12:05:21.000000000 -0800
@@ -84,20 +84,6 @@
 EXPORT_SYMBOL(vmalloc_earlyreserve);

 /*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
-	if (from == ZERO_PAGE(address)) {
-		clear_user_highpage(to, address);
-		return;
-	}
-	copy_user_highpage(to, from, address);
-}
-
-/*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
@@ -1329,11 +1315,16 @@

 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-	if (!new_page)
-		goto no_new_page;
-	copy_cow_page(old_page,new_page,address);
-
+	if (old_page == ZERO_PAGE(address)) {
+		new_page = alloc_zeroed_user_highpage(vma, address);
+		if (!new_page)
+			goto no_new_page;
+	} else {
+		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!new_page)
+			goto no_new_page;
+		copy_user_highpage(new_page, old_page, address);
+	}
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1795,10 +1786,9 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_zeroed_user_highpage(vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linus/include/asm-m32r/page.h
===================================================================
--- linus.orig/include/asm-m32r/page.h	2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-m32r/page.h	2005-01-10 12:08:03.000000000 -0800
@@ -17,6 +17,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-alpha/page.h
===================================================================
--- linus.orig/include/asm-alpha/page.h	2004-10-20 12:04:57.000000000 -0700
+++ linus/include/asm-alpha/page.h	2005-01-10 11:54:37.000000000 -0800
@@ -18,6 +18,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

Index: linus/include/asm-m68knommu/page.h
===================================================================
--- linus.orig/include/asm-m68knommu/page.h	2005-01-10 09:53:05.000000000 -0800
+++ linus/include/asm-m68knommu/page.h	2005-01-10 11:54:27.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-cris/page.h
===================================================================
--- linus.orig/include/asm-cris/page.h	2004-10-20 12:04:57.000000000 -0700
+++ linus/include/asm-cris/page.h	2005-01-10 11:55:06.000000000 -0800
@@ -21,6 +21,9 @@
 #define clear_user_page(page, vaddr, pg)    clear_page(page)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/linux/highmem.h
===================================================================
--- linus.orig/include/linux/highmem.h	2005-01-06 12:58:48.000000000 -0800
+++ linus/include/linux/highmem.h	2005-01-10 12:08:56.000000000 -0800
@@ -42,6 +42,18 @@
 	smp_wmb();
 }

+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page* alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+	 unsigned long vaddr)
+{
+	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+	if (page)
+		clear_user_highpage(page, vaddr);
+	return page;
+}
+#endif
+
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
Index: linus/include/asm-i386/page.h
===================================================================
--- linus.orig/include/asm-i386/page.h	2005-01-06 12:58:47.000000000 -0800
+++ linus/include/asm-i386/page.h	2005-01-10 12:09:43.000000000 -0800
@@ -36,6 +36,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-x86_64/page.h
===================================================================
--- linus.orig/include/asm-x86_64/page.h	2005-01-06 12:58:48.000000000 -0800
+++ linus/include/asm-x86_64/page.h	2005-01-10 11:56:04.000000000 -0800
@@ -38,6 +38,8 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 /*
  * These are used to make use of C type-checking..
  */
Index: linus/include/asm-s390/page.h
===================================================================
--- linus.orig/include/asm-s390/page.h	2004-10-20 12:04:59.000000000 -0700
+++ linus/include/asm-s390/page.h	2005-01-10 11:56:33.000000000 -0800
@@ -106,6 +106,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /* Pure 2^n version of get_order */
 extern __inline__ int get_order(unsigned long size)
 {

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Prezeroing V4 [0/4]: Overview
  2005-01-10 18:13                     ` Linus Torvalds
  2005-01-10 20:17                       ` Christoph Lameter
@ 2005-01-10 23:53                       ` Christoph Lameter
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
                                           ` (3 more replies)
  1 sibling, 4 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

Changes from V3 to V4:
o Drop __GFP_ZERO patch since it's in Linus' tree. Include a new patch that
  allows archs that need special measures around zeroing of user pages during
  a page fault to maintain their special adaptations.
o Use zeroed pages during COW.
o Updates for clear_page for various platforms. Make clear_page an optional
  patch and fall back to a series of order-0 clear_page calls if the patch
  extending clear_page has not been applied.
o x86_64 asm code fixed up
o Port patches to 2.6.10-bk13 and make it fit the bitmapless buddy allocator

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running in
SMP systems with a high number of cpus. Single-thread performance shows
only minor increases, while the performance of multi-threaded
applications increases significantly.

The most expensive operation in the page fault handler is (apart from SMP
locking overhead) the zeroing of the page that is also done in the page fault
handler. This zeroing means that all cachelines of the faulted page (on Altix
that means all 128 cachelines of 128 bytes each) must be loaded and later
written back. This patch makes it possible to avoid loading all cachelines
if only a part of the cachelines of that page is needed immediately after
the fault. Doing so will only be effective for sparsely accessed memory,
which is typical for anonymous memory and pte maps. Prezeroed pages will
only be used for those purposes. Unzeroed pages will be used as usual for
file mapping, page caching etc etc.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many later order-0 pages being zeroed in one step (see
the sketch after this list).
For that purpose the existing clear_page function is extended and made to
take an additional argument specifying the order of the page to be cleared.

2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.
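
For illustration (a sketch using the extended clear_page interface from
[4/4]): zeroing an order-4 block then touches 16 pages in a single call,

	clear_page(page_address(page), 4);	/* 16 * PAGE_SIZE bytes */

instead of sixteen separate per-page invocations.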

The scrub daemon is invoked when an unzeroed page of a certain order has
been generated, so that it is worth running. If no higher order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay off the
processor as much as possible.
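
Roughly, the trigger works like this (a sketch with illustrative names;
the real code is in [2/4] and may differ in detail):

	/* after coalescing a free block of a given order */
	if (order >= sysctl_scrub_start)
		wakeup_kscrubd(zone);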

The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective
if the whole page is not immediately used after the page fault.

The patch is composed of 4 parts:

[1/4] GFP_ZERO fixups
	Adds alloc_zeroed_user_highpage(vma, vaddr) that may be customized for
	each arch by defining __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE. Includes
	proper definitions for a large selection of arches, others fall back to
	the default function in include/linux/highmem.h (and falls back to not
	using prezeroed pages).

[2/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background daemon
	called scrubd. scrubd is disabled by default but can be enabled
	by writing an order number to /proc/sys/vm/scrub_start. If a page
	of that order or higher is coalesced then the scrub daemon will
	start zeroing until all pages of order /proc/sys/vm/scrub_stop and
	higher are zeroed and then go back to sleep.

	In an SMP environment the scrub daemon is typically
	running on the most idle cpu. Thus a single-threaded application running
	on one cpu may have another cpu zeroing pages for it. The scrub
	daemon is hardly noticeable and usually finishes zeroing quickly since
	most processors are optimized for linear memory filling.

The following patches increase performance but may be omitted:


[3/4] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into hardware.
	With hardware support the impact of zeroing on the system is reduced
	to a minimum.

[4/4] Architecture specific clear_page updates
	Adds a second order argument to clear_page and updates all arches.
	This allows the zeroing of large areas of memory without repeatedly
	invoking clear_page() for the page allocator, scrubd and the huge
	page allocator.



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
@ 2005-01-10 23:54                         ` Christoph Lameter
  2005-01-11  0:41                           ` Chris Wright
  2005-01-10 23:55                         ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

This patch fixes the __GFP_ZERO related code by adding a new function
alloc_zeroed_user_highpage that is then used in the anonymous page fault
handler and in the COW code to allocate pages. The function can be defined
per arch to set up special processing for user pages by defining
__HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -75,6 +75,17 @@
 	flush_dcache_page(page);		\
 } while (0)

+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({						\
+	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+	if (page)				\
+		flush_dcache_page(page);	\
+	page;					\
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

 #ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-10 13:54:30.000000000 -0800
@@ -84,20 +84,6 @@
 EXPORT_SYMBOL(vmalloc_earlyreserve);

 /*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
-	if (from == ZERO_PAGE(address)) {
-		clear_user_highpage(to, address);
-		return;
-	}
-	copy_user_highpage(to, from, address);
-}
-
-/*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
@@ -1329,11 +1315,16 @@

 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-	if (!new_page)
-		goto no_new_page;
-	copy_cow_page(old_page,new_page,address);
-
+	if (old_page == ZERO_PAGE(address)) {
+		new_page = alloc_zeroed_user_highpage(vma, address);
+		if (!new_page)
+			goto no_new_page;
+	} else {
+		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!new_page)
+			goto no_new_page;
+		copy_user_highpage(new_page, old_page, address);
+	}
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1795,7 +1786,7 @@

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
+		page = alloc_zeroed_user_highpage(vma, addr);
 		if (!page)
 			goto no_mem;

Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -17,6 +17,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -18,6 +18,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -21,6 +21,9 @@
 #define clear_user_page(page, vaddr, pg)    clear_page(page)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-10 13:53:59.000000000 -0800
@@ -42,6 +42,18 @@
 	smp_wmb();
 }

+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page* alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+	 unsigned long vaddr)
+{
+	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+	if (page)
+		clear_user_highpage(page, vaddr);
+	return page;
+}
+#endif
+
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -36,6 +36,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -38,6 +38,8 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-10 13:53:59.000000000 -0800
@@ -106,6 +106,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /* Pure 2^n version of get_order */
 extern __inline__ int get_order(unsigned long size)
 {



* Prezeroing V4 [2/4]: Zeroing implementation
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
@ 2005-01-10 23:55                         ` Christoph Lameter
  2005-01-10 23:55                         ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
  2005-01-10 23:56                         ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
  3 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

o Add page zeroing (the free list encoding is sketched below)
o Add scrub daemon
o Add ability to view the amount of zeroed memory in /proc/meminfo
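
Zeroed and not yet zeroed pages are kept on separate free lists, and the
zeroing status is folded into the buddy order kept in page->private. A
minimal sketch of that encoding (mirroring set_page_zorder()/page_zorder()
in the hunks below; the ZORDER macro is only for illustration and is not
part of the patch):

	/*
	 * The buddy order occupies the low bits of page->private and
	 * the zeroed flag sits at bit 10. MAX_ORDER is far below 1024,
	 * so the two never overlap.
	 */
	#define ZORDER(order, zero)	((order) + ((zero) << 10))

	/* ZORDER(3, NOT_ZEROED) == 3, ZORDER(3, ZEROED) == 1027, so a
	   page and its buddy coalesce only if both the order and the
	   zeroing status match. */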

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-10 14:44:22.000000000 -0800
@@ -12,6 +12,7 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Support for page zeroing, Christoph Lameter, SGI, Dec 2004
  */

 #include <linux/config.h>
@@ -33,6 +34,7 @@
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -167,16 +169,16 @@
  * zone->lock is already acquired when we use these.
  * So, we don't need atomic page->flags operations here.
  */
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
 	return page->private;
 }

-static inline void set_page_order(struct page *page, int order) {
-	page->private = order;
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+	page->private = order + (zero << 10);
 	__SetPagePrivate(page);
 }

-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
 {
 	__ClearPagePrivate(page);
 	page->private = 0;
@@ -187,14 +189,15 @@
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
  * (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ *     zeroing status.
  * for recording page's order, we use page->private and PG_private.
  *
  */
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
 {
        if (PagePrivate(page)           &&
-           (page_order(page) == order) &&
+           (page_zorder(page) == order + (zero << 10)) &&
            !PageReserved(page)         &&
             page_count(page) == 0)
                return 1;
@@ -225,22 +228,20 @@
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
-		struct zone *zone, unsigned int order)
+static inline int __free_pages_bulk (struct page *page, struct page *base,
+		struct zone *zone, unsigned int order, int zero)
 {
 	unsigned long page_idx;
 	struct page *coalesced;
-	int order_size = 1 << order;

 	if (unlikely(order))
 		destroy_compound_page(page, order);

 	page_idx = page - base;

-	BUG_ON(page_idx & (order_size - 1));
+	BUG_ON(page_idx & ((1 << order) - 1));
 	BUG_ON(bad_range(zone, page));

-	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		struct free_area *area;
 		struct page *buddy;
@@ -250,20 +251,21 @@
 		buddy = base + buddy_idx;
 		if (bad_range(zone, buddy))
 			break;
-		if (!page_is_buddy(buddy, order))
+		if (!page_is_buddy(buddy, order, zero))
 			break;
 		/* Move the buddy up one level. */
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
+		area = zone->free_area[zero] + order;
 		area->nr_free--;
-		rmv_page_order(buddy);
+		rmv_page_zorder(buddy);
 		page_idx &= buddy_idx;
 		order++;
 	}
 	coalesced = base + page_idx;
-	set_page_order(coalesced, order);
-	list_add(&coalesced->lru, &zone->free_area[order].free_list);
-	zone->free_area[order].nr_free++;
+	set_page_zorder(coalesced, order, zero);
+	list_add(&coalesced->lru, &zone->free_area[zero][order].free_list);
+	zone->free_area[zero][order].nr_free++;
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -312,8 +314,11 @@
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, order);
+		if (__free_pages_bulk(page, base, zone, order, NOT_ZEROED)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
+		zone->free_pages += 1UL << order;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
@@ -341,6 +346,18 @@
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page, unsigned int order)
+{
+	unsigned long flags;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	__free_pages_bulk(page, zone->zone_mem_map, zone, order, ZEROED);
+	zone->zero_pages += 1UL << order;
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}

 /*
  * The order of subdivision here is critical for the IO subsystem.
@@ -358,7 +375,7 @@
  */
 static inline struct page *
 expand(struct zone *zone, struct page *page,
- 	int low, int high, struct free_area *area)
+ 	int low, int high, struct free_area *area, int zero)
 {
 	unsigned long size = 1 << high;

@@ -369,7 +386,7 @@
 		BUG_ON(bad_range(zone, &page[size]));
 		list_add(&page[size].lru, &area->free_list);
 		area->nr_free++;
-		set_page_order(&page[size], high);
+		set_page_zorder(&page[size], high, zero);
 	}
 	return page;
 }
@@ -419,23 +436,44 @@
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static inline void rmpage(struct page *page, struct free_area *area)
+{
+	list_del(&page->lru);
+	rmv_page_zorder(page);
+	area->nr_free--;
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+		rmpage(page, area);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
-	struct free_area * area;
+	struct free_area *area;
 	unsigned int current_order;
 	struct page *page;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
+		rmpage(page, zone->free_area[zero] + current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, order, current_order, area, zero);
 	}

 	return NULL;
@@ -447,7 +485,7 @@
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -456,7 +494,7 @@

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -503,7 +541,7 @@
 		ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

 	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+		list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
 			unsigned long start_pfn, i;

 			start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -595,7 +633,7 @@
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static inline void prep_zero_page(struct page *page, int order)
+void prep_zero_page(struct page *page, unsigned int order)
 {
 	int i;

@@ -608,7 +646,9 @@
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order == 0) {
 		struct per_cpu_pages *pcp;
@@ -617,7 +657,7 @@
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -629,16 +669,25 @@

 	if (page == NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
+		/*
+		 * If we failed to obtain a zeroed (or unzeroed) page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);
 		prep_new_page(page, order);

-		if (gfp_flags & __GFP_ZERO)
+		if ((gfp_flags & __GFP_ZERO) && !zero)
 			prep_zero_page(page, order);

 		if (order && (gfp_flags & __GFP_COMP))
@@ -667,7 +716,7 @@
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
+		free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free) << o;

 		/* Require fewer higher order pages to be free */
 		min >>= 1;
@@ -1045,7 +1094,7 @@
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -1053,27 +1102,31 @@
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n, o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1110,6 +1163,7 @@

 #define K(x) ((x) << (PAGE_SHIFT-10))

+static const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1122,6 +1176,7 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1142,10 +1197,10 @@

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1153,20 +1208,21 @@
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1215,7 +1271,7 @@

 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
-			nr = zone->free_area[order].nr_free;
+			nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
 		}
@@ -1515,8 +1571,10 @@
 {
 	int order;
 	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		zone->free_area[order].nr_free = 0;
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
+		zone->free_area[NOT_ZEROED][order].nr_free = 0;
+		zone->free_area[ZEROED][order].nr_free = 0;
 	}
 }

@@ -1541,6 +1599,7 @@

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);
 	pgdat->kswapd_max_order = 0;

 	for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1564,6 +1623,7 @@
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1597,6 +1657,13 @@
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1722,7 +1789,7 @@
 		spin_lock_irqsave(&zone->lock, flags);
 		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
 		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+			seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
Index: linux-2.6.10/include/linux/mmzone.h
===================================================================
--- linux-2.6.10.orig/include/linux/mmzone.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h	2005-01-10 13:54:50.000000000 -0800
@@ -51,7 +51,7 @@
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -107,10 +107,14 @@
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -266,6 +270,9 @@
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c	2005-01-10 13:48:10.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c	2005-01-10 13:54:50.000000000 -0800
@@ -123,12 +123,13 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -148,6 +149,7 @@
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -171,6 +173,7 @@
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
===================================================================
--- linux-2.6.10.orig/mm/readahead.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/readahead.c	2005-01-10 13:54:50.000000000 -0800
@@ -573,7 +573,8 @@
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.10/drivers/base/node.c
===================================================================
--- linux-2.6.10.orig/drivers/base/node.c	2005-01-10 13:48:08.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c	2005-01-10 13:54:50.000000000 -0800
@@ -42,13 +42,15 @@
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -58,6 +60,7 @@
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-01-10 13:54:50.000000000 -0800
@@ -731,6 +731,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
===================================================================
--- linux-2.6.10.orig/mm/Makefile	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/Makefile	2005-01-10 13:54:50.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
Index: linux-2.6.10/mm/scrubd.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c	2005-01-10 14:56:20.000000000 -0800
@@ -0,0 +1,134 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 5;	/* if a page of this order is coalesced then run kscrubd */
+unsigned int sysctl_scrub_stop = 2;	/* minimum order of page to zero */
+unsigned int sysctl_scrub_load = 999;	/* do not run scrubd if the load is above this */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for (order = MAX_ORDER - 1; order >= (int) sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area);
+			struct list_head *l;
+			int size = PAGE_SIZE << order;
+
+			if (!page)
+				continue;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+
+				if (driver->start(page_address(page), size) == 0)
+					goto done;
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			prep_zero_page(page, order);
+done:
+			end_zero_page(page, order);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd =
+			find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h	2005-01-10 14:34:25.000000000 -0800
@@ -0,0 +1,49 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for the scrubbing of memory, including an interface
+ * for drivers that can zero memory without invalidating
+ * the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+	int (*start)(void *, unsigned long);	/* Start bzero transfer */
+	struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+	if (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT))
+		return;
+	if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+		return;
+	wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page, unsigned int order);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-01-10 13:54:50.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -827,6 +828,33 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_LOAD,
+		.procname	= "scrub_load",
+		.data		= &sysctl_scrub_load,
+		.maxlen		= sizeof(sysctl_scrub_load),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-01-10 13:54:50.000000000 -0800
@@ -169,6 +169,9 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_SCRUB_START=30,	/* order of coalesced page that triggers scrubd */
+	VM_SCRUB_STOP=31,	/* minimum order of page for scrubd to zero */
+	VM_SCRUB_LOAD=32,	/* load factor above which not to scrub anymore */
 };


Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h	2005-01-10 13:54:50.000000000 -0800
@@ -132,4 +132,5 @@

 void page_alloc_init(void);

+void prep_zero_page(struct page *, unsigned int order);
 #endif /* __LINUX_GFP_H */



* Prezeroing V4 [3/4]: Altix SN2 BTE zero driver
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
  2005-01-10 23:55                         ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
@ 2005-01-10 23:55                         ` Christoph Lameter
  2005-01-10 23:56                         ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
  3 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development

o Zeroing driver implemented with the Block Transfer Engine in the Altix
  SN2 SHub.
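
A driver hooks itself into the scrub daemon through the zero_driver
interface from <linux/scrub.h> (patch 2/4). A minimal sketch of such a
driver (the my_* names are illustrative and not part of this patch;
start() must only return 0 once the memory is actually zeroed, since
kscrubd immediately puts the page back on the ZEROED free list):

	static int my_start_bzero(void *p, unsigned long len)
	{
		/* Kick off the engine on len bytes at p and wait for
		 * completion; return nonzero to make kscrubd fall back
		 * to zeroing with the cpu instead.
		 */
		return 0;
	}

	static struct zero_driver my_bzero = {
		.start = my_start_bzero,
	};

	void my_bzero_init(void)
	{
		register_zero_driver(&my_bzero);
	}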

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/bte.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/bte.c	2005-01-10 13:54:52.000000000 -0800
@@ -4,6 +4,8 @@
  * for more details.
  *
  * Copyright (c) 2000-2003 Silicon Graphics, Inc.  All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
  */

 #include <linux/config.h>
@@ -20,6 +22,8 @@
 #include <linux/bootmem.h>
 #include <linux/string.h>
 #include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>

 #include <asm/sn/bte.h>

@@ -30,7 +34,7 @@
 /* two interfaces on two btes */
 #define MAX_INTERFACES_TO_TRY		4

-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
 {
 	nodepda_t *tmp_nodepda;

@@ -132,7 +136,6 @@
 			if (bte == NULL) {
 				continue;
 			}
-
 			if (spin_trylock(&bte->spinlock)) {
 				if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
 				    (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
 		}
 	} while (1);

-	if (notification == NULL) {
+	if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
 		/* User does not want to be notified. */
 		bte->most_rcnt_na = &bte->notify;
 	} else {
@@ -192,6 +195,8 @@

 	itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);

+	if (mode & BTE_NOTIFY_AND_GET_POINTER)
+		 *(u64 volatile **)(notification) = &bte->notify;
 	spin_unlock_irqrestore(&bte->spinlock, irq_flags);

 	if (notification != NULL) {
@@ -449,5 +454,47 @@
 		mynodepda->bte_if[i].cleanup_active = 0;
 		mynodepda->bte_if[i].bh_error = 0;
 	}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+#define ZERO_RATE_PER_SEC 500000000
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+	int rc;
+	int ticks;
+	int node = get_nasid();
+
+	/* Check limitations.
+	 * 1. System must be running (weird things happen during bootup)
+	 * 2. 60000 <= len < BTE_MAX_XFER; smaller requests cause too much bte traffic
+	 */
+	if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+		return EINVAL;
+
+	rc = bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+	if (rc)
+		return rc;
+
+	ticks = (len*HZ)/ZERO_RATE_PER_SEC;
+	if (ticks) {
+		/* Wait the minimum time of the transfer */
+		current->state = TASK_INTERRUPTIBLE;
+		schedule_timeout(ticks);
+	}
+	while (*(bte_zero_notify[node]) == BTE_WORD_BUSY) {
+		/* Then keep on checking until transfer is complete */
+		cpu_relax();
+		schedule();
+	}
+	return 0;
+}
+
+static struct zero_driver bte_bzero = {
+	.start = bte_start_bzero,
+};

+void sn_bte_bzero_init(void) {
+	register_zero_driver(&bte_bzero);
 }
Index: linux-2.6.10/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/setup.c	2005-01-10 13:48:08.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/setup.c	2005-01-10 13:54:52.000000000 -0800
@@ -244,6 +244,7 @@
 	int pxm;
 	int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
 	extern void sn_cpu_init(void);
+	extern void sn_bte_bzero_init(void);

 	/*
 	 * If the generic code has enabled vga console support - lets
@@ -334,6 +335,7 @@
 	screen_info = sn_screen_info;

 	sn_timer_init();
+	sn_bte_bzero_init();
 }

 /**
Index: linux-2.6.10/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/sn/bte.h	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/sn/bte.h	2005-01-10 13:54:52.000000000 -0800
@@ -48,6 +48,8 @@
 #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
 /* Use a reserved bit to let the caller specify a wait for any BTE */
 #define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
 /* Use the BTE on the node with the destination memory */
 #define BTE_USE_DEST (BTE_WACQUIRE << 1)
 /* Use any available BTE interface on any node for the transfer */



* Prezeroing V4 [4/4]: Extend clear_page to take an order parameter
  2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
                                           ` (2 preceding siblings ...)
  2005-01-10 23:55                         ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
@ 2005-01-10 23:56                         ` Christoph Lameter
  3 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
	linux-mm, Linux Kernel Development


- Extend clear_page to take an order parameter.

Architecture support:
---------------------

Known to work:

ia64
i386
x86_64
sparc64
m68k

Trivial modifications that are expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modifications made, but feedback from the arch maintainers would be welcome:

s390
alpha
sh
mips
m32r
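
For architectures whose low-level primitive clears a single page, the
conversion follows one pattern: rename the old entry point to
_clear_page() and wrap it in an inline loop over the 1 << order pages.
This is the shape of the sh, mips, m32r and alpha hunks below:

	extern void _clear_page(void *page);

	static inline void clear_page(void *page, int order)
	{
		unsigned int nr = 1 << order;

		while (nr-- > 0) {
			_clear_page(page);
			page += PAGE_SIZE;
		}
	}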

Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-10 14:23:21.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page), order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-10 14:23:22.000000000 -0800
@@ -128,7 +128,7 @@
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-10 14:23:22.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of order size
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -1,12 +1,16 @@
 /*
  * Zero a page.
  * rdi	page
+ * rsi	order
  */
 	.globl clear_page
 	.p2align 4
 clear_page:
+	movl   $4096/64,%eax
+	movl	%esi, %ecx
+	shll	%cl, %eax
+	movl	%eax, %ecx
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -41,7 +45,10 @@

 	.section .altinstr_replacement,"ax"
 clear_page_c:
-	movl $4096/8,%ecx
+	movl $4096/8,%eax
+	movl %esi, %ecx
+	shll %cl, %eax
+	movl %eax, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-10 14:23:22.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -50,12 +50,21 @@
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+		sh64_page_clear(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -128,7 +128,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, unsigned int order)
 {
 	unsigned long lines, line_size;

 	line_size = ppc64_caches.dline_size;
-	lines = ppc64_caches.dlines_per_page;
+	lines = ppc64_caches.dlines_per_page << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-10 14:23:22.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -50,7 +50,7 @@
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 #define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-10 14:23:22.000000000 -0800
@@ -47,7 +47,7 @@
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-10 14:23:22.000000000 -0800
@@ -51,7 +51,7 @@
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -85,7 +85,7 @@

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-10 14:23:22.000000000 -0800
@@ -88,7 +88,7 @@
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-10 14:23:22.000000000 -0800
@@ -57,7 +57,7 @@
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-10 14:23:22.000000000 -0800
@@ -78,7 +78,7 @@
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-10 14:23:22.000000000 -0800
@@ -27,7 +27,7 @@
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-10 14:23:22.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-10 14:23:22.000000000 -0800
@@ -102,7 +102,7 @@
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -25,7 +25,7 @@
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-10 14:23:22.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-10 14:23:22.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0=dest */
+_clear_page:		/* %o0=dest, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8
Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-10 14:23:22.000000000 -0800
@@ -657,7 +657,7 @@
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-10 14:23:22.000000000 -0800
@@ -56,7 +56,7 @@
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.10/fs/afs/file.c
===================================================================
--- linux-2.6.10.orig/fs/afs/file.c	2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c	2005-01-10 14:23:22.000000000 -0800
@@ -172,7 +172,7 @@
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.10/fs/ntfs/compress.c
===================================================================
--- linux-2.6.10.orig/fs/ntfs/compress.c	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/fs/ntfs/compress.c	2005-01-10 14:23:22.000000000 -0800
@@ -107,7 +107,7 @@
 		 * FIXME: Using clear_page() will become wrong when we get
 		 * PAGE_CACHE_SIZE != PAGE_SIZE but for now there is no problem.
 		 */
-		clear_page(kp);
+		clear_page(kp, 0);
 		return;
 	}
 	kp_ofs = ni->initialized_size & ~PAGE_CACHE_MASK;
@@ -742,7 +742,7 @@
 				 * for now there is no problem.
 				 */
 				if (likely(!cur_ofs))
-					clear_page(page_address(page));
+					clear_page(page_address(page), 0);
 				else
 					memset(page_address(page) + cur_ofs, 0,
 							PAGE_CACHE_SIZE -
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-10 14:21:06.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-10 14:23:22.000000000 -0800
@@ -639,6 +639,10 @@
 {
 	int i;

+	if (!PageHighMem(page)) {
+		clear_page(page_address(page), order);
+		return;
+	}
 	for(i = 0; i < (1 << order); i++)
 		clear_highpage(page + i);
 }
Index: linux-2.6.10/mm/hugetlb.c
===================================================================
--- linux-2.6.10.orig/mm/hugetlb.c	2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c	2005-01-10 14:23:22.000000000 -0800
@@ -89,8 +89,7 @@
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	clear_page(page_address(page), HUGETLB_PAGE_ORDER);
 	return page;
 }



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
  2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
@ 2005-01-11  0:41                           ` Chris Wright
  2005-01-11  0:46                             ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: Chris Wright @ 2005-01-11  0:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David S. Miller,
	linux-ia64, linux-mm, Linux Kernel Development

* Christoph Lameter (clameter@sgi.com) wrote:
> @@ -1795,7 +1786,7 @@
> 
>  		if (unlikely(anon_vma_prepare(vma)))
>  			goto no_mem;
> -		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
> +		page = alloc_zeroed_user_highpage(vma, addr);

Oops, HIGHZERO is gone already in Linus' tree.

thanks,
-chris
-- 
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
  2005-01-11  0:41                           ` Chris Wright
@ 2005-01-11  0:46                             ` Christoph Lameter
  2005-01-11  0:49                               ` Chris Wright
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-11  0:46 UTC (permalink / raw)
  To: Chris Wright
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David S. Miller,
	linux-ia64, linux-mm, Linux Kernel Development

On Mon, 10 Jan 2005, Chris Wright wrote:

> * Christoph Lameter (clameter@sgi.com) wrote:
> > @@ -1795,7 +1786,7 @@
> >
> >  		if (unlikely(anon_vma_prepare(vma)))
> >  			goto no_mem;
> > -		page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
> > +		page = alloc_zeroed_user_highpage(vma, addr);
>
> Oops, HIGHZERO is gone already in Linus' tree.

Use bk13 as I indicated.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
  2005-01-11  0:46                             ` Christoph Lameter
@ 2005-01-11  0:49                               ` Chris Wright
  0 siblings, 0 replies; 87+ messages in thread
From: Chris Wright @ 2005-01-11  0:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Wright, Linus Torvalds, Hugh Dickins, Andrew Morton,
	David S. Miller, linux-ia64, linux-mm, Linux Kernel Development

* Christoph Lameter (clameter@sgi.com) wrote:
> Use bk13 as I indicated.

Ah, so you did, thanks ;-)
-chris
--
Linux Security Modules     http://lsm.immunix.org     http://lsm.bkbits.net

^ permalink raw reply	[flat|nested] 87+ messages in thread

* alloc_zeroed_user_highpage to fix the clear_user_highpage issue
  2005-01-08 21:56                   ` David S. Miller
@ 2005-01-21 20:09                     ` Christoph Lameter
  2005-02-09  9:58                       ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
  2005-01-21 20:15                     ` A scrub daemon (prezeroing) Christoph Lameter
  2 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:09 UTC (permalink / raw)
  To: David S. Miller
  Cc: Hugh Dickins, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

This patch adds a new function, alloc_zeroed_user_highpage, which is then used in the
anonymous page fault handler and in the COW code to allocate zeroed pages. The function
can be overridden per arch to set up special processing for user pages by defining
__HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE. For arches that do not need to do anything
special for user pages, alloc_zeroed_user_highpage is defined to simply do

	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
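
For illustration, this is roughly what a per-arch override looks like (a sketch only:
"asm-foo" and the dcache flush are placeholders, not code from this patch; the ia64
hunk below is the real example, and it relies on the later NULL-page fix):

	/* asm-foo/page.h: opt in to arch-specific zeroed user pages */
	#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
	#define alloc_zeroed_user_highpage(vma, vaddr)			\
	({								\
		struct page *page =					\
			alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO,	\
				       vma, vaddr);			\
		if (page)						\
			flush_dcache_page(page);  /* arch post-processing */ \
		page;							\
	})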

Patch against 2.6.11-rc1-bk9

This patch needs to update a number of arches. I wish there were a better way
to do this.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-21 10:44:27.000000000 -0800
@@ -42,6 +42,17 @@ static inline void clear_user_highpage(s
 	smp_wmb();
 }

+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page* alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+	 unsigned long vaddr)
+{
+	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+	clear_user_highpage(page, vaddr);
+	return page;
+}
+#endif
+
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/memory.c	2005-01-21 11:10:42.000000000 -0800
@@ -84,20 +84,6 @@ EXPORT_SYMBOL(high_memory);
 EXPORT_SYMBOL(vmalloc_earlyreserve);

 /*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
-	if (from == ZERO_PAGE(address)) {
-		clear_user_highpage(to, address);
-		return;
-	}
-	copy_user_highpage(to, from, address);
-}
-
-/*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
@@ -1329,11 +1315,16 @@ static int do_wp_page(struct mm_struct *

 	if (unlikely(anon_vma_prepare(vma)))
 		goto no_new_page;
-	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-	if (!new_page)
-		goto no_new_page;
-	copy_cow_page(old_page,new_page,address);
-
+	if (old_page == ZERO_PAGE(address)) {
+		new_page = alloc_zeroed_user_highpage(vma, address);
+		if (!new_page)
+			goto no_new_page;
+	} else {
+		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		if (!new_page)
+			goto no_new_page;
+		copy_user_highpage(new_page, old_page, address);
+	}
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -1795,10 +1786,9 @@ do_anonymous_page(struct mm_struct *mm,

 		if (unlikely(anon_vma_prepare(vma)))
 			goto no_mem;
-		page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+		page = alloc_zeroed_user_highpage(vma, addr);
 		if (!page)
 			goto no_mem;
-		clear_user_highpage(page, addr);

 		spin_lock(&mm->page_table_lock);
 		page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -75,6 +75,16 @@ do {						\
 	flush_dcache_page(page);		\
 } while (0)

+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({						\
+	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+	flush_dcache_page(page);		\
+	page;					\
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

 #ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -36,6 +36,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -38,6 +38,8 @@ void copy_page(void *, void *);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -17,6 +17,9 @@ extern void copy_page(void *to, void *fr
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -18,6 +18,9 @@
 extern void clear_page(void *page);
 #define clear_user_page(page, vaddr, pg)	clear_page(page)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -21,6 +21,9 @@
 #define clear_user_page(page, vaddr, pg)    clear_page(page)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -106,6 +106,9 @@ static inline void copy_page(void *to, v
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /* Pure 2^n version of get_order */
 extern __inline__ int get_order(unsigned long size)
 {
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-21 10:44:27.000000000 -0800
@@ -30,6 +30,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
 /*
  * These are used to make use of C type-checking..
  */


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Extend clear_page by an order parameter
  2005-01-08 21:56                   ` David S. Miller
  2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
@ 2005-01-21 20:12                     ` Christoph Lameter
  2005-01-21 22:29                       ` Paul Mackerras
  2005-01-23  7:45                       ` Andrew Morton
  2005-01-21 20:15                     ` A scrub daemon (prezeroing) Christoph Lameter
  2 siblings, 2 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:12 UTC (permalink / raw)
  To: David S. Miller
  Cc: Hugh Dickins, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

The zeroing of a page of arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
clear_page that is capable of zeroing multiple pages at once (and scrubd
too, but that is now an independent patch). The following patch extends
clear_page with a second parameter specifying the order of the page to be zeroed, allowing
pages to be zeroed efficiently. Hope I caught everything.
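
To illustrate the new calling convention, here is the generic memset-based fallback
as it appears for several arches in the patch, together with typical call sites
(a sketch of usage, not new code):

	/* Zero a physically contiguous block of 2^order pages. */
	#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))

	clear_page(addr, 0);			/* old clear_page(addr) behavior */
	clear_page(page_address(page), order);	/* zero a higher-order page in one go */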

Patch against 2.6.11-rc1-bk9

Architecture support:
---------------------

Known to work:

ia64
i386
x86_64
sparc64
m68k

Trivial modification expected to simply work:

arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um

Modification made but it would be good to have some feedback from the arch maintainers:

s390
alpha
sh
mips
m32r

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-21 11:51:39.000000000 -0800
@@ -591,11 +591,16 @@ void fastcall free_cold_page(struct page
 	free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
 {
 	int i;

 	BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+	if (!PageHighMem(page)) {
+		clear_page(page_address(page), order);
+		return;
+	}
+
 	for(i = 0; i < (1 << order); i++)
 		clear_highpage(page + i);
 }
Index: linux-2.6.10/mm/hugetlb.c
===================================================================
--- linux-2.6.10.orig/mm/hugetlb.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c	2005-01-21 11:51:39.000000000 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
 struct page *alloc_huge_page(void)
 {
 	struct page *page;
-	int i;

 	spin_lock(&hugetlb_lock);
 	page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
 	spin_unlock(&hugetlb_lock);
 	set_page_count(page, 1);
 	page[1].mapping = (void *)free_huge_page;
-	for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
-		clear_highpage(&page[i]);
+	prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER);
 	return page;
 }

Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h	2005-01-21 11:51:39.000000000 -0800
@@ -45,7 +45,7 @@ static inline void clear_user_highpage(s
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
-	clear_page(kaddr);
+	clear_page(kaddr, 0);
 	kunmap_atomic(kaddr, KM_USER0);
 }

Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c	2005-01-21 11:51:39.000000000 -0800
@@ -657,7 +657,7 @@ tc35815_init_queues(struct net_device *d
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
 	} else {
-		clear_page(lp->fd_buf);
+		clear_page(lp->fd_buf, 0);
 #ifdef __mips__
 		dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
 #endif
Index: linux-2.6.10/fs/afs/file.c
===================================================================
--- linux-2.6.10.orig/fs/afs/file.c	2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c	2005-01-21 11:51:39.000000000 -0800
@@ -172,7 +172,7 @@ static int afs_file_readpage(struct file
 				      (size_t) PAGE_SIZE);
 		desc.buffer	= kmap(page);

-		clear_page(desc.buffer);
+		clear_page(desc.buffer, 0);

 		/* read the contents of the file from the server into the
 		 * page */
Index: linux-2.6.10/fs/ntfs/compress.c
===================================================================
--- linux-2.6.10.orig/fs/ntfs/compress.c	2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/fs/ntfs/compress.c	2005-01-21 11:51:39.000000000 -0800
@@ -107,7 +107,7 @@ static void zero_partial_compressed_page
 		 * FIXME: Using clear_page() will become wrong when we get
 		 * PAGE_CACHE_SIZE != PAGE_SIZE but for now there is no problem.
 		 */
-		clear_page(kp);
+		clear_page(kp, 0);
 		return;
 	}
 	kp_ofs = ni->initialized_size & ~PAGE_CACHE_MASK;
@@ -742,7 +742,7 @@ lock_retry_remap:
 				 * for now there is no problem.
 				 */
 				if (likely(!cur_ofs))
-					clear_page(page_address(page));
+					clear_page(page_address(page), 0);
 				else
 					memset(page_address(page) + cur_ofs, 0,
 							PAGE_CACHE_SIZE -
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h	2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -56,7 +56,7 @@
 # ifdef __KERNEL__
 #  define STRICT_MM_TYPECHECKS

-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
 extern void copy_page (void *to, void *from);

 /*
@@ -65,7 +65,7 @@ extern void copy_page (void *to, void *f
  */
 #define clear_user_page(addr, vaddr, page)	\
 do {						\
-	clear_page(addr);			\
+	clear_page(addr, 0);			\
 	flush_dcache_page(page);		\
 } while (0)

Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -7,6 +7,7 @@
  * 1/06/01 davidm	Tuned for Itanium.
  * 2/12/02 kchen	Tuned for both Itanium and McKinley
  * 3/08/02 davidm	Some more tweaking
+ * 12/10/04 clameter	Make it work on pages of arbitrary order
  */
 #include <linux/config.h>

@@ -29,27 +30,33 @@
 #define dst4		r11

 #define dst_last	r31
+#define totsize		r14

 GLOBAL_ENTRY(clear_page)
 	.prologue
-	.regstk 1,0,0,0
-	mov r16 = PAGE_SIZE/L3_LINE_SIZE-1	// main loop count, -1=repeat/until
+	.regstk 2,0,0,0
+	mov r16 = PAGE_SIZE/L3_LINE_SIZE	// main loop count
+	mov totsize = PAGE_SIZE
 	.save ar.lc, saved_lc
 	mov saved_lc = ar.lc
-
+	;;
 	.body
+	adds dst1 = 16, in0
 	mov ar.lc = (PREFETCH_LINES - 1)
 	mov dst_fetch = in0
-	adds dst1 = 16, in0
 	adds dst2 = 32, in0
+	shl r16 = r16, in1
+	shl totsize = totsize, in1
 	;;
 .fetch:	stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
 	adds dst3 = 48, in0		// executing this multiple times is harmless
 	br.cloop.sptk.few .fetch
+	add r16 = -1,r16
+	add dst_last = totsize, dst_fetch
+	adds dst4 = 64, in0
 	;;
-	addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
 	mov ar.lc = r16			// one L3 line per iteration
-	adds dst4 = 64, in0
+	adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
 	;;
 #ifdef CONFIG_ITANIUM
 	// Optimized for Itanium
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -18,7 +18,7 @@

 #include <asm/mmx.h>

-#define clear_page(page)	mmx_clear_page((void *)(page))
+#define clear_page(page, order)	mmx_clear_page((void *)(page),order)
 #define copy_page(to,from)	mmx_copy_page(to,from)

 #else
@@ -28,12 +28,12 @@
  *	Maybe the K6-III ?
  */

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)

 #endif

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h	2005-01-21 11:51:39.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c	2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c	2005-01-21 11:51:39.000000000 -0800
@@ -128,7 +128,7 @@ void *_mmx_memcpy(void *to, const void *
  *	other MMX using processors do not.
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -138,7 +138,7 @@ static void fast_clear_page(void *page)
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/64;i++)
+	for(i=0;i<((4096/64) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@ static void fast_copy_page(void *to, voi
  *	Generic MMX implementation without K7 specific streaming
  */

-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
 {
 	int i;

@@ -267,7 +267,7 @@ static void fast_clear_page(void *page)
 		"  pxor %%mm0, %%mm0\n" : :
 	);

-	for(i=0;i<4096/128;i++)
+	for(i=0;i<((4096/128) << order);i++)
 	{
 		__asm__ __volatile__ (
 		"  movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@ static void fast_copy_page(void *to, voi
  *	Favour MMX for page clear and copy.
  */

-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
 {
 	int d0, d1;
 	__asm__ __volatile__( \
 		"cld\n\t" \
 		"rep ; stosl" \
 		: "=&c" (d0), "=&D" (d1)
-		:"a" (0),"1" (page),"0" (1024)
+		:"a" (0),"1" (page),"0" (1024 << order)
 		:"memory");
 }
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
+void mmx_clear_page(void * page, int order)
 {
 	if(unlikely(in_interrupt()))
-		slow_zero_page(page);
+		slow_clear_page(page, order);
 	else
-		fast_clear_page(page);
+		fast_clear_page(page, order);
 }

 static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -32,10 +32,10 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-void clear_page(void *);
+void clear_page(void *, int);
 void copy_page(void *, void *);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h	2005-01-21 11:51:39.000000000 -0800
@@ -8,7 +8,7 @@
 #include <linux/types.h>

 extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
 extern void mmx_copy_page(void *to, void *from);

 #endif
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -1,12 +1,16 @@
 /*
  * Zero a page.
  * rdi	page
+ * rsi	order
  */
 	.globl clear_page
 	.p2align 4
 clear_page:
+	movl   $4096/64,%eax
+	movl	%esi, %ecx
+	shll	%cl, %eax
+	movl	%eax, %ecx
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -41,7 +45,10 @@ clear_page_end:

 	.section .altinstr_replacement,"ax"
 clear_page_c:
-	movl $4096/8,%ecx
+	movl $4096/8,%eax
+	movl %esi, %ecx
+	shll %cl, %eax
+	movl %eax, %ecx
 	xorl %eax,%eax
 	rep
 	stosq
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -28,10 +28,10 @@

 #ifndef __ASSEMBLY__

-#define clear_page(page)	 memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	 memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from) 	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		sparc_flush_page_to_ram(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -22,12 +22,12 @@

 #ifndef __s390x__

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	register_pair rp;

 	rp.subreg.even = (unsigned long) page;
-	rp.subreg.odd = (unsigned long) 4096;
+	rp.subreg.odd = (unsigned long) 4096 << order;
         asm volatile ("   slr  1,1\n"
 		      "   mvcl %0,0"
 		      : "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@ static inline void copy_page(void *to, v

 #else /* __s390x__ */

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
-        asm volatile ("   lgr  2,%0\n"
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+        	asm volatile ("   lgr  2,%0\n"
                       "   lghi 3,4096\n"
                       "   slgr 1,1\n"
                       "   mvcl 2,0"
                       : : "a" ((void *) (page))
 		      : "memory", "cc", "1", "2", "3" );
+		page += PAGE_SIZE;
+	}
 }

 static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@ static inline void copy_page(void *to, v

 #endif /* __s390x__ */

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /* Pure 2^n version of get_order */
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h	2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -36,12 +36,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
 extern void (*copy_page)(void *to, void *from);

 extern void clear_page_slow(void *to);
 extern void copy_page_slow(void *to, void *from);

+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
 struct page;
 extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@ extern void copy_user_page(void *to, voi
 extern void __clear_user_page(void *to, void *orig_to);
 extern void __copy_user_page(void *to, void *from, void *orig_to);
 #elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #elif defined(CONFIG_CPU_SH4)
 struct page;
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -6,11 +6,10 @@

 	.text
 	.align 4
-	.global clear_page
-	.ent clear_page
-clear_page:
+	.global _clear_page
+	.ent _clear_page
+_clear_page:
 	.prologue 0
-
 	lda	$0,128
 	nop
 	unop
@@ -36,4 +35,4 @@ clear_page:
 	unop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h	2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -50,12 +50,21 @@ extern struct page *mem_map;
 extern void sh64_page_clear(void *page);
 extern void sh64_page_copy(void *from, void *to);

-#define clear_page(page)               sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr-- > 0) {
+		sh64_page_clear(page);
+		page += PAGE_SIZE;
+	}
+}
+
 #define copy_page(to,from)             sh64_page_copy(from, to)

 #if defined(CONFIG_DCACHE_DISABLED)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 #else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h	2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h	2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -128,7 +128,7 @@ extern void __cpu_copy_user_page(void *t
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 extern void copy_page(void *to, const void *from);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -102,12 +102,12 @@
 #define REGION_MASK   (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
 #define REGION_STRIDE (1UL << REGION_SHIFT)

-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, unsigned int order)
 {
 	unsigned long lines, line_size;

 	line_size = ppc64_caches.dline_size;
-	lines = ppc64_caches.dlines_per_page;
+	lines = ppc64_caches.dlines_per_page << order;

 	__asm__ __volatile__(
 	"mtctr  	%1	# clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -11,10 +11,22 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+
 extern void copy_page(void *to, void *from);

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -15,8 +15,20 @@

 #define STRICT_MM_TYPECHECKS

-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+
+	while (nr--)
+	{
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)

 extern void copy_page(void * _to, void * _from);
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c	2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c	2005-01-21 11:51:39.000000000 -0800
@@ -42,7 +42,7 @@
 #ifdef CONFIG_SIBYTE_DMA_PAGEOPS
 static inline void clear_page_cpu(void *page)
 #else
-void clear_page(void *page)
+void _clear_page(void *page)
 #endif
 {
 	unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@ void sb1_dma_init(void)
 		     IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
 }

-void clear_page(void *page)
+void _clear_page(void *page)
 {
 	int cpu = smp_processor_id();

 	/* if the page is above Kseg0, use old way */
 	if (KSEGX(page) != CAC_BASE)
 		return clear_page_cpu(page);
-
 	page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
 	page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
 	__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@ void copy_page(void *to, void *from)

 #endif

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
 EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h	2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -50,7 +50,7 @@ static inline void copy_page(void *to, v
 		       );
 }

-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
 {
 	unsigned long tmp;
 	unsigned long *sp = page;
@@ -69,16 +69,16 @@ static inline void clear_page(void *page
 			     "dbra   %1,1b\n\t"
 			     : "=a" (sp), "=d" (tmp)
 			     : "a" (page), "0" (sp),
-			       "1" ((PAGE_SIZE - 16) / 16 - 1));
+			       "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
 }

 #else
-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 #endif

 #define clear_user_page(addr, vaddr, page)	\
-	do {	clear_page(addr);		\
+	do {	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -39,7 +39,18 @@
 #ifdef __KERNEL__
 #ifndef __ASSEMBLY__

-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+	unsigned int nr = 1 << order;
+
+	while (nr-- > 0) {
+		_clear_page(page);
+		page += PAGE_SIZE;
+	}
+}
+
 extern void copy_page(void * to, void * from);

 extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@ static inline void clear_user_page(void
 {
 	extern void (*flush_data_cache_page)(unsigned long addr);

-	clear_page(addr);
+	clear_page(addr, 0);
 	if (pages_do_alias((unsigned long) addr, vaddr))
 		flush_data_cache_page((unsigned long)addr);
 }
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -24,10 +24,10 @@
 #define get_user_page(vaddr)		__get_free_page(GFP_KERNEL)
 #define free_user_page(page, addr)	free_page(addr)

-#define clear_page(page)	memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)	clear_page(page)
+#define clear_user_page(page, vaddr, pg)	clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h	2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -15,10 +15,10 @@

 #ifdef __KERNEL__

-#define clear_page(page)        memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      memcpy((void *)(to), (void *)(from), PAGE_SIZE)

-#define clear_user_page(page, vaddr, pg)    clear_page(page)
+#define clear_user_page(page, vaddr, pg)    clear_page(page, 0)
 #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

 /*
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -37,11 +37,11 @@

 #define STRICT_MM_TYPECHECKS

-#define clear_page(page)	memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset ((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to, from)	memcpy ((void *)(to), (void *)from, PAGE_SIZE)

 #define clear_user_page(addr, vaddr, page)	\
-	do { 	clear_page(addr);		\
+	do { 	clear_page(addr, 0);		\
 		flush_dcache_page(page);	\
 	} while (0)
 #define copy_user_page(to, from, vaddr, page)	\
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -13,7 +13,7 @@
 #include <asm/types.h>
 #include <asm/cache.h>

-#define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order)	memset((void *)(page), 0, PAGE_SIZE << (order))
 #define copy_page(to,from)      copy_user_page_asm((void *)(to), (void *)(from))

 struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c	2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c	2005-01-21 11:51:39.000000000 -0800
@@ -47,7 +47,7 @@ void v6_copy_user_page_nonaliasing(void
  */
 void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
 {
-	clear_page(kaddr);
+	_clear_page(kaddr);
 }

 /*
@@ -116,7 +116,7 @@ void v6_clear_user_page_aliasing(void *k

 	set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
 	flush_tlb_kernel_page(to);
-	clear_page((void *)to);
+	_clear_page((void *)to);

 	spin_unlock(&v6_lock);
 }
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S	2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S	2005-01-21 11:51:39.000000000 -0800
@@ -51,7 +51,7 @@ copy_page:
 	jmp	r14

 	.text
-	.global	clear_page
+	.global	_clear_page
 	/*
 	 * clear_page (to)
 	 *
@@ -60,7 +60,7 @@ copy_page:
 	 * 16 * 256
 	 */
 	.align	4
-clear_page:
+_clear_page:
 	ldi	r2, #255
 	ldi	r4, #0
 	ld	r3, @r0		/* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -85,7 +85,7 @@ typedef unsigned long pgprot_t;

 struct page;
 extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define  clear_page clear_pages
 extern void copy_page(void *to, void *from);
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c	2005-01-21 11:51:39.000000000 -0800
@@ -88,7 +88,7 @@ EXPORT_SYMBOL(__memset);
 EXPORT_SYMBOL(__memsetw);
 EXPORT_SYMBOL(__constant_c_memset);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(__direct_map_base);
 EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -6,9 +6,9 @@

         .text
         .align 4
-        .global clear_page
-        .ent clear_page
-clear_page:
+        .global _clear_page
+        .ent _clear_page
+_clear_page:
         .prologue 0

 	lda	$0,128
@@ -51,4 +51,4 @@ clear_page:
 	nop
 	nop

-	.end clear_page
+	.end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c	2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c	2005-01-21 11:51:39.000000000 -0800
@@ -57,7 +57,7 @@ bootmem_data_t discontig_node_bdata[MAX_
 #endif

 void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);

 void show_mem(void)
 {
@@ -255,7 +255,7 @@ void __init mem_init(void)
 	 * later in the boot process if a better method is available.
 	 */
 	copy_page = copy_page_slow;
-	clear_page = clear_page_slow;
+	_clear_page = clear_page_slow;

 	/* this will put all low memory onto the freelists */
 	totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c	2005-01-21 11:51:39.000000000 -0800
@@ -78,7 +78,7 @@ static int __init pg_dma_init(void)
 		return ret;

 	copy_page = copy_page_dma;
-	clear_page = clear_page_dma;
+	_clear_page = clear_page_dma;

 	return ret;
 }
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c	2005-01-21 11:51:39.000000000 -0800
@@ -27,7 +27,7 @@ static void clear_page_nommu(void *to)
 static int __init pg_nommu_init(void)
 {
 	copy_page = copy_page_nommu;
-	clear_page = clear_page_nommu;
+	_clear_page = clear_page_nommu;

 	return 0;
 }
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c	2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c	2005-01-21 11:51:39.000000000 -0800
@@ -39,9 +39,9 @@

 static unsigned int clear_page_array[0x130 / 4];

-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));

-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 /*
  * Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c	2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c	2005-01-21 11:51:39.000000000 -0800
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(memcmp);
 EXPORT_SYMBOL(memscan);
 EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);

 EXPORT_SYMBOL(strcat);
 EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h	2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -25,7 +25,7 @@ extern void copy_page(void *to, const vo
 		preempt_enable();			\
 	} while (0)

-#define clear_page(page)	memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order)	memzero((void *)(page), PAGE_SIZE << (order))
 #define copy_page(to, from)  __copy_user_page(to, from, 0);

 #undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h	2005-01-21 11:51:39.000000000 -0800
@@ -14,8 +14,8 @@

 #ifndef __ASSEMBLY__

-extern void _clear_page(void *page);
-#define clear_page(X)	_clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y)	_clear_page((void *)(X),(Y))
 struct page;
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S	2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S	2005-01-21 11:51:39.000000000 -0800
@@ -28,9 +28,12 @@
 	.text

 	.globl		_clear_page
-_clear_page:		/* %o0=dest */
+_clear_page:		/* %o0=dest, %o1=order */
+	sethi		%hi(PAGE_SIZE/64), %o2
+	clr		%o4
+	or		%o2, %lo(PAGE_SIZE/64), %o2
 	ba,pt		%xcc, clear_page_common
-	 clr		%o4
+	 sllx		%o2, %o1, %o1

 	/* This thing is pretty important, it shows up
 	 * on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@ clear_user_page:	/* %o0=dest, %o1=vaddr
 	flush		%g6
 	wrpr		%o4, 0x0, %pstate

+	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		1, %o4
+	or		%o1, %lo(PAGE_SIZE/64), %o1

 clear_page_common:
 	VISEntryHalf
 	membar		#StoreLoad | #StoreStore | #LoadStore
 	fzero		%f0
-	sethi		%hi(PAGE_SIZE/64), %o1
 	mov		%o0, %g1		! remember vaddr for tlbflush
 	fzero		%f2
-	or		%o1, %lo(PAGE_SIZE/64), %o1
 	faddd		%f0, %f2, %f4
 	fmuld		%f0, %f2, %f6
 	faddd		%f0, %f2, %f8

^ permalink raw reply	[flat|nested] 87+ messages in thread

* A scrub daemon (prezeroing)
  2005-01-08 21:56                   ` David S. Miller
  2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
@ 2005-01-21 20:15                     ` Christoph Lameter
  2 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:15 UTC (permalink / raw)
  Cc: linux-mm, linux-kernel

Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. scrubd is disabled by default but can be enabled
by writing an order number to /proc/sys/vm/scrub_start. If a page
of that order or higher is coalesced then the scrub daemon will
start zeroing until all pages of order /proc/sys/vm/scrub_stop and
higher are zeroed, and then go back to sleep.
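
For example, with illustrative values:

	echo 5 >/proc/sys/vm/scrub_start	(wake kscrubd when an order 5 page coalesces)
	echo 2 >/proc/sys/vm/scrub_stop		(zero down through order 2, then sleep)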

In an SMP environment the scrub daemon typically runs
on the most idle cpu. Thus a single threaded application running
on one cpu may have another cpu zeroing pages for it. The scrub
daemon is hardly noticeable and usually finishes zeroing quickly since
most processors are optimized for linear memory filling.

Note that this patch does not depend on any other patches but other
patches would improve what scrubd does. The extension of clear_pages by an
order parameter would increase the speed of zeroing and the patch
introducing alloc_zeroed_user_highpage is necessary for user
pages to be allocated from the pool of zeroed pages.
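
Hardware zeroing engines hook into scrubd through the zero_driver
interface in the new include/linux/scrub.h below. As a minimal,
untested sketch of such a driver (the names are made up, and a real
start() would kick off an asynchronous engine instead of falling back
to memset):

	#include <linux/module.h>
	#include <linux/string.h>
	#include <linux/scrub.h>

	/* start() returns 0 once the engine has accepted the range */
	static int example_zero_start(void *addr, unsigned long size)
	{
		memset(addr, 0, size);	/* stand-in for a real DMA/BTE engine */
		return 0;
	}

	static struct zero_driver example_zero_driver = {
		.start = example_zero_start,
	};

	static int __init example_zero_init(void)
	{
		register_zero_driver(&example_zero_driver);
		return 0;
	}
	module_init(example_zero_init);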

Patch against 2.6.11-rc1-bk9

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c	2005-01-21 12:01:44.000000000 -0800
@@ -12,6 +12,8 @@
  *  Zone balancing, Kanoj Sarcar, SGI, Jan 2000
  *  Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
  *          (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Page zeroing by Christoph Lameter, SGI, Dec 2004 based on
+ *	initial code for __GFP_ZERO support by Andrea Arcangeli, Oct 2004.
  */

 #include <linux/config.h>
@@ -33,6 +35,7 @@
 #include <linux/cpu.h>
 #include <linux/nodemask.h>
 #include <linux/vmalloc.h>
+#include <linux/scrub.h>

 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -167,16 +170,16 @@ static void destroy_compound_page(struct
  * zone->lock is already acquired when we use these.
  * So, we don't need atomic page->flags operations here.
  */
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
 	return page->private;
 }

-static inline void set_page_order(struct page *page, int order) {
-	page->private = order;
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+	page->private = order + (zero << 10);
 	__SetPagePrivate(page);
 }

-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
 {
 	__ClearPagePrivate(page);
 	page->private = 0;
@@ -187,14 +190,15 @@ static inline void rmv_page_order(struct
  * we can do coalesce a page and its buddy if
  * (a) the buddy is free &&
  * (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ *     zeroing status.
  * for recording page's order, we use page->private and PG_private.
  *
  */
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
 {
        if (PagePrivate(page)           &&
-           (page_order(page) == order) &&
+           (page_zorder(page) == order + (zero << 10)) &&
            !PageReserved(page)         &&
             page_count(page) == 0)
                return 1;
@@ -225,22 +229,20 @@ static inline int page_is_buddy(struct p
  * -- wli
  */

-static inline void __free_pages_bulk (struct page *page, struct page *base,
-		struct zone *zone, unsigned int order)
+static inline int __free_pages_bulk (struct page *page, struct page *base,
+		struct zone *zone, unsigned int order, int zero)
 {
 	unsigned long page_idx;
 	struct page *coalesced;
-	int order_size = 1 << order;

 	if (unlikely(order))
 		destroy_compound_page(page, order);

 	page_idx = page - base;

-	BUG_ON(page_idx & (order_size - 1));
+	BUG_ON(page_idx & ((1 << order) - 1));
 	BUG_ON(bad_range(zone, page));

-	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		struct free_area *area;
 		struct page *buddy;
@@ -250,20 +252,21 @@ static inline void __free_pages_bulk (st
 		buddy = base + buddy_idx;
 		if (bad_range(zone, buddy))
 			break;
-		if (!page_is_buddy(buddy, order))
+		if (!page_is_buddy(buddy, order, zero))
 			break;
 		/* Move the buddy up one level. */
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
+		area = zone->free_area[zero] + order;
 		area->nr_free--;
-		rmv_page_order(buddy);
+		rmv_page_zorder(buddy);
 		page_idx &= buddy_idx;
 		order++;
 	}
 	coalesced = base + page_idx;
-	set_page_order(coalesced, order);
-	list_add(&coalesced->lru, &zone->free_area[order].free_list);
-	zone->free_area[order].nr_free++;
+	set_page_zorder(coalesced, order, zero);
+	list_add(&coalesced->lru, &zone->free_area[zero][order].free_list);
+	zone->free_area[zero][order].nr_free++;
+	return order;
 }

 static inline void free_pages_check(const char *function, struct page *page)
@@ -312,8 +315,11 @@ free_pages_bulk(struct zone *zone, int c
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_pages_bulk list manipulates */
 		list_del(&page->lru);
-		__free_pages_bulk(page, base, zone, order);
+		if (__free_pages_bulk(page, base, zone, order, NOT_ZEROED)
+			>= sysctl_scrub_start)
+				wakeup_kscrubd(zone);
 		ret++;
+		zone->free_pages += 1UL << order;
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
@@ -341,6 +347,18 @@ void __free_pages_ok(struct page *page,
 	free_pages_bulk(page_zone(page), 1, &list, order);
 }

+void end_zero_page(struct page *page, unsigned int order)
+{
+	unsigned long flags;
+	struct zone * zone = page_zone(page);
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	__free_pages_bulk(page, zone->zone_mem_map, zone, order, ZEROED);
+	zone->zero_pages += 1UL << order;
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}

 /*
  * The order of subdivision here is critical for the IO subsystem.
@@ -358,7 +376,7 @@ void __free_pages_ok(struct page *page,
  */
 static inline struct page *
 expand(struct zone *zone, struct page *page,
- 	int low, int high, struct free_area *area)
+ 	int low, int high, struct free_area *area, int zero)
 {
 	unsigned long size = 1 << high;

@@ -369,7 +387,7 @@ expand(struct zone *zone, struct page *p
 		BUG_ON(bad_range(zone, &page[size]));
 		list_add(&page[size].lru, &area->free_list);
 		area->nr_free++;
-		set_page_order(&page[size], high);
+		set_page_zorder(&page[size], high, zero);
 	}
 	return page;
 }
@@ -420,23 +438,44 @@ static void prep_new_page(struct page *p
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static inline void rmpage(struct page *page, struct free_area *area)
+{
+	list_del(&page->lru);
+	rmv_page_zorder(page);
+	area->nr_free--;
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area)
+{
+	unsigned long flags;
+	struct page *page = NULL;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	if (!list_empty(&area->free_list)) {
+		page = list_entry(area->free_list.next, struct page, lru);
+		rmpage(page, area);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
 {
-	struct free_area * area;
+	struct free_area *area;
 	unsigned int current_order;
 	struct page *page;

 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
+		area = zone->free_area[zero] + current_order;
 		if (list_empty(&area->free_list))
 			continue;

 		page = list_entry(area->free_list.next, struct page, lru);
-		list_del(&page->lru);
-		rmv_page_order(page);
-		area->nr_free--;
+		rmpage(page, zone->free_area[zero] + current_order);
 		zone->free_pages -= 1UL << order;
-		return expand(zone, page, order, current_order, area);
+		if (zero)
+			zone->zero_pages -= 1UL << order;
+		return expand(zone, page, order, current_order, area, zero);
 	}

 	return NULL;
@@ -448,7 +487,7 @@ static struct page *__rmqueue(struct zon
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list, int zero)
 {
 	unsigned long flags;
 	int i;
@@ -457,7 +496,7 @@ static int rmqueue_bulk(struct zone *zon

 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
 		if (page == NULL)
 			break;
 		allocated++;
@@ -504,7 +543,7 @@ void mark_free_pages(struct zone *zone)
 		ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));

 	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+		list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
 			unsigned long start_pfn, i;

 			start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -591,7 +630,7 @@ void fastcall free_cold_page(struct page
 	free_hot_cold_page(page, 1);
 }

-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
 {
 	int i;

@@ -610,7 +649,9 @@ buffered_rmqueue(struct zone *zone, int
 {
 	unsigned long flags;
 	struct page *page = NULL;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	int nr_pages = 1 << order;
+	int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+	int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;

 	if (order == 0) {
 		struct per_cpu_pages *pcp;
@@ -619,7 +660,7 @@ buffered_rmqueue(struct zone *zone, int
 		local_irq_save(flags);
 		if (pcp->count <= pcp->low)
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+						pcp->batch, &pcp->list, zero);
 		if (pcp->count) {
 			page = list_entry(pcp->list.next, struct page, lru);
 			list_del(&page->lru);
@@ -631,16 +672,25 @@ buffered_rmqueue(struct zone *zone, int

 	if (page == NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, zero);
+		/*
+		 * If we failed to obtain a zeroed or an unzeroed page
+		 * then we may still be able to obtain the other
+		 * type of page.
+		 */
+		if (!page) {
+			page = __rmqueue(zone, order, !zero);
+			zero = 0;
+		}
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

 	if (page != NULL) {
 		BUG_ON(bad_range(zone, page));
-		mod_page_state_zone(zone, pgalloc, 1 << order);
+		mod_page_state_zone(zone, pgalloc, nr_pages);
 		prep_new_page(page, order);

-		if (gfp_flags & __GFP_ZERO)
+		if ((gfp_flags & __GFP_ZERO) && !zero)
 			prep_zero_page(page, order, gfp_flags);

 		if (order && (gfp_flags & __GFP_COMP))
@@ -669,7 +719,7 @@ int zone_watermark_ok(struct zone *z, in
 		return 0;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
+		free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free) << o;

 		/* Require fewer higher order pages to be free */
 		min >>= 1;
@@ -1046,7 +1096,7 @@ unsigned long __read_page_state(unsigned
 }

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat)
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
 {
 	struct zone *zones = pgdat->node_zones;
 	int i;
@@ -1054,27 +1104,31 @@ void __get_zone_counts(unsigned long *ac
 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		*active += zones[i].nr_active;
 		*inactive += zones[i].nr_inactive;
 		*free += zones[i].free_pages;
+		*zero += zones[i].zero_pages;
 	}
 }

 void get_zone_counts(unsigned long *active,
-		unsigned long *inactive, unsigned long *free)
+		unsigned long *inactive, unsigned long *free, unsigned long *zero)
 {
 	struct pglist_data *pgdat;

 	*active = 0;
 	*inactive = 0;
 	*free = 0;
+	*zero = 0;
 	for_each_pgdat(pgdat) {
-		unsigned long l, m, n;
-		__get_zone_counts(&l, &m, &n, pgdat);
+		unsigned long l, m, n, o;
+		__get_zone_counts(&l, &m, &n, &o, pgdat);
 		*active += l;
 		*inactive += m;
 		*free += n;
+		*zero += o;
 	}
 }

@@ -1111,6 +1165,7 @@ void si_meminfo_node(struct sysinfo *val

 #define K(x) ((x) << (PAGE_SHIFT-10))

+const char *temp[3] = { "hot", "cold", "zero" };
 /*
  * Show free area list (used inside shift_scroll-lock stuff)
  * We also calculate the percentage fragmentation. We do this by counting the
@@ -1123,6 +1178,7 @@ void show_free_areas(void)
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;
 	struct zone *zone;

 	for_each_zone(zone) {
@@ -1143,10 +1199,10 @@ void show_free_areas(void)

 			pageset = zone->pageset + cpu;

-			for (temperature = 0; temperature < 2; temperature++)
+			for (temperature = 0; temperature < 3; temperature++)
 				printk("cpu %d %s: low %d, high %d, batch %d\n",
 					cpu,
-					temperature ? "cold" : "hot",
+					temp[temperature],
 					pageset->pcp[temperature].low,
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch);
@@ -1154,20 +1210,21 @@ void show_free_areas(void)
 	}

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 	printk("\nFree pages: %11ukB (%ukB HighMem)\n",
 		K(nr_free_pages()),
 		K(nr_free_highpages()));

 	printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
-		"unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+		"unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
 		active,
 		inactive,
 		ps.nr_dirty,
 		ps.nr_writeback,
 		ps.nr_unstable,
 		nr_free_pages(),
+		zero,
 		ps.nr_slab,
 		ps.nr_mapped,
 		ps.nr_page_table_pages);
@@ -1216,7 +1273,7 @@ void show_free_areas(void)

 		spin_lock_irqsave(&zone->lock, flags);
 		for (order = 0; order < MAX_ORDER; order++) {
-			nr = zone->free_area[order].nr_free;
+			nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
 			total += nr << order;
 			printk("%lu*%lukB ", nr, K(1UL) << order);
 		}
@@ -1516,8 +1573,10 @@ void zone_init_free_lists(struct pglist_
 {
 	int order;
 	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		zone->free_area[order].nr_free = 0;
+		INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+		INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
+		zone->free_area[NOT_ZEROED][order].nr_free = 0;
+		zone->free_area[ZEROED][order].nr_free = 0;
 	}
 }

@@ -1542,6 +1601,7 @@ static void __init free_area_init_core(s

 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->kscrubd_wait);
 	pgdat->kswapd_max_order = 0;

 	for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1565,6 +1625,7 @@ static void __init free_area_init_core(s
 		spin_lock_init(&zone->lru_lock);
 		zone->zone_pgdat = pgdat;
 		zone->free_pages = 0;
+		zone->zero_pages = 0;

 		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

@@ -1598,6 +1659,13 @@ static void __init free_area_init_core(s
 			pcp->high = 2 * batch;
 			pcp->batch = 1 * batch;
 			INIT_LIST_HEAD(&pcp->list);
+
+			pcp = &zone->pageset[cpu].pcp[2];	/* zero pages */
+			pcp->count = 0;
+			pcp->low = 0;
+			pcp->high = 2 * batch;
+			pcp->batch = 1 * batch;
+			INIT_LIST_HEAD(&pcp->list);
 		}
 		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
 				zone_names[j], realsize, batch);
@@ -1723,7 +1791,7 @@ static int frag_show(struct seq_file *m,
 		spin_lock_irqsave(&zone->lock, flags);
 		seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
 		for (order = 0; order < MAX_ORDER; ++order)
-			seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+			seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
Index: linux-2.6.10/include/linux/mmzone.h
===================================================================
--- linux-2.6.10.orig/include/linux/mmzone.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h	2005-01-21 11:56:07.000000000 -0800
@@ -51,7 +51,7 @@ struct per_cpu_pages {
 };

 struct per_cpu_pageset {
-	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+	struct per_cpu_pages pcp[3];	/* 0: hot.  1: cold  2: cold zeroed pages */
 #ifdef CONFIG_NUMA
 	unsigned long numa_hit;		/* allocated in intended node */
 	unsigned long numa_miss;	/* allocated in non intended node */
@@ -107,10 +107,14 @@ struct per_cpu_pageset {
  * ZONE_HIGHMEM	 > 896 MB	only page cache and user processes
  */

+#define NOT_ZEROED 0
+#define ZEROED 1
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long		free_pages;
 	unsigned long		pages_min, pages_low, pages_high;
+	unsigned long		zero_pages;
 	/*
 	 * protection[] is a pre-calculated number of extra pages that must be
 	 * available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@ struct zone {
 	 * free areas of different sizes
 	 */
 	spinlock_t		lock;
-	struct free_area	free_area[MAX_ORDER];
+	struct free_area	free_area[2][MAX_ORDER];


 	ZONE_PADDING(_pad1_)
@@ -266,6 +270,9 @@ typedef struct pglist_data {
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+
+	wait_queue_head_t       kscrubd_wait;
+	struct task_struct *kscrubd;
 } pg_data_t;

 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@ typedef struct pglist_data {
 extern struct pglist_data *pgdat_list;

 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free, struct pglist_data *pgdat);
+			unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
-			unsigned long *free);
+			unsigned long *free, unsigned long *zero);
 void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c	2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c	2005-01-21 11:56:07.000000000 -0800
@@ -123,12 +123,13 @@ static int meminfo_read_proc(char *page,
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;
 	unsigned long committed;
 	unsigned long allowed;
 	struct vmalloc_info vmi;

 	get_page_state(&ps);
-	get_zone_counts(&active, &inactive, &free);
+	get_zone_counts(&active, &inactive, &free, &zero);

 /*
  * display in kilobytes.
@@ -148,6 +149,7 @@ static int meminfo_read_proc(char *page,
 	len = sprintf(page,
 		"MemTotal:     %8lu kB\n"
 		"MemFree:      %8lu kB\n"
+		"MemZero:      %8lu kB\n"
 		"Buffers:      %8lu kB\n"
 		"Cached:       %8lu kB\n"
 		"SwapCached:   %8lu kB\n"
@@ -171,6 +173,7 @@ static int meminfo_read_proc(char *page,
 		"VmallocChunk: %8lu kB\n",
 		K(i.totalram),
 		K(i.freeram),
+		K(zero),
 		K(i.bufferram),
 		K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
 		K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
===================================================================
--- linux-2.6.10.orig/mm/readahead.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/readahead.c	2005-01-21 11:56:07.000000000 -0800
@@ -573,7 +573,8 @@ unsigned long max_sane_readahead(unsigne
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long free;
+	unsigned long zero;

-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
Index: linux-2.6.10/drivers/base/node.c
===================================================================
--- linux-2.6.10.orig/drivers/base/node.c	2005-01-21 10:43:56.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c	2005-01-21 11:56:07.000000000 -0800
@@ -42,13 +42,15 @@ static ssize_t node_read_meminfo(struct
 	unsigned long inactive;
 	unsigned long active;
 	unsigned long free;
+	unsigned long zero;

 	si_meminfo_node(&i, nid);
-	__get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+	__get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));

 	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
+		       "Node %d MemZero:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
 		       "Node %d Active:       %8lu kB\n"
 		       "Node %d Inactive:     %8lu kB\n"
@@ -58,6 +60,7 @@ static ssize_t node_read_meminfo(struct
 		       "Node %d LowFree:      %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
+		       nid, K(zero),
 		       nid, K(i.totalram - i.freeram),
 		       nid, K(active),
 		       nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h	2005-01-21 10:44:03.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h	2005-01-21 11:56:07.000000000 -0800
@@ -736,6 +736,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
===================================================================
--- linux-2.6.10.orig/mm/Makefile	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/Makefile	2005-01-21 11:56:07.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
Index: linux-2.6.10/mm/scrubd.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c	2005-01-21 11:56:07.000000000 -0800
@@ -0,0 +1,134 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 5;	/* if a page of this order is coalesced then run kscrubd */
+unsigned int sysctl_scrub_stop = 2;	/* Minimum order of page to zero */
+unsigned int sysctl_scrub_load = 999;	/* Do not run scrubd if load > */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for (order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area);
+			struct list_head *l;
+			int size = PAGE_SIZE << order;
+
+			if (!page)
+				continue;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+
+				if (driver->start(page_address(page), size) == 0)
+					goto done;
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			prep_zero_page(page, order, 0);
+done:
+			end_zero_page(page, order);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd = find_task_by_pid(
+				kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h	2005-01-21 11:56:07.000000000 -0800
@@ -0,0 +1,49 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+        int (*start)(void *, unsigned long);		/* Start bzero transfer */
+        struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+	if (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT))
+		return;
+	if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+		return;
+	wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+				      void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page, unsigned int order);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c	2005-01-21 11:56:07.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>

 #include <asm/uaccess.h>
@@ -827,6 +828,33 @@ static ctl_table vm_table[] = {
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_LOAD,
+		.procname	= "scrub_load",
+		.data		= &sysctl_scrub_load,
+		.maxlen		= sizeof(sysctl_scrub_load),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };

Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h	2005-01-21 11:56:07.000000000 -0800
@@ -169,6 +169,9 @@ enum
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_SCRUB_START=30,	/* order at which to wake up scrubd */
+	VM_SCRUB_STOP=31,	/* minimum order of page for scrubd to zero */
+	VM_SCRUB_LOAD=32,	/* Load factor at which not to scrub anymore */
 };


Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h	2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h	2005-01-21 11:56:07.000000000 -0800
@@ -131,4 +131,5 @@ extern void FASTCALL(free_cold_page(stru

 void page_alloc_init(void);

+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags);
 #endif /* __LINUX_GFP_H */


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
@ 2005-01-21 22:29                       ` Paul Mackerras
  2005-01-21 23:48                         ` Christoph Lameter
  2005-01-23  7:45                       ` Andrew Morton
  1 sibling, 1 reply; 87+ messages in thread
From: Paul Mackerras @ 2005-01-21 22:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
	linux-mm, linux-kernel

Christoph Lameter writes:

> The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> clear_page that is capable of zeroing multiple pages at once (and scrubd
> too but that is now an independent patch). The following patch extends
> clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> efficient zeroing of pages. Hope I caught everything....

Wouldn't it be nicer to call the version that takes the order
parameter "clear_pages" and then define clear_page(p) as
clear_pages(p, 0) ?

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 22:29                       ` Paul Mackerras
@ 2005-01-21 23:48                         ` Christoph Lameter
  2005-01-22  0:35                           ` Paul Mackerras
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-21 23:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
	linux-mm, linux-kernel

On Sat, 22 Jan 2005, Paul Mackerras wrote:

> Christoph Lameter writes:
>
> > The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> > clear_page that is capable of zeroing multiple pages at once (and scrubd
> > too but that is now an independent patch). The following patch extends
> > clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> > efficient zeroing of pages. Hope I caught everything....
>
> Wouldn't it be nicer to call the version that takes the order
> parameter "clear_pages" and then define clear_page(p) as
> clear_pages(p, 0) ?

clear_page clears one page of the specified order. clear_page cannot clear
multiple pages. Calling the function clear_pages would give a wrong
impression of what the function does and may lead to attempts to specify
the number of zero order pages as a parameter instead of the order.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 23:48                         ` Christoph Lameter
@ 2005-01-22  0:35                           ` Paul Mackerras
  2005-01-22  0:43                             ` Andrew Morton
  0 siblings, 1 reply; 87+ messages in thread
From: Paul Mackerras @ 2005-01-22  0:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
	linux-mm, linux-kernel

Christoph Lameter writes:

> clear_page clears one page of the specified order.

Now you're really being confusing.  A cluster of 2^n contiguous pages
isn't one page by any normal definition.  Call it "clear_page_cluster"
or "clear_page_order" or something, but not "clear_page".

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:35                           ` Paul Mackerras
@ 2005-01-22  0:43                             ` Andrew Morton
  2005-01-22  1:08                               ` Paul Mackerras
                                                 ` (2 more replies)
  0 siblings, 3 replies; 87+ messages in thread
From: Andrew Morton @ 2005-01-22  0:43 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> A cluster of 2^n contiguous pages
>  isn't one page by any normal definition.

It is, actually, from the POV of the page allocator.  It's a "higher order
page" and is controlled by a struct page*, just like a zero-order page...

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:43                             ` Andrew Morton
@ 2005-01-22  1:08                               ` Paul Mackerras
  2005-01-22  1:20                               ` Roman Zippel
  2005-01-22  1:25                               ` Paul Mackerras
  2 siblings, 0 replies; 87+ messages in thread
From: Paul Mackerras @ 2005-01-22  1:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

Andrew Morton writes:

> It is, actually, from the POV of the page allocator.  It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...

OK.  I still reckon it's confusing terminology for the rest of us who
don't have our heads deep in the page allocator code.

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:43                             ` Andrew Morton
  2005-01-22  1:08                               ` Paul Mackerras
@ 2005-01-22  1:20                               ` Roman Zippel
  2005-01-22  1:25                               ` Paul Mackerras
  2 siblings, 0 replies; 87+ messages in thread
From: Roman Zippel @ 2005-01-22  1:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Mackerras, clameter, davem, hugh, linux-ia64, torvalds,
	linux-mm, linux-kernel

Hi,

On Fri, 21 Jan 2005, Andrew Morton wrote:

> Paul Mackerras <paulus@samba.org> wrote:
> >
> > A cluster of 2^n contiguous pages
> >  isn't one page by any normal definition.
> 
> It is, actually, from the POV of the page allocator.  It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...

OTOH we also have alloc_page/alloc_pages.

bye, Roman

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  0:43                             ` Andrew Morton
  2005-01-22  1:08                               ` Paul Mackerras
  2005-01-22  1:20                               ` Roman Zippel
@ 2005-01-22  1:25                               ` Paul Mackerras
  2005-01-22  1:54                                 ` Christoph Lameter
  2 siblings, 1 reply; 87+ messages in thread
From: Paul Mackerras @ 2005-01-22  1:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

Andrew Morton writes:

> It is, actually, from the POV of the page allocator.  It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...

So why is the function that gets me one of these "higher order pages"
called "get_free_pages" with an "s"? :)

Christoph's patch is bigger than it needs to be because he has to
change all the occurrences of clear_page(x) to clear_page(x, 0), and
then he has to change a lot of architectures' clear_page functions to
be called _clear_page instead.  If he picked a different name for the
"clear a higher order page" function it would end up being less
invasive as well as less confusing.

The argument that clear_page is called that because it clears a higher
order page won't wash; all the clear_page implementations in his patch
are perfectly capable of clearing any contiguous set of 2^order pages
(oops, I mean "zero-order pages"), not just a "higher order page".

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  1:25                               ` Paul Mackerras
@ 2005-01-22  1:54                                 ` Christoph Lameter
  2005-01-22  2:53                                   ` Paul Mackerras
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-22  1:54 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andrew Morton, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

On Sat, 22 Jan 2005, Paul Mackerras wrote:

> Christoph's patch is bigger than it needs to be because he has to
> change all the occurrences of clear_page(x) to clear_page(x, 0), and
> then he has to change a lot of architectures' clear_page functions to
> be called _clear_page instead.  If he picked a different name for the
> "clear a higher order page" function it would end up being less
> invasive as well as less confusing.

I had the name "zero_page" in V1 and V2 of the patch where it was
separate. Then someone complained about code duplication.

> The argument that clear_page is called that because it clears a higher
> order page won't wash; all the clear_page implementations in his patch
> are perfectly capable of clearing any contiguous set of 2^order pages
> (oops, I mean "zero-order pages"), not just a "higher order page".

clear_page is called clear_page because it clears one page of *any* order,
not just higher orders. Higher-order pages are not segregated, nor are they
intrinsically better just because they contain more memory ;-).

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-22  1:54                                 ` Christoph Lameter
@ 2005-01-22  2:53                                   ` Paul Mackerras
  0 siblings, 0 replies; 87+ messages in thread
From: Paul Mackerras @ 2005-01-22  2:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter writes:

> I had the name "zero_page" in V1 and V2 of the patch where it was
> separate. Then someone complained about code duplication.

Well, if you duplicated each arch's clear_page implementation in
zero_page, then yes, that would be unnecessary code duplication.  I
would suggest that for architectures where the clear_page
implementation can easily be extended, rename it to clear_page_order
(or something) and #define clear_page(x) to be clear_page_order(x, 0).
For architectures where it can't, leave clear_page as clear_page and
define clear_page_order as an inline function that calls clear_page in
a loop.
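
Something like this, as an untested sketch (an arch would provide one
of the two halves, not both):

	/* arch with a native multi-page clear */
	extern void clear_page_order(void *page, unsigned int order);
	#define clear_page(addr)	clear_page_order((addr), 0)

	/* arch with only the classic single-page clear_page() */
	static inline void clear_page_order(void *page, unsigned int order)
	{
		unsigned int i;

		for (i = 0; i < (1U << order); i++)
			clear_page((char *)page + i * PAGE_SIZE);
	}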

> clear_page is called clear_page because it clears one page of *any* order
> not just higher orders. zero-order pages are not segregated nor are they
> intrisincally better just because they contain more memory ;-).

You have missed my point, which was about address constraints, not a
distinction between zero-order pages and higher-order pages.

Anyway, I remain of the opinion that your naming is inconsistent with
the naming of other functions that deal with zero-order and
higher-order pages, such as get_free_pages, alloc_pages, free_pages,
etc., and that your patch is unnecessarily intrusive.  I guess it's up
to Andrew to decide which way we go.

Paul.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
  2005-01-21 22:29                       ` Paul Mackerras
@ 2005-01-23  7:45                       ` Andrew Morton
  2005-01-24 16:37                         ` Christoph Lameter
  1 sibling, 1 reply; 87+ messages in thread
From: Andrew Morton @ 2005-01-23  7:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter <clameter@sgi.com> wrote:
>
> The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
>  clear_page that is capable of zeroing multiple pages at once (and scrubd
>  too but that is now an independent patch). The following patch extends
>  clear_page with a second parameter specifying the order of the page to be zeroed to allow an
>  efficient zeroing of pages. Hope I caught everything....
> 

Sorry, I take it back.  As Paul says:

: Wouldn't it be nicer to call the version that takes the order
: parameter "clear_pages" and then define clear_page(p) as
: clear_pages(p, 0) ?

It would make the patch considerably smaller, and our naming is all over
the place anyway...

>  -static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
>  +void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
>   {
>   	int i;
> 
>   	BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
>  +	if (!PageHighMem(page)) {
>  +		clear_page(page_address(page), order);
>  +		return;
>  +	}
>  +
>   	for(i = 0; i < (1 << order); i++)
>   		clear_highpage(page + i);
>   }

I'd have thought that we'd want to make the new clear_pages() handle
highmem pages too, if only from a regularity POV.  x86 hugetlbpages could
use it then, if someone thinks up a fast page-clearer.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-23  7:45                       ` Andrew Morton
@ 2005-01-24 16:37                         ` Christoph Lameter
  2005-01-24 20:23                           ` David S. Miller
  0 siblings, 1 reply; 87+ messages in thread
From: Christoph Lameter @ 2005-01-24 16:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

On Sat, 22 Jan 2005, Andrew Morton wrote:

> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> >  clear_page that is capable of zeroing multiple pages at once (and scrubd
> >  too but that is now an independent patch). The following patch extends
> >  clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> >  efficient zeroing of pages. Hope I caught everything....
> >
>
> Sorry, I take it back.  As Paul says:
>
> : Wouldn't it be nicer to call the version that takes the order
> : parameter "clear_pages" and then define clear_page(p) as
> : clear_pages(p, 0) ?

> It would make the patch considerably smaller, and our naming is all over
> the place anyway...

Sounds good. Note though that this just means renaming clear_page to
clear_pages for all arches, which would increase the patch size for the
arch specific section.

> I'd have thought that we'd want to make the new clear_pages() handle
> highmem pages too, if only from a regularity POV.  x86 hugetlbpages could
> use it then, if someone thinks up a fast page-clearer.

That would get us back to code duplication. We would have a clear_page (no
highmem support) and a clear_pages (supporting highmem). Then it may
also be better to pass the page struct to clear_pages instead of a memory address.
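
Roughly this, as an untested sketch that leans on the existing
clear_highpage() helper to cover both HIGHMEM and non-HIGHMEM
configurations:

	static inline void clear_pages(struct page *page, unsigned int order)
	{
		int i;

		for (i = 0; i < (1 << order); i++)
			clear_highpage(page + i);	/* kmaps highmem pages as needed */
	}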


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-24 16:37                         ` Christoph Lameter
@ 2005-01-24 20:23                           ` David S. Miller
  2005-01-24 20:33                             ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: David S. Miller @ 2005-01-24 20:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

On Mon, 24 Jan 2005 08:37:15 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:

> Then it may also be better to pass the page struct to clear_pages
> instead of a memory address.

What is more generally available at the call sites at this time?
Consider both HIGHMEM and non-HIGHMEM setups in your estimation
please :-)


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: Extend clear_page by an order parameter
  2005-01-24 20:23                           ` David S. Miller
@ 2005-01-24 20:33                             ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-01-24 20:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: akpm, hugh, linux-ia64, torvalds, linux-mm, linux-kernel

On Mon, 24 Jan 2005, David S. Miller wrote:

> On Mon, 24 Jan 2005 08:37:15 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Then it may also be better to pass the page struct to clear_pages
> > instead of a memory address.
>
> What is more generally available at the call sites at this time?
> Consider both HIGHMEM and non-HIGHMEM setups in your estimation
> please :-)

The only call site is prep_zero_page, which has a GFP flag, the order, and
the pointer to struct page.

The patch makes the huge page code call prep_zero_page and scrubd will
also call prep_zero_page.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL
  2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
@ 2005-02-09  9:58                       ` Michael Ellerman
  2005-02-10  0:38                         ` Christoph Lameter
  0 siblings, 1 reply; 87+ messages in thread
From: Michael Ellerman @ 2005-02-09  9:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: davem, hugh, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Hi All,

The generic and IA-64 versions of alloc_zeroed_user_highpage() don't
check the return value from alloc_page_vma(). This can lead to an oops
if we're OOM.

This fixes my oops on PPC64, but I haven't got an IA-64 machine/compiler
handy.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>

diff -rN -p -u oombreakage-old/include/asm-ia64/page.h oombreakage-new/include/asm-ia64/page.h
--- oombreakage-old/include/asm-ia64/page.h	2005-02-04 04:10:37.000000000 +1100
+++ oombreakage-new/include/asm-ia64/page.h	2005-02-09 20:53:37.000000000 +1100
@@ -79,7 +79,8 @@ do {						\
 #define alloc_zeroed_user_highpage(vma, vaddr) \
 ({						\
 	struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
-	flush_dcache_page(page);		\
+	if (page)				\
+		flush_dcache_page(page);	\
 	page;					\
 })
 
diff -rN -p -u oombreakage-old/include/linux/highmem.h oombreakage-new/include/linux/highmem.h
--- oombreakage-old/include/linux/highmem.h	2005-02-09 20:22:41.000000000 +1100
+++ oombreakage-new/include/linux/highmem.h	2005-02-09 20:47:01.000000000 +1100
@@ -48,7 +48,9 @@ alloc_zeroed_user_highpage(struct vm_are
 {
 	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
 
-	clear_user_highpage(page, vaddr);
+	if (page)
+		clear_user_highpage(page, vaddr);
+
 	return page;
 }
 #endif




^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL
  2005-02-09  9:58                       ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
@ 2005-02-10  0:38                         ` Christoph Lameter
  0 siblings, 0 replies; 87+ messages in thread
From: Christoph Lameter @ 2005-02-10  0:38 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: akpm, linux-kernel

On Wed, 9 Feb 2005, Michael Ellerman wrote:

> The generic and IA-64 versions of alloc_zeroed_user_highpage() don't
> check the return value from alloc_page_vma(). This can lead to an oops
> if we're OOM. This fixes my oops on PPC64, but I haven't got an IA-64
> machine/compiler handy.

Patch looks okay to me. These are the only occurrences as far as I can tell
after reviewing the alloc_zeroed_user_highpage implementations in
include/asm-*/page.h.



^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2005-02-10  0:39 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>
     [not found] ` <41C20E3E.3070209@yahoo.com.au>
2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-21 19:56     ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
2005-01-01  2:22       ` Nick Piggin
2005-01-01  2:55         ` pmarques
2004-12-21 19:57     ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
2004-12-22 12:46       ` Robin Holt
2004-12-22 19:56         ` Christoph Lameter
2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
2004-12-23 19:33         ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
2004-12-24  8:33           ` Pavel Machek
2004-12-24 16:18             ` Christoph Lameter
2004-12-24 16:27               ` Pavel Machek
2004-12-24 17:02                 ` David S. Miller
2004-12-24 17:05           ` David S. Miller
2004-12-27 22:48             ` David S. Miller
2005-01-03 17:52             ` Christoph Lameter
2005-01-01 10:24           ` Geert Uytterhoeven
2005-01-04 23:12             ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
2005-01-04 23:13               ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-04 23:45                 ` Dave Hansen
2005-01-05  1:16                   ` Christoph Lameter
2005-01-05  1:26                     ` Linus Torvalds
2005-01-05 23:11                       ` Christoph Lameter
2005-01-05  0:34                 ` Linus Torvalds
2005-01-05  0:47                   ` Andrew Morton
2005-01-05  1:15                     ` Christoph Lameter
2005-01-08 21:12                 ` Hugh Dickins
2005-01-08 21:56                   ` David S. Miller
2005-01-21 20:09                     ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
2005-02-09  9:58                       ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
2005-02-10  0:38                         ` Christoph Lameter
2005-01-21 20:12                     ` Extend clear_page by an order parameter Christoph Lameter
2005-01-21 22:29                       ` Paul Mackerras
2005-01-21 23:48                         ` Christoph Lameter
2005-01-22  0:35                           ` Paul Mackerras
2005-01-22  0:43                             ` Andrew Morton
2005-01-22  1:08                               ` Paul Mackerras
2005-01-22  1:20                               ` Roman Zippel
2005-01-22  1:25                               ` Paul Mackerras
2005-01-22  1:54                                 ` Christoph Lameter
2005-01-22  2:53                                   ` Paul Mackerras
2005-01-23  7:45                       ` Andrew Morton
2005-01-24 16:37                         ` Christoph Lameter
2005-01-24 20:23                           ` David S. Miller
2005-01-24 20:33                             ` Christoph Lameter
2005-01-21 20:15                     ` A scrub daemon (prezeroing) Christoph Lameter
2005-01-10 17:16                   ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-10 18:13                     ` Linus Torvalds
2005-01-10 20:17                       ` Christoph Lameter
2005-01-10 23:53                       ` Prezeroing V4 [0/4]: Overview Christoph Lameter
2005-01-10 23:54                         ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
2005-01-11  0:41                           ` Chris Wright
2005-01-11  0:46                             ` Christoph Lameter
2005-01-11  0:49                               ` Chris Wright
2005-01-10 23:55                         ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
2005-01-10 23:55                         ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
2005-01-10 23:56                         ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
2005-01-04 23:14               ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
2005-01-05 23:25                 ` Christoph Lameter
2005-01-06 13:52                   ` Andi Kleen
2005-01-06 17:47                     ` Christoph Lameter
2005-01-04 23:15               ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
2005-01-04 23:16               ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
2005-01-05  2:16                 ` Andi Kleen
2005-01-05 16:24                   ` Christoph Lameter
2004-12-23 19:34         ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
2004-12-23 19:35         ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
2004-12-23 20:08         ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
2004-12-24 16:24           ` Christoph Lameter
2004-12-23 19:49       ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
2004-12-23 20:57       ` Matt Mackall
2004-12-23 21:01       ` Paul Mackerras
2004-12-23 21:11       ` Paul Mackerras
2004-12-23 21:37         ` Andrew Morton
2004-12-23 23:00           ` Paul Mackerras
2004-12-23 21:48         ` Linus Torvalds
2004-12-23 22:34           ` Zwane Mwaikambo
2004-12-24  9:14           ` Arjan van de Ven
2004-12-24 18:21             ` Linus Torvalds
2004-12-24 18:57               ` Arjan van de Ven
2004-12-27 22:50               ` David S. Miller
2004-12-28 11:53                 ` Marcelo Tosatti
2004-12-24 16:17           ` Christoph Lameter
2004-12-24 18:31     ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
2005-01-03 17:54       ` Christoph Lameter

This is a public inbox; see the mirroring instructions for how to clone
and mirror all data and code used for this inbox, as well as the URLs
for its NNTP newsgroup(s).