* Increase page fault rate by prezeroing V1 [0/3]: Overview
[not found] ` <41C20E3E.3070209@yahoo.com.au>
@ 2004-12-21 19:55 ` Christoph Lameter
2004-12-21 19:56 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
` (4 more replies)
0 siblings, 5 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:55 UTC (permalink / raw)
To: Nick Piggin
Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
linux-mm, linux-kernel
The patches that increase the page fault rate (introduction of atomic pte operations
and anticipatory prefaulting) do so by reducing locking overhead and are therefore
mainly of interest for applications running on SMP systems with a high number of
cpus. Single-thread performance shows only minor gains; only multi-threaded
applications benefit significantly.
Apart from SMP locking overhead, the most expensive operation in the page fault
handler is zeroing the page. Others have seen this too and have tried to provide
zeroed pages to the page fault handler:
http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2
The problem so far has been that simply zeroing pages ahead of time merely shifts
the time spent elsewhere. Moreover, one would not want to zero hot pages.
This patch addresses those issues by making page zeroing more effective:
1. Zeroing operations are aggregated so that they mainly apply to higher-order
pages, which allows many order-0 pages to be zeroed in one go.
For that purpose a new architecture-specific function zero_page(page, order)
is introduced.
2. Hardware support for offloading zeroing from the cpu. This avoids
invalidating the cpu caches through extensive zeroing operations.
The result is a significant increase of the page fault performance even for
single threaded applications:
w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.146s 11.155s 11.030s 69584.896 69566.852
w/patch
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 1 1 0.014s 0.110s 0.012s 524292.194 517665.538
This is a performance increase by a factor of 8!
The performance gain can only be sustained while enough zeroed pages are available.
A memory-intensive benchmark exhausts them very quickly, but the efficient page
zeroing algorithm still makes this a winner
(8-way system with 6 GB RAM, no hardware zeroing support):
w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.146s 11.155s 11.030s 69584.896 69566.852
4 3 2 0.170s 14.909s 7.097s 52150.369 98643.687
4 3 4 0.181s 16.597s 5.079s 46869.167 135642.420
4 3 8 0.166s 23.239s 4.037s 33599.215 179791.120
w/patch
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.183s 2.750s 2.093s 268077.996 267952.890
4 3 2 0.185s 4.876s 2.097s 155344.562 263967.292
4 3 4 0.150s 6.617s 2.097s 116205.793 264774.080
4 3 8 0.186s 13.693s 3.054s 56659.819 221701.073
The patch is composed of 3 parts:
[1/3] Introduce __GFP_ZERO
Modifies the page allocator to accept the __GFP_ZERO flag
and return zeroed memory on request. Changes locations throughout
the linux sources that allocate a page and then zero it so that
they request a zeroed page instead.
Adds new low-level zero_page functions for i386, ia64 and x86_64.
(x86_64 untested)
[2/3] Page Zeroing
Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. scrubd is disabled by default but can be enabled
by writing an order number to /proc/sys/vm/scrub_start. When a page
of that order is coalesced, the scrub daemon starts zeroing
until all pages of order /proc/sys/vm/scrub_stop and higher are
zeroed.
[3/3] SGI Altix Block Transfer Engine Support
Implements a driver that offloads zeroing from the cpu to hardware.
With hardware support, zeroing has minimal impact on the
performance of the system.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
@ 2004-12-21 19:56 ` Christoph Lameter
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
` (3 subsequent siblings)
4 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:56 UTC (permalink / raw)
To: Nick Piggin
Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
linux-mm, linux-kernel
This patch introduces __GFP_ZERO as an additional gfp_mask flag to allow
zeroed pages to be requested from the page allocator.
- Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set
- Replaces all page zeroing after allocation with requests for
zeroed pages
- Adds an arch-specific call zero_page to clear pages of order greater
than 0, with a fallback to repeated calls to clear_page if an
architecture does not support zero_page(address, order) yet
- Adds an ia64 zero_page function
- Adds an i386 zero_page function
- Adds an x86_64 zero_page function (untested, unverified)
Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c 2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c 2004-12-21 10:19:37.000000000 -0800
@@ -575,6 +575,18 @@
BUG_ON(bad_range(zone, page));
mod_page_state_zone(zone, pgalloc, 1 << order);
prep_new_page(page, order);
+
+ if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+ if (PageHighMem(page)) {
+ int n = 1 << order;
+
+ while (n-- > 0)
+ clear_highpage(page + n);
+ } else
+#endif
+ zero_page(page_address(page), order);
+ }
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
@@ -767,12 +779,9 @@
*/
BUG_ON(gfp_mask & __GFP_HIGHMEM);
- page = alloc_pages(gfp_mask, 0);
- if (page) {
- void *address = page_address(page);
- clear_page(address);
- return (unsigned long) address;
- }
+ page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+ if (page)
+ return (unsigned long) page_address(page);
return 0;
}
Index: linux-2.6.9/include/linux/gfp.h
===================================================================
--- linux-2.6.9.orig/include/linux/gfp.h 2004-10-18 14:53:44.000000000 -0700
+++ linux-2.6.9/include/linux/gfp.h 2004-12-21 10:19:37.000000000 -0800
@@ -37,6 +37,7 @@
#define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
#define __GFP_COMP 0x4000 /* Add compound page metadata */
+#define __GFP_ZERO 0x8000 /* Return zeroed page on success */
#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-21 10:19:37.000000000 -0800
@@ -1445,10 +1445,9 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
if (!page)
goto no_mem;
- clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.9/kernel/profile.c
===================================================================
--- linux-2.6.9.orig/kernel/profile.c 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/kernel/profile.c 2004-12-21 10:19:37.000000000 -0800
@@ -326,17 +326,15 @@
node = cpu_to_node(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
return NOTIFY_BAD;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_free;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
}
break;
@@ -510,16 +508,14 @@
int node = cpu_to_node(cpu);
struct page *page;
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0]
= (struct profile_hit *)page_address(page);
}
Index: linux-2.6.9/mm/shmem.c
===================================================================
--- linux-2.6.9.orig/mm/shmem.c 2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/mm/shmem.c 2004-12-21 10:19:37.000000000 -0800
@@ -369,9 +369,8 @@
}
spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
if (page) {
- clear_highpage(page);
page->nr_swapped = 0;
}
spin_lock(&info->lock);
@@ -910,7 +909,7 @@
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp, &pvma, 0);
+ page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -926,7 +925,7 @@
shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
unsigned long idx)
{
- return alloc_page(gfp);
+ return alloc_page(gfp | __GFP_ZERO);
}
#endif
@@ -1135,7 +1134,6 @@
info->alloced++;
spin_unlock(&info->lock);
- clear_highpage(filepage);
flush_dcache_page(filepage);
SetPageUptodate(filepage);
}
Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c 2004-12-21 10:19:37.000000000 -0800
@@ -77,7 +77,6 @@
struct page *alloc_huge_page(void)
{
struct page *page;
- int i;
spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -88,8 +87,7 @@
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
- clear_highpage(&page[i]);
+ zero_page(page_address(page), HUGETLB_PAGE_ORDER);
return page;
}
Index: linux-2.6.9/arch/ia64/lib/Makefile
===================================================================
--- linux-2.6.9.orig/arch/ia64/lib/Makefile 2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/arch/ia64/lib/Makefile 2004-12-21 10:19:37.000000000 -0800
@@ -6,7 +6,7 @@
lib-y := __divsi3.o __udivsi3.o __modsi3.o __umodsi3.o \
__divdi3.o __udivdi3.o __moddi3.o __umoddi3.o \
- bitop.o checksum.o clear_page.o csum_partial_copy.o copy_page.o \
+ bitop.o checksum.o clear_page.o zero_page.o csum_partial_copy.o copy_page.o \
clear_user.o strncpy_from_user.o strlen_user.o strnlen_user.o \
flush.o ip_fast_csum.o do_csum.o \
memset.o strlen.o swiotlb.o
Index: linux-2.6.9/include/asm-ia64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/page.h 2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/page.h 2004-12-21 10:19:37.000000000 -0800
@@ -57,6 +57,8 @@
# define STRICT_MM_TYPECHECKS
extern void clear_page (void *page);
+extern void zero_page (void *page, int order);
+
extern void copy_page (void *to, void *from);
/*
Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-12-21 10:19:37.000000000 -0800
@@ -61,9 +61,7 @@
pgd_t *pgd = pgd_alloc_one_fast(mm);
if (unlikely(pgd == NULL)) {
- pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
- if (likely(pgd != NULL))
- clear_page(pgd);
+ pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
}
return pgd;
}
@@ -107,10 +105,8 @@
static inline pmd_t*
pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pmd != NULL))
- clear_page(pmd);
return pmd;
}
@@ -141,20 +137,16 @@
static inline struct page *
pte_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
- if (likely(pte != NULL))
- clear_page(page_address(pte));
return pte;
}
static inline pte_t *
pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pte != NULL))
- clear_page(pte);
return pte;
}
Index: linux-2.6.9/arch/ia64/lib/zero_page.S
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/arch/ia64/lib/zero_page.S 2004-12-21 10:19:37.000000000 -0800
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 1999-2002 Hewlett-Packard Co
+ * Stephane Eranian <eranian@hpl.hp.com>
+ * David Mosberger-Tang <davidm@hpl.hp.com>
+ * Copyright (C) 2002 Ken Chen <kenneth.w.chen@intel.com>
+ *
+ * 1/06/01 davidm Tuned for Itanium.
+ * 2/12/02 kchen Tuned for both Itanium and McKinley
+ * 3/08/02 davidm Some more tweaking
+ * 12/10/04 clameter Make it work on pages of order size
+ */
+#include <linux/config.h>
+
+#include <asm/asmmacro.h>
+#include <asm/page.h>
+
+#ifdef CONFIG_ITANIUM
+# define L3_LINE_SIZE 64 // Itanium L3 line size
+# define PREFETCH_LINES 9 // magic number
+#else
+# define L3_LINE_SIZE 128 // McKinley L3 line size
+# define PREFETCH_LINES 12 // magic number
+#endif
+
+#define saved_lc r2
+#define dst_fetch r3
+#define dst1 r8
+#define dst2 r9
+#define dst3 r10
+#define dst4 r11
+
+#define dst_last r31
+#define totsize r14
+
+GLOBAL_ENTRY(zero_page)
+ .prologue
+ .regstk 2,0,0,0
+ mov r16 = PAGE_SIZE/L3_LINE_SIZE // main loop count
+ mov totsize = PAGE_SIZE
+ .save ar.lc, saved_lc
+ mov saved_lc = ar.lc
+ ;;
+ .body
+ adds dst1 = 16, in0
+ mov ar.lc = (PREFETCH_LINES - 1)
+ mov dst_fetch = in0
+ adds dst2 = 32, in0
+ shl r16 = r16, in1
+ shl totsize = totsize, in1
+ ;;
+.fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+ adds dst3 = 48, in0 // executing this multiple times is harmless
+ br.cloop.sptk.few .fetch
+ add r16 = -1,r16
+ add dst_last = totsize, dst_fetch
+ adds dst4 = 64, in0
+ ;;
+ mov ar.lc = r16 // one L3 line per iteration
+ adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
+ ;;
+#ifdef CONFIG_ITANIUM
+ // Optimized for Itanium
+1: stf.spill.nta [dst1] = f0, 64
+ stf.spill.nta [dst2] = f0, 64
+ cmp.lt p8,p0=dst_fetch, dst_last
+ ;;
+#else
+ // Optimized for McKinley
+1: stf.spill.nta [dst1] = f0, 64
+ stf.spill.nta [dst2] = f0, 64
+ stf.spill.nta [dst3] = f0, 64
+ stf.spill.nta [dst4] = f0, 128
+ cmp.lt p8,p0=dst_fetch, dst_last
+ ;;
+ stf.spill.nta [dst1] = f0, 64
+ stf.spill.nta [dst2] = f0, 64
+#endif
+ stf.spill.nta [dst3] = f0, 64
+(p8) stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
+ br.cloop.sptk.few 1b
+ ;;
+ mov ar.lc = saved_lc // restore lc
+ br.ret.sptk.many rp
+END(zero_page)
Index: linux-2.6.9/include/asm-i386/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/page.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-i386/page.h 2004-12-21 10:19:37.000000000 -0800
@@ -20,6 +20,7 @@
#define clear_page(page) mmx_clear_page((void *)(page))
#define copy_page(to,from) mmx_copy_page(to,from)
+#define zero_page(page, order) mmx_zero_page(page, order)
#else
@@ -29,6 +30,7 @@
*/
#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define zero_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#endif
Index: linux-2.6.9/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/page.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/page.h 2004-12-21 10:19:37.000000000 -0800
@@ -33,6 +33,7 @@
#ifndef __ASSEMBLY__
void clear_page(void *);
+void zero_page(void *, int);
void copy_page(void *, void *);
#define clear_user_page(page, vaddr, pg) clear_page(page)
Index: linux-2.6.9/include/asm-sparc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc/page.h 2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-sparc/page.h 2004-12-21 10:19:37.000000000 -0800
@@ -29,6 +29,7 @@
#ifndef __ASSEMBLY__
#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define zero_page(page,order) memset((void *)(page), 0, PAGE_SIZE <<(order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
do { clear_page(addr); \
Index: linux-2.6.9/include/asm-s390/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/page.h 2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/include/asm-s390/page.h 2004-12-21 10:19:37.000000000 -0800
@@ -33,6 +33,17 @@
: "+&a" (rp) : : "memory", "cc", "1" );
}
+static inline void zero_page(void *page, int order)
+{
+ register_pair rp;
+
+ rp.subreg.even = (unsigned long) page;
+ rp.subreg.odd = (unsigned long) 4096 << order;
+ asm volatile (" slr 1,1\n"
+ " mvcl %0,0"
+ : "+&a" (rp) : : "memory", "cc", "1" );
+}
+
static inline void copy_page(void *to, void *from)
{
if (MACHINE_HAS_MVPG)
Index: linux-2.6.9/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.9.orig/arch/i386/lib/mmx.c 2004-10-18 14:54:23.000000000 -0700
+++ linux-2.6.9/arch/i386/lib/mmx.c 2004-12-21 10:55:00.000000000 -0800
@@ -161,6 +161,39 @@
kernel_fpu_end();
}
+static void fast_zero_page(void *page, int order)
+{
+ int i;
+
+ kernel_fpu_begin();
+
+ __asm__ __volatile__ (
+ " pxor %%mm0, %%mm0\n" : :
+ );
+
+ for(i=0;i<((4096/64) << order);i++)
+ {
+ __asm__ __volatile__ (
+ " movntq %%mm0, (%0)\n"
+ " movntq %%mm0, 8(%0)\n"
+ " movntq %%mm0, 16(%0)\n"
+ " movntq %%mm0, 24(%0)\n"
+ " movntq %%mm0, 32(%0)\n"
+ " movntq %%mm0, 40(%0)\n"
+ " movntq %%mm0, 48(%0)\n"
+ " movntq %%mm0, 56(%0)\n"
+ : : "r" (page) : "memory");
+ page+=64;
+ }
+ /* since movntq is weakly-ordered, a "sfence" is needed to become
+ * ordered again.
+ */
+ __asm__ __volatile__ (
+ " sfence \n" : :
+ );
+ kernel_fpu_end();
+}
+
static void fast_copy_page(void *to, void *from)
{
int i;
@@ -293,6 +326,42 @@
kernel_fpu_end();
}
+static void fast_zero_page(void *page, int order)
+{
+ int i;
+
+ kernel_fpu_begin();
+
+ __asm__ __volatile__ (
+ " pxor %%mm0, %%mm0\n" : :
+ );
+
+ for(i=0;i<((4096/128) << order);i++)
+ {
+ __asm__ __volatile__ (
+ " movq %%mm0, (%0)\n"
+ " movq %%mm0, 8(%0)\n"
+ " movq %%mm0, 16(%0)\n"
+ " movq %%mm0, 24(%0)\n"
+ " movq %%mm0, 32(%0)\n"
+ " movq %%mm0, 40(%0)\n"
+ " movq %%mm0, 48(%0)\n"
+ " movq %%mm0, 56(%0)\n"
+ " movq %%mm0, 64(%0)\n"
+ " movq %%mm0, 72(%0)\n"
+ " movq %%mm0, 80(%0)\n"
+ " movq %%mm0, 88(%0)\n"
+ " movq %%mm0, 96(%0)\n"
+ " movq %%mm0, 104(%0)\n"
+ " movq %%mm0, 112(%0)\n"
+ " movq %%mm0, 120(%0)\n"
+ : : "r" (page) : "memory");
+ page+=128;
+ }
+
+ kernel_fpu_end();
+}
+
static void fast_copy_page(void *to, void *from)
{
int i;
@@ -359,7 +428,7 @@
* Favour MMX for page clear and copy.
*/
-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page)
{
int d0, d1;
__asm__ __volatile__( \
@@ -369,15 +438,34 @@
:"a" (0),"1" (page),"0" (1024)
:"memory");
}
+
+static void slow_zero_page(void * page, int order)
+{
+ int d0, d1;
+ __asm__ __volatile__( \
+ "cld\n\t" \
+ "rep ; stosl" \
+ : "=&c" (d0), "=&D" (d1)
+ :"a" (0),"1" (page),"0" (1024 << order)
+ :"memory");
+}
void mmx_clear_page(void * page)
{
if(unlikely(in_interrupt()))
- slow_zero_page(page);
+ slow_clear_page(page);
else
fast_clear_page(page);
}
+void mmx_zero_page(void * page, int order)
+{
+ if(unlikely(in_interrupt()))
+ slow_zero_page(page, order);
+ else
+ fast_zero_page(page, order);
+}
+
static void slow_copy_page(void *to, void *from)
{
int d0, d1, d2;
Index: linux-2.6.9/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.9.orig/arch/i386/mm/pgtable.c 2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/i386/mm/pgtable.c 2004-12-21 10:19:37.000000000 -0800
@@ -132,10 +132,7 @@
pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
- return pte;
+ return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
}
struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -143,12 +140,10 @@
struct page *pte;
#ifdef CONFIG_HIGHPTE
- pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
#else
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
#endif
- if (pte)
- clear_highpage(pte);
return pte;
}
Index: linux-2.6.9/arch/i386/kernel/i386_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/i386/kernel/i386_ksyms.c 2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/i386/kernel/i386_ksyms.c 2004-12-21 10:19:37.000000000 -0800
@@ -126,6 +126,7 @@
#ifdef CONFIG_X86_USE_3DNOW
EXPORT_SYMBOL(_mmx_memcpy);
EXPORT_SYMBOL(mmx_clear_page);
+EXPORT_SYMBOL(mmx_zero_page);
EXPORT_SYMBOL(mmx_copy_page);
#endif
Index: linux-2.6.9/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.9.orig/drivers/block/pktcdvd.c 2004-12-17 14:40:12.000000000 -0800
+++ linux-2.6.9/drivers/block/pktcdvd.c 2004-12-21 10:19:37.000000000 -0800
@@ -125,22 +125,19 @@
int i;
struct packet_data *pkt;
- pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
+ pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
if (!pkt)
goto no_pkt;
- memset(pkt, 0, sizeof(struct packet_data));
pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
if (!pkt->w_bio)
goto no_bio;
for (i = 0; i < PAGES_PER_PACKET; i++) {
- pkt->pages[i] = alloc_page(GFP_KERNEL);
+ pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
if (!pkt->pages[i])
goto no_page;
}
- for (i = 0; i < PAGES_PER_PACKET; i++)
- clear_page(page_address(pkt->pages[i]));
spin_lock_init(&pkt->lock);
Index: linux-2.6.9/arch/x86_64/lib/zero_page.S
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/arch/x86_64/lib/zero_page.S 2004-12-21 10:19:37.000000000 -0800
@@ -0,0 +1,52 @@
+/*
+ * Zero a page.
+ * rdi page
+ */
+ .globl zero_page
+ .p2align 4
+zero_page:
+	xorl	%eax,%eax
+	movl	%esi,%ecx		/* order */
+	movl	$4096/64,%edx
+	shll	%cl,%edx		/* 64-byte blocks: (4096/64) << order */
+	movl	%edx,%ecx
+ .p2align 4
+.Lloop:
+ decl %ecx
+#define PUT(x) movq %rax,x*8(%rdi)
+ movq %rax,(%rdi)
+ PUT(1)
+ PUT(2)
+ PUT(3)
+ PUT(4)
+ PUT(5)
+ PUT(6)
+ PUT(7)
+ leaq 64(%rdi),%rdi
+ jnz .Lloop
+ nop
+ ret
+zero_page_end:
+
+ /* C stepping K8 run faster using the string instructions.
+ It is also a lot simpler. Use this when possible */
+
+#include <asm/cpufeature.h>
+
+ .section .altinstructions,"a"
+ .align 8
+ .quad zero_page
+ .quad zero_page_c
+ .byte X86_FEATURE_K8_C
+	.byte zero_page_end-zero_page
+	.byte zero_page_c_end-zero_page_c
+ .previous
+
+ .section .altinstr_replacement,"ax"
+zero_page_c:
+	movl	%esi,%ecx		/* order */
+	movl	$4096/8,%edx
+	shll	%cl,%edx		/* quadwords: (4096/8) << order */
+	movl	%edx,%ecx
+	xorl	%eax,%eax
+ rep
+ stosq
+ ret
+zero_page_c_end:
+ .previous
Index: linux-2.6.9/arch/x86_64/lib/Makefile
===================================================================
--- linux-2.6.9.orig/arch/x86_64/lib/Makefile 2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/arch/x86_64/lib/Makefile 2004-12-21 10:19:37.000000000 -0800
@@ -7,7 +7,7 @@
obj-y := io.o
lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
- usercopy.o getuser.o putuser.o \
+ usercopy.o getuser.o putuser.o zero_page.o \
thunk.o clear_page.o copy_page.o bitstr.o bitops.o
lib-y += memcpy.o memmove.o memset.o copy_user.o
Index: linux-2.6.9/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/mmx.h 2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/mmx.h 2004-12-21 10:19:37.000000000 -0800
@@ -9,6 +9,7 @@
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
extern void mmx_clear_page(void *page);
+extern void mmx_zero_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.9/arch/x86_64/kernel/x8664_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/x86_64/kernel/x8664_ksyms.c 2004-12-17 14:40:11.000000000 -0800
+++ linux-2.6.9/arch/x86_64/kernel/x8664_ksyms.c 2004-12-21 10:19:37.000000000 -0800
@@ -110,6 +110,7 @@
EXPORT_SYMBOL(copy_page);
EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(zero_page);
EXPORT_SYMBOL(cpu_pda);
#ifdef CONFIG_SMP
* Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-21 19:56 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
@ 2004-12-21 19:57 ` Christoph Lameter
2005-01-01 2:22 ` Nick Piggin
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
` (2 subsequent siblings)
4 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:57 UTC (permalink / raw)
To: Nick Piggin
Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
linux-mm, linux-kernel
o Add page zeroing
o Add scrub daemon
o Add the ability to view the amount of zeroed memory in /proc/meminfo
Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c 2004-12-21 10:19:37.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c 2004-12-21 11:01:40.000000000 -0800
@@ -12,6 +12,7 @@
* Zone balancing, Kanoj Sarcar, SGI, Jan 2000
* Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
* (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Support for page zeroing, Christoph Lameter, SGI, Dec 2004
*/
#include <linux/config.h>
@@ -32,6 +33,7 @@
#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/nodemask.h>
+#include <linux/scrub.h>
#include <asm/tlbflush.h>
@@ -179,7 +181,7 @@
* -- wli
*/
-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
struct zone *zone, struct free_area *area, unsigned int order)
{
unsigned long page_idx, index, mask;
@@ -192,11 +194,10 @@
BUG();
index = page_idx >> (1 + order);
- zone->free_pages += 1 << order;
while (order < MAX_ORDER-1) {
struct page *buddy1, *buddy2;
- BUG_ON(area >= zone->free_area + MAX_ORDER);
+ BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
if (!__test_and_change_bit(index, area->map))
/*
* the buddy page is still allocated.
@@ -216,6 +217,7 @@
page_idx &= mask;
}
list_add(&(base + page_idx)->lru, &area->free_list);
+ return order;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -258,7 +260,7 @@
int ret = 0;
base = zone->zone_mem_map;
- area = zone->free_area + order;
+ area = zone->free_area[NOT_ZEROED] + order;
spin_lock_irqsave(&zone->lock, flags);
zone->all_unreclaimable = 0;
zone->pages_scanned = 0;
@@ -266,7 +268,10 @@
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
- __free_pages_bulk(page, base, zone, area, order);
+ zone->free_pages += 1 << order;
+ if (__free_pages_bulk(page, base, zone, area, order)
+ >= sysctl_scrub_start)
+ wakeup_kscrubd(zone);
ret++;
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -288,6 +293,21 @@
free_pages_bulk(page_zone(page), 1, &list, order);
}
+void end_zero_page(struct page *page)
+{
+ unsigned long flags;
+ int order = page->index;
+ struct zone * zone = page_zone(page);
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ zone->zero_pages += 1 << order;
+ __free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
#define MARK_USED(index, order, area) \
__change_bit((index) >> (1+(order)), (area)->map)
@@ -366,25 +386,46 @@
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+ list_del(&page->lru);
+ if (order != MAX_ORDER-1)
+ MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+ unsigned long flags;
+ struct page *page = NULL;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ if (!list_empty(&area->free_list)) {
+ page = list_entry(area->free_list.next, struct page, lru);
+
+ rmpage(page, zone, area, order);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
- unsigned int index;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ area = zone->free_area[zero] + current_order;
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- index = page - zone->zone_mem_map;
- if (current_order != MAX_ORDER-1)
- MARK_USED(index, current_order, area);
+ rmpage(page, zone, area, current_order);
zone->free_pages -= 1UL << order;
- return expand(zone, page, index, order, current_order, area);
+ if (zero)
+ zone->zero_pages -= 1UL << order;
+ return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
}
return NULL;
@@ -396,7 +437,7 @@
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list, int zero)
{
unsigned long flags;
int i;
@@ -405,7 +446,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, zero);
if (page == NULL)
break;
allocated++;
@@ -546,7 +587,9 @@
{
unsigned long flags;
struct page *page = NULL;
- int cold = !!(gfp_flags & __GFP_COLD);
+ int nr_pages = 1 << order;
+ int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+ int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;
if (order == 0) {
struct per_cpu_pages *pcp;
@@ -555,7 +598,7 @@
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list, zero);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -567,19 +610,30 @@
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+
+ page = __rmqueue(zone, order, zero);
+
+ /*
+ * If we failed to obtain a page of the requested type
+ * (zeroed or not zeroed) then we may still be able
+ * to obtain the other type of page.
+ */
+ if (!page) {
+ page = __rmqueue(zone, order, !zero);
+ zero = 0;
+ }
+
spin_unlock_irqrestore(&zone->lock, flags);
}
if (page != NULL) {
BUG_ON(bad_range(zone, page));
- mod_page_state_zone(zone, pgalloc, 1 << order);
- prep_new_page(page, order);
+ mod_page_state_zone(zone, pgalloc, nr_pages);
- if (gfp_flags & __GFP_ZERO) {
+ if ((gfp_flags & __GFP_ZERO) && !zero) {
#ifdef CONFIG_HIGHMEM
if (PageHighMem(page)) {
- int n = 1 << order;
+ int n = nr_pages;
while (n-- >0)
clear_highpage(page + n);
@@ -587,6 +641,7 @@
#endif
zero_page(page_address(page), order);
}
+ prep_new_page(page, order);
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
@@ -974,7 +1029,7 @@
}
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat)
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
{
struct zone *zones = pgdat->node_zones;
int i;
@@ -982,27 +1037,31 @@
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
*active += zones[i].nr_active;
*inactive += zones[i].nr_inactive;
*free += zones[i].free_pages;
+ *zero += zones[i].zero_pages;
}
}
void get_zone_counts(unsigned long *active,
- unsigned long *inactive, unsigned long *free)
+ unsigned long *inactive, unsigned long *free, unsigned long *zero)
{
struct pglist_data *pgdat;
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for_each_pgdat(pgdat) {
- unsigned long l, m, n;
- __get_zone_counts(&l, &m, &n, pgdat);
+ unsigned long l, m, n, o;
+ __get_zone_counts(&l, &m, &n, &o, pgdat);
*active += l;
*inactive += m;
*free += n;
+ *zero += o;
}
}
@@ -1039,6 +1098,7 @@
#define K(x) ((x) << (PAGE_SHIFT-10))
+const char *temp[3] = { "hot", "cold", "zero" };
/*
* Show free area list (used inside shift_scroll-lock stuff)
* We also calculate the percentage fragmentation. We do this by counting the
@@ -1051,6 +1111,7 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
struct zone *zone;
for_each_zone(zone) {
@@ -1071,10 +1132,10 @@
pageset = zone->pageset + cpu;
- for (temperature = 0; temperature < 2; temperature++)
+ for (temperature = 0; temperature < 3; temperature++)
printk("cpu %d %s: low %d, high %d, batch %d\n",
cpu,
- temperature ? "cold" : "hot",
+ temp[temperature],
pageset->pcp[temperature].low,
pageset->pcp[temperature].high,
pageset->pcp[temperature].batch);
@@ -1082,20 +1143,21 @@
}
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
printk("\nFree pages: %11ukB (%ukB HighMem)\n",
K(nr_free_pages()),
K(nr_free_highpages()));
printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
- "unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+ "unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
active,
inactive,
ps.nr_dirty,
ps.nr_writeback,
ps.nr_unstable,
nr_free_pages(),
+ zero,
ps.nr_slab,
ps.nr_mapped,
ps.nr_page_table_pages);
@@ -1146,7 +1208,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
nr = 0;
- list_for_each(elem, &zone->free_area[order].free_list)
+ list_for_each(elem, &zone->free_area[NOT_ZEROED][order].free_list)
++nr;
total += nr << order;
printk("%lu*%lukB ", nr, K(1UL) << order);
@@ -1470,14 +1532,18 @@
for (order = 0; ; order++) {
unsigned long bitmap_size;
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
if (order == MAX_ORDER-1) {
- zone->free_area[order].map = NULL;
+ zone->free_area[NOT_ZEROED][order].map = NULL;
+ zone->free_area[ZEROED][order].map = NULL;
break;
}
bitmap_size = pages_to_bitmap_size(order, size);
- zone->free_area[order].map =
+ zone->free_area[NOT_ZEROED][order].map =
+ (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+ zone->free_area[ZEROED][order].map =
(unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
}
}
@@ -1503,6 +1569,7 @@
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->kscrubd_wait);
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
@@ -1525,6 +1592,7 @@
spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->zero_pages = 0;
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
@@ -1558,6 +1626,13 @@
pcp->high = 2 * batch;
pcp->batch = 1 * batch;
INIT_LIST_HEAD(&pcp->list);
+
+ pcp = &zone->pageset[cpu].pcp[2]; /* zero pages */
+ pcp->count = 0;
+ pcp->low = 0;
+ pcp->high = 2 * batch;
+ pcp->batch = 1 * batch;
+ INIT_LIST_HEAD(&pcp->list);
}
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
zone_names[j], realsize, batch);
@@ -1687,7 +1762,7 @@
unsigned long nr_bufs = 0;
struct list_head *elem;
- list_for_each(elem, &(zone->free_area[order].free_list))
+ list_for_each(elem, &(zone->free_area[NOT_ZEROED][order].free_list))
++nr_bufs;
seq_printf(m, "%6lu ", nr_bufs);
}
Index: linux-2.6.9/include/linux/mmzone.h
===================================================================
--- linux-2.6.9.orig/include/linux/mmzone.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/mmzone.h 2004-12-21 11:01:15.000000000 -0800
@@ -51,7 +51,7 @@
};
struct per_cpu_pageset {
- struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
+ struct per_cpu_pages pcp[3]; /* 0: hot, 1: cold, 2: cold zeroed pages */
#ifdef CONFIG_NUMA
unsigned long numa_hit; /* allocated in intended node */
unsigned long numa_miss; /* allocated in non intended node */
@@ -107,10 +107,14 @@
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
+#define NOT_ZEROED 0
+#define ZEROED 1
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
+ unsigned long zero_pages;
/*
* protection[] is a pre-calculated number of extra pages that must be
* available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
* free areas of different sizes
*/
spinlock_t lock;
- struct free_area free_area[MAX_ORDER];
+ struct free_area free_area[2][MAX_ORDER];
ZONE_PADDING(_pad1_)
@@ -265,6 +269,9 @@
struct pglist_data *pgdat_next;
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
+
+ wait_queue_head_t kscrubd_wait;
+ struct task_struct *kscrubd;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
extern struct pglist_data *pgdat_list;
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat);
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
void get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free);
+ unsigned long *free, unsigned long *zero);
void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone);
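The core of the mmzone.h change is that each zone now carries two sets of free areas, indexed by NOT_ZEROED/ZEROED. The allocation policy built on top of it (in the page_alloc.c hunks above) can be sketched as the following userspace model; this is not kernel code, and all `toy_*` names are hypothetical:

```c
#include <assert.h>

#define NOT_ZEROED 0
#define ZEROED 1

/* Toy zone: just a count of free pages on each of the two free lists. */
struct toy_zone {
	unsigned long nr_free[2];
};

/* Model of __rmqueue(): take nr pages from the requested list, or fail. */
static int toy_rmqueue(struct toy_zone *z, int which, unsigned long nr)
{
	if (z->nr_free[which] < nr)
		return -1;
	z->nr_free[which] -= nr;
	return which;
}

/*
 * Model of the buffered_rmqueue() logic in this patch: prefer the
 * zeroed list only when __GFP_ZERO is set and enough zeroed pages
 * exist, otherwise fall back to the other list.
 */
static int toy_alloc(struct toy_zone *z, int want_zero, unsigned long nr)
{
	int zero = want_zero && z->nr_free[ZEROED] >= nr;
	int got = toy_rmqueue(z, zero ? ZEROED : NOT_ZEROED, nr);

	if (got < 0)
		got = toy_rmqueue(z, zero ? NOT_ZEROED : ZEROED, nr);
	return got;	/* which list satisfied the request, or -1 */
}
```

The fallback mirrors the "if we failed to obtain a zeroed and/or unzeroed page" branch: a __GFP_ZERO request degrades gracefully to an unzeroed page (which the caller then zeroes itself).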
Index: linux-2.6.9/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.9.orig/fs/proc/proc_misc.c 2004-12-17 14:40:15.000000000 -0800
+++ linux-2.6.9/fs/proc/proc_misc.c 2004-12-21 11:01:15.000000000 -0800
@@ -158,13 +158,14 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
unsigned long vmtot;
unsigned long committed;
unsigned long allowed;
struct vmalloc_info vmi;
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
/*
* display in kilobytes.
@@ -187,6 +188,7 @@
len = sprintf(page,
"MemTotal: %8lu kB\n"
"MemFree: %8lu kB\n"
+ "MemZero: %8lu kB\n"
"Buffers: %8lu kB\n"
"Cached: %8lu kB\n"
"SwapCached: %8lu kB\n"
@@ -210,6 +212,7 @@
"VmallocChunk: %8lu kB\n",
K(i.totalram),
K(i.freeram),
+ K(zero),
K(i.bufferram),
K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
K(total_swapcache_pages),
Index: linux-2.6.9/mm/readahead.c
===================================================================
--- linux-2.6.9.orig/mm/readahead.c 2004-10-18 14:53:11.000000000 -0700
+++ linux-2.6.9/mm/readahead.c 2004-12-21 11:01:15.000000000 -0800
@@ -570,7 +570,8 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
Index: linux-2.6.9/drivers/base/node.c
===================================================================
--- linux-2.6.9.orig/drivers/base/node.c 2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/drivers/base/node.c 2004-12-21 11:01:15.000000000 -0800
@@ -41,13 +41,15 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
si_meminfo_node(&i, nid);
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));
n = sprintf(buf, "\n"
"Node %d MemTotal: %8lu kB\n"
"Node %d MemFree: %8lu kB\n"
+ "Node %d MemZero: %8lu kB\n"
"Node %d MemUsed: %8lu kB\n"
"Node %d Active: %8lu kB\n"
"Node %d Inactive: %8lu kB\n"
@@ -57,6 +59,7 @@
"Node %d LowFree: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
+ nid, K(zero),
nid, K(i.totalram - i.freeram),
nid, K(active),
nid, K(inactive),
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-21 11:01:15.000000000 -0800
@@ -715,6 +715,7 @@
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
+#define PF_KSCRUBD 0x00800000 /* I am kscrubd */
#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.9/mm/Makefile
===================================================================
--- linux-2.6.9.orig/mm/Makefile 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/Makefile 2004-12-21 11:01:15.000000000 -0800
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o
+ vmalloc.o scrubd.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.9/mm/scrubd.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/mm/scrubd.c 2004-12-21 11:01:15.000000000 -0800
@@ -0,0 +1,148 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = MAX_ORDER; /* Off */
+unsigned int sysctl_scrub_stop = 2;	/* Minimum order of page to zero */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec(table, write, file, buffer, length, ppos);
+ if (sysctl_scrub_start < MAX_ORDER) {
+ struct zone *zone;
+
+ for_each_zone(zone)
+ wakeup_kscrubd(zone);
+ }
+ return 0;
+}
+
+
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+ int order;
+
+ for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+ struct free_area *area = z->free_area[NOT_ZEROED] + order;
+ if (!list_empty(&area->free_list)) {
+ struct page *page = scrubd_rmpage(z, area, order);
+ struct list_head *l;
+
+ if (!page)
+ continue;
+
+ page->index = order;
+
+ list_for_each(l, &zero_drivers) {
+ struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+ unsigned long size = PAGE_SIZE << order;
+
+ if (driver->start(page_address(page), size) == 0) {
+
+ unsigned ticks = (size*HZ)/driver->rate;
+ if (ticks) {
+ /* Wait the minimum time of the transfer */
+ current->state = TASK_INTERRUPTIBLE;
+ schedule_timeout(ticks);
+ }
+ /* Then keep on checking until transfer is complete */
+ while (!driver->check())
+ schedule();
+ goto out;
+ }
+ }
+
+ /* Unable to find a zeroing device that would
+ * deal with this page so just do it on our own.
+ * This will likely thrash the cpu caches.
+ */
+ cond_resched();
+ zero_page(page_address(page), order);
+out:
+ end_zero_page(page);
+ cond_resched();
+ return 1 << order;
+ }
+ }
+ return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+ int i;
+ unsigned long pages_zeroed;
+
+ if (system_state != SYSTEM_RUNNING)
+ return;
+
+ do {
+ pages_zeroed = 0;
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ pages_zeroed += zero_highest_order_page(zone);
+ }
+ } while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t*)p;
+ struct task_struct *tsk = current;
+ DEFINE_WAIT(wait);
+ cpumask_t cpumask;
+
+ daemonize("kscrubd%d", pgdat->node_id);
+ cpumask = node_to_cpumask(pgdat->node_id);
+ if (!cpus_empty(cpumask))
+ set_cpus_allowed(tsk, cpumask);
+
+ tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+ for ( ; ; ) {
+ if (current->flags & PF_FREEZE)
+ refrigerator(PF_FREEZE);
+ prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+ schedule();
+ finish_wait(&pgdat->kscrubd_wait, &wait);
+
+ scrub_pgdat(pgdat);
+ }
+ return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+ pg_data_t *pgdat;
+ for_each_pgdat(pgdat)
+ pgdat->kscrubd
+ = find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+ return 0;
+}
+
+module_init(kscrubd_init)
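The scan policy in zero_highest_order_page() above walks from the largest order down and never goes below sysctl_scrub_stop, so zeroing is aggregated into big blocks and order-0 pages are never scrubbed one at a time. A userspace sketch of just that order-selection step (hypothetical names, not part of the patch):

```c
#include <assert.h>

#define TOY_MAX_ORDER 11

/*
 * Model of the loop in zero_highest_order_page(): return the highest
 * order with free blocks, stopping at scrub_stop, or -1 when nothing
 * is left that is worth zeroing.
 */
static int pick_scrub_order(const unsigned long nr_free[TOY_MAX_ORDER],
			    unsigned int scrub_stop)
{
	int order;

	for (order = TOY_MAX_ORDER - 1; order >= (int)scrub_stop; order--)
		if (nr_free[order])
			return order;	/* zero one 2^order block here */
	return -1;
}
```

kscrubd then repeats this per zone until a pass zeroes nothing, which is the `while (pages_zeroed)` loop in scrub_pgdat().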
Index: linux-2.6.9/include/linux/scrub.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/scrub.h 2004-12-21 11:01:15.000000000 -0800
@@ -0,0 +1,48 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+ int (*start)(void *, unsigned length); /* Start bzero transfer */
+ int (*check)(void); /* Check if bzero is complete */
+ int rate; /* bzero rate in bytes/sec */
+ struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+ list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+ list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+ if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+ return;
+ wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
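The zero_driver contract above is small: start() kicks off a (possibly asynchronous) clear and may refuse the request, check() polls for completion, and the caller falls back to plain CPU zeroing when no driver accepts. A userspace model of that dispatch (hypothetical names; a real driver like the SN2 BTE is asynchronous, this one completes synchronously):

```c
#include <assert.h>
#include <string.h>

struct toy_zero_driver {
	int (*start)(void *p, unsigned long len);
	int (*check)(void);
};

static int sync_done;

static int sync_start(void *p, unsigned long len)
{
	if (len < 64)		/* model a hardware minimum transfer size */
		return -1;
	memset(p, 0, len);	/* a real driver would start DMA here */
	sync_done = 1;
	return 0;
}

static int sync_check(void) { return sync_done; }

static struct toy_zero_driver toy_driver = { sync_start, sync_check };

/* Model of the dispatch loop in zero_highest_order_page(). */
static void toy_zero(void *p, unsigned long len)
{
	if (toy_driver.start(p, len) == 0) {
		while (!toy_driver.check())
			;	/* the kernel would schedule() here */
	} else {
		memset(p, 0, len);	/* CPU fallback, thrashes caches */
	}
}
```

The rate field (not modeled here) only sizes the initial sleep before polling, so the daemon does not spin for the minimum duration of the transfer.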
Index: linux-2.6.9/kernel/sysctl.c
===================================================================
--- linux-2.6.9.orig/kernel/sysctl.c 2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c 2004-12-21 11:01:15.000000000 -0800
@@ -40,6 +40,7 @@
#include <linux/times.h>
#include <linux/limits.h>
#include <linux/dcache.h>
+#include <linux/scrub.h>
#include <linux/syscalls.h>
#include <asm/uaccess.h>
@@ -816,6 +817,24 @@
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = VM_SCRUB_START,
+ .procname = "scrub_start",
+ .data = &sysctl_scrub_start,
+ .maxlen = sizeof(sysctl_scrub_start),
+ .mode = 0644,
+ .proc_handler = &scrub_start_handler,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_STOP,
+ .procname = "scrub_stop",
+ .data = &sysctl_scrub_stop,
+ .maxlen = sizeof(sysctl_scrub_stop),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
{ .ctl_name = 0 }
};
Index: linux-2.6.9/include/linux/sysctl.h
===================================================================
--- linux-2.6.9.orig/include/linux/sysctl.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h 2004-12-21 11:01:15.000000000 -0800
@@ -168,6 +168,8 @@
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_SCRUB_START=30, /* order at which to start scrubd (MAX_ORDER = off) */
+ VM_SCRUB_STOP=31, /* minimum order of pages for scrubd to zero */
};
^ permalink raw reply [flat|nested] 89+ messages in thread
* Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-21 19:56 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
@ 2004-12-21 19:57 ` Christoph Lameter
2004-12-22 12:46 ` Robin Holt
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-24 18:31 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
4 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:57 UTC (permalink / raw)
To: Nick Piggin
Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
linux-mm, linux-kernel
o Use the Block Transfer Engine in the Altix SN2 SHub for background zeroing
Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c 2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/bte.c 2004-12-21 11:03:49.000000000 -0800
@@ -4,6 +4,8 @@
* for more details.
*
* Copyright (c) 2000-2003 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
*/
#include <linux/config.h>
@@ -20,6 +22,8 @@
#include <linux/bootmem.h>
#include <linux/string.h>
#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>
#include <asm/sn/bte.h>
@@ -30,7 +34,11 @@
/* two interfaces on two btes */
#define MAX_INTERFACES_TO_TRY 4
-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+DEFINE_PER_CPU(u64 *, bte_zero_notify);
+
+#define bte_zero_notify __get_cpu_var(bte_zero_notify)
+
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
{
nodepda_t *tmp_nodepda;
@@ -132,7 +140,6 @@
if (bte == NULL) {
continue;
}
-
if (spin_trylock(&bte->spinlock)) {
if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
(BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +164,7 @@
}
} while (1);
- if (notification == NULL) {
+ if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
/* User does not want to be notified. */
bte->most_rcnt_na = &bte->notify;
} else {
@@ -192,6 +199,8 @@
itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);
+ if (mode & BTE_NOTIFY_AND_GET_POINTER)
+ *(u64 volatile **)(notification) = &bte->notify;
spin_unlock_irqrestore(&bte->spinlock, irq_flags);
if (notification != NULL) {
@@ -449,5 +458,31 @@
mynodepda->bte_if[i].cleanup_active = 0;
mynodepda->bte_if[i].bh_error = 0;
}
+}
+
+static int bte_check_bzero(void)
+{
+ return *bte_zero_notify != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+ /* Check limitations.
+ 1. System must be running (weird things happen during bootup)
+ 2. Size >64KB. Smaller requests cause too much bte traffic
+ */
+ if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+ return EINVAL;
+
+ return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, &bte_zero_notify);
+}
+
+static struct zero_driver bte_bzero = {
+ .start = bte_start_bzero,
+ .check = bte_check_bzero,
+ .rate = 500000000 /* 500 MB /sec */
+};
+void sn_bte_bzero_init(void) {
+ register_zero_driver(&bte_bzero);
}
Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c 2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/setup.c 2004-12-21 11:02:35.000000000 -0800
@@ -243,6 +243,7 @@
int pxm;
int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
extern void sn_cpu_init(void);
+ extern void sn_bte_bzero_init(void);
/*
* If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
screen_info = sn_screen_info;
sn_timer_init();
+ sn_bte_bzero_init();
}
/**
Index: linux-2.6.9/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/sn/bte.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/sn/bte.h 2004-12-21 11:02:35.000000000 -0800
@@ -48,6 +48,8 @@
#define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
/* Use a reserved bit to let the caller specify a wait for any BTE */
#define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
/* Use the BTE on the node with the destination memory */
#define BTE_USE_DEST (BTE_WACQUIRE << 1)
/* Use any available BTE interface on any node for the transfer */
* Re: Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
@ 2004-12-22 12:46 ` Robin Holt
2004-12-22 19:56 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Robin Holt @ 2004-12-22 12:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
torvalds, linux-mm, linux-kernel
We still need to talk. This is a much smaller patch, which I like. The
problem I see in my 30-second review is that you are doing things per-cpu
when they really need to be done per-node. It is very likely that
there will be M-Bricks in the system (cranberry2 has one if you want
to test your code out there, or you can take any Altix and disable the
cpus on a C-Brick). With M-Bricks, you will essentially limit
yourself to one zero operation per controlling node instead of one
per node.
I think the easy answer is to not have the structure allocated
within bte_copy(), but rather within bte_start_zero and passed
in as the notification address.
Give me a call sometime today (Wed.); I am in the office from about
10:00 CDT until around 4:00 CDT. Maybe we can get this straightened
out quickly. If you are not calling from the office, email me with
other arrangements.
Thanks,
Robin
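[Editorial sketch of the suggestion above, as a userspace model with hypothetical names, not part of the patch: one notification word per node, allocated by the start routine, lets a single controlling cpu keep one transfer in flight on every node instead of funneling all nodes through its own per-cpu word.]

```c
#include <assert.h>

#define TOY_NODES 4

/* One outstanding-transfer flag per node (in the kernel this would be
 * the notification cacheline handed back by bte_zero()). */
static int notify_busy[TOY_NODES];

static int toy_start_zero(int node)
{
	if (notify_busy[node])
		return -1;	/* this node's engine is already busy */
	notify_busy[node] = 1;
	return 0;
}

static void toy_complete(int node) { notify_busy[node] = 0; }

/* How many transfers can one controlling cpu have in flight at once? */
static int toy_start_all(void)
{
	int node, started = 0;

	for (node = 0; node < TOY_NODES; node++)
		if (toy_start_zero(node) == 0)
			started++;
	return started;
}
```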
On Tue, Dec 21, 2004 at 11:57:57AM -0800, Christoph Lameter wrote:
> o Use the Block Transfer Engine in the Altix SN2 SHub for background zeroing
>
> Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
> ===================================================================
> --- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c 2004-12-17 14:40:10.000000000 -0800
> +++ linux-2.6.9/arch/ia64/sn/kernel/bte.c 2004-12-21 11:03:49.000000000 -0800
> @@ -4,6 +4,8 @@
> * for more details.
> *
> * Copyright (c) 2000-2003 Silicon Graphics, Inc. All Rights Reserved.
> + *
> + * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
> */
>
> #include <linux/config.h>
> @@ -20,6 +22,8 @@
> #include <linux/bootmem.h>
> #include <linux/string.h>
> #include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/scrub.h>
>
> #include <asm/sn/bte.h>
>
> @@ -30,7 +34,11 @@
> /* two interfaces on two btes */
> #define MAX_INTERFACES_TO_TRY 4
>
> -static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
> +DEFINE_PER_CPU(u64 *, bte_zero_notify);
> +
> +#define bte_zero_notify __get_cpu_var(bte_zero_notify)
> +
> +static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
> {
> nodepda_t *tmp_nodepda;
>
> @@ -132,7 +140,6 @@
> if (bte == NULL) {
> continue;
> }
> -
> if (spin_trylock(&bte->spinlock)) {
> if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
> (BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
> @@ -157,7 +164,7 @@
> }
> } while (1);
>
> - if (notification == NULL) {
> + if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
> /* User does not want to be notified. */
> bte->most_rcnt_na = &bte->notify;
> } else {
> @@ -192,6 +199,8 @@
>
> itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);
>
> + if (mode & BTE_NOTIFY_AND_GET_POINTER)
> + *(u64 volatile **)(notification) = &bte->notify;
> spin_unlock_irqrestore(&bte->spinlock, irq_flags);
>
> if (notification != NULL) {
> @@ -449,5 +458,31 @@
> mynodepda->bte_if[i].cleanup_active = 0;
> mynodepda->bte_if[i].bh_error = 0;
> }
> +}
> +
> +static int bte_check_bzero(void)
> +{
> + return *bte_zero_notify != BTE_WORD_BUSY;
> +}
> +
> +static int bte_start_bzero(void *p, unsigned long len)
> +{
> + /* Check limitations.
> + 1. System must be running (weird things happen during bootup)
> + 2. Size >64KB. Smaller requests cause too much bte traffic
> + */
> + if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
> + return EINVAL;
> +
> + return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, &bte_zero_notify);
> +}
> +
> +static struct zero_driver bte_bzero = {
> + .start = bte_start_bzero,
> + .check = bte_check_bzero,
> + .rate = 500000000 /* 500 MB /sec */
> +};
>
> +void sn_bte_bzero_init(void) {
> + register_zero_driver(&bte_bzero);
> }
> Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
> ===================================================================
> --- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c 2004-12-17 14:40:10.000000000 -0800
> +++ linux-2.6.9/arch/ia64/sn/kernel/setup.c 2004-12-21 11:02:35.000000000 -0800
> @@ -243,6 +243,7 @@
> int pxm;
> int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
> extern void sn_cpu_init(void);
> + extern void sn_bte_bzero_init(void);
>
> /*
> * If the generic code has enabled vga console support - lets
> @@ -333,6 +334,7 @@
> screen_info = sn_screen_info;
>
> sn_timer_init();
> + sn_bte_bzero_init();
> }
>
> /**
> Index: linux-2.6.9/include/asm-ia64/sn/bte.h
> ===================================================================
> --- linux-2.6.9.orig/include/asm-ia64/sn/bte.h 2004-12-17 14:40:16.000000000 -0800
> +++ linux-2.6.9/include/asm-ia64/sn/bte.h 2004-12-21 11:02:35.000000000 -0800
> @@ -48,6 +48,8 @@
> #define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
> /* Use a reserved bit to let the caller specify a wait for any BTE */
> #define BTE_WACQUIRE (0x4000)
> +/* Return the pointer to the notification cacheline to the user */
> +#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
> /* Use the BTE on the node with the destination memory */
> #define BTE_USE_DEST (BTE_WACQUIRE << 1)
> /* Use any available BTE interface on any node for the transfer */
* Re: Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing
2004-12-22 12:46 ` Robin Holt
@ 2004-12-22 19:56 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-22 19:56 UTC (permalink / raw)
To: Robin Holt
Cc: Nick Piggin, Luck, Tony, Adam Litke, linux-ia64, torvalds,
linux-mm, linux-kernel
I have done some additional tests on a 128 cpu SMP machine, and they show
that the bte slows things down by about 10-20% during memory benchmarking,
although it causes less load when the system is not under high stress.
So it's not always a win, and I may drop bte support in a future version. Can
we talk off list about this, since it is mostly an SGI thing?
On Wed, 22 Dec 2004, Robin Holt wrote:
> We still need to talk. This is a much smaller patch, which I like. The
> problem I see in my 30 second review is you are doing things per-cpu
> when they really need to be done per-node. It is very likely that
> there will be M-Bricks in the system (cranberry2 has one if you want
> to test your code out there or you can take any altix and disable the
> cpus on a C-Brick). With M-Bricks, you will essentially limit
> yourself to one zero operation per controlling node instead of one
> per node.
>
> I think the easy answer is to not have the structure allocated
> within bte_copy(), but rather within bte_start_zero and passed
> in as the notification address.
>
> Give me a call sometime today (Wed. I am in the office from about
> 10:00 CDT until around 4:00 CDT) Maybe we can get this straightened
> out quickly. If you are not calling from the office, email me with
> other arrangements.
>
> Thanks,
> Robin
>
> On Tue, Dec 21, 2004 at 11:57:57AM -0800, Christoph Lameter wrote:
> > o Use the Block Transfer Engine in the Altix SN2 SHub for background zeroing
> > /* Use any available BTE interface on any node for the transfer */
>
^ permalink raw reply [flat|nested] 89+ messages in thread
* Prezeroing V2 [0/3]: Why and When it works
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
` (2 preceding siblings ...)
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
@ 2004-12-23 19:29 ` Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
` (4 more replies)
2004-12-24 18:31 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
4 siblings, 5 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:29 UTC (permalink / raw)
Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
Change from V1 to V2:
o Add an explanation (and some benchmark results) of why and when this optimization
works and why other approaches have not worked.
o Instead of introducing zero_page(p, order), extend clear_page() to take a second argument
o Update all architectures to accept the second argument to clear_page()
o Extensively remove page alloc/clear_page combinations throughout all archs
o Whitespace / typo fixups
o SGI BTE zero driver update: Use node-specific variables instead of cpu-specific
ones since a cpu may be responsible for multiple nodes.
The patches increasing the page fault rate (introduction of atomic pte operations
and anticipatory prefaulting) do so by reducing locking overhead and are
therefore mainly of interest for applications running on SMP systems with a high
number of cpus. Single-thread performance shows only minor increases;
only the performance of multi-threaded applications increases significantly.
The most expensive operation in the page fault handler is (apart from SMP
locking overhead) the zeroing of the page. This zeroing means that all
cachelines of the faulted page (on Altix that means all 128 cachelines of
128 bytes each) must be loaded and later written back. This patch makes it
possible to avoid loading all cachelines if only a part of the cachelines of
that page is needed immediately after the fault.
Thus the patch will only be effective for sparsely accessed memory, which
is typical for anonymous memory and pte maps. Prezeroed pages will be used
for those purposes; unzeroed pages will be used as usual for all other
purposes.
Others have also thought that prezeroing could be a benefit and have tried
to provide zeroed pages to the page fault handler:
http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2
However, these attempts zero pages that are soon to be
accessed (and which may already have been accessed recently). Elements of
these pages are thus already in the cache. Approaches like that only
shift the processing around and do not yield performance benefits.
Prezeroing only makes sense for pages that are not currently needed and
that are not in the cpu caches. Pages that have recently been touched and
that will soon be touched again are better zeroed hot, since that zeroing
largely hits cachelines already in the cpu caches.
The patch makes prezeroing very effective by:
1. Aggregating zeroing operations to only apply to pages of higher order,
which allows many pages that will later become order 0 to be
zeroed in one go. For that purpose the existing clear_page function is
extended to take an additional argument specifying the order of
the page to be cleared.
2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.
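Point 1 can be pictured with a minimal userspace sketch of an order-aware page clear. This is illustrative only: the PAGE_SIZE value and the function name are assumptions for the sketch, not the kernel's actual implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumed page size for this sketch */

/* Clear 2^order contiguous pages in one call, as the extended
 * clear_page(page, order) does, instead of issuing 2^order
 * separate single-page clears later. */
static void clear_pages_order(void *addr, unsigned int order)
{
	memset(addr, 0, (size_t)PAGE_SIZE << order);
}
```

One linear pass over a higher-order block keeps the fill streaming, which is what most processors are optimized for.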
The result is a significant increase of the page fault performance even for
single threaded applications:
w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.146s 11.155s 11.030s 69584.896 69566.852
w/patch
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
1 1 1 0.014s 0.110s 0.012s 524292.194 517665.538
The performance can only be upheld if enough zeroed pages are available.
In heavy memory-intensive benchmarks the system could potentially
run out of zeroed pages, but the efficient page zeroing algorithm still
shows this to be a winner:
(8 way system with 6 GB RAM, no hardware zeroing support)
w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.146s 11.155s 11.030s 69584.896 69566.852
4 3 2 0.170s 14.909s 7.097s 52150.369 98643.687
4 3 4 0.181s 16.597s 5.079s 46869.167 135642.420
4 3 8 0.166s 23.239s 4.037s 33599.215 179791.120
w/patch
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
4 3 1 0.183s 2.750s 2.093s 268077.996 267952.890
4 3 2 0.185s 4.876s 2.097s 155344.562 263967.292
4 3 4 0.150s 6.617s 2.097s 116205.793 264774.080
4 3 8 0.186s 13.693s 3.054s 56659.819 221701.073
Note that prezeroing makes no sense if the application touches all cache
lines of an allocated page (for that reason prezeroing has no influence on
benchmarks like lmbench): the extensive caching of modern cpus means that
the zeroes written to a hot zeroed page are overwritten by the application
in the cpu cache, so those zeroes never make it to memory! The test program
used above only touches one 128-byte cache line of a 16k page (ia64).
Here is another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:
Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 1 1 1 0.01s 0.12s 0.01s 500813.853 497925.891
1 1 1 2 0.01s 0.11s 0.01s 493453.103 472877.725
1 1 1 4 0.02s 0.10s 0.01s 479351.658 471507.415
1 1 1 8 0.01s 0.13s 0.01s 424742.054 416725.013
1 1 1 16 0.05s 0.12s 0.01s 347715.359 336983.834
1 1 1 32 0.12s 0.13s 0.02s 258112.286 256246.731
1 1 1 64 0.24s 0.14s 0.03s 169896.381 168189.283
1 1 1 128 0.49s 0.14s 0.06s 102300.257 101674.435
The benefits of prezeroing become smaller the more cache lines of
a page are touched. Prezeroing can only be effective if memory is not
immediately touched after the anonymous page fault.
The patch is composed of 4 parts:
[1/4] Introduce __GFP_ZERO
Modifies the page allocator to be able to take the __GFP_ZERO flag
and returns zeroed memory on request. Modifies locations throughout
the linux sources that retrieve a page and then zero it to request
a zeroed page.
[2/4] Architecture specific clear_page updates
Adds second order argument to clear_page and updates all arches.
Note: The first two patches may be used alone if no zeroing engine is wanted.
[3/4] Page Zeroing
Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. scrubd is disabled by default but can be enabled
by writing an order number to /proc/sys/vm/scrub_start. If a page
of that order or higher is coalesced, then the scrub daemon will
start zeroing until all pages of order /proc/sys/vm/scrub_stop and
higher are zeroed, and then go back to sleep.
In an SMP environment the scrub daemon typically
runs on the most idle cpu. Thus a single-threaded application running
on one cpu may have another cpu zeroing pages for it. The scrub
daemon is hardly noticeable and usually finishes zeroing quickly since most
processors are optimized for linear memory filling.
[4/4] SGI Altix Block Transfer Engine Support
Implements a driver to shift the zeroing off the cpu into hardware.
With hardware support there will be minimal impact of zeroing
on the performance of the system.
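The scrubd wake/sleep behavior described under [3/4] can be sketched roughly as follows. All identifiers here are illustrative, not the patch's actual ones; the sketch only captures the start/stop watermark logic driven by scrub_start and scrub_stop.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER 11

/* Hypothetical per-zone bookkeeping: count of free pages per order
 * still awaiting zeroing (the patch's NOT_ZEROED pool). */
struct zone_counts {
	unsigned long not_zeroed[MAX_ORDER];
};

/* scrubd wakes when a page of order >= scrub_start is coalesced.
 * A scrub_start of 0 means the daemon is disabled (the default). */
static bool scrubd_should_wake(unsigned int coalesced_order,
                               unsigned int scrub_start)
{
	return scrub_start && coalesced_order >= scrub_start;
}

/* Once awake, scrubd keeps zeroing while any NOT_ZEROED pages of
 * order >= scrub_stop remain, then goes back to sleep. */
static bool scrubd_keep_running(const struct zone_counts *z,
                                unsigned int scrub_stop)
{
	for (unsigned int o = scrub_stop; o < MAX_ORDER; o++)
		if (z->not_zeroed[o])
			return true;
	return false;
}
```

The two thresholds give hysteresis: zeroing starts only once large blocks coalesce, but drains down to the lower watermark before the daemon sleeps again.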
* Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-23 19:33 ` Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
` (3 more replies)
2004-12-23 19:49 ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
` (3 subsequent siblings)
4 siblings, 4 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:33 UTC (permalink / raw)
To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
This patch introduces __GFP_ZERO as an additional gfp_mask bit that allows
zeroed pages to be requested from the page allocator.
o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set
o Replaces all explicit page zeroing after allocation with requests for
zeroed pages.
o Requires the arch updates to clear_page (patch 2/4) in order to function properly.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c 2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c 2004-12-22 17:23:43.000000000 -0800
@@ -575,6 +575,18 @@
BUG_ON(bad_range(zone, page));
mod_page_state_zone(zone, pgalloc, 1 << order);
prep_new_page(page, order);
+
+ if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+ if (PageHighMem(page)) {
+ int n = 1 << order;
+
+ while (n-- >0)
+ clear_highpage(page + n);
+ } else
+#endif
+ clear_page(page_address(page), order);
+ }
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
@@ -767,12 +779,9 @@
*/
BUG_ON(gfp_mask & __GFP_HIGHMEM);
- page = alloc_pages(gfp_mask, 0);
- if (page) {
- void *address = page_address(page);
- clear_page(address);
- return (unsigned long) address;
- }
+ page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+ if (page)
+ return (unsigned long) page_address(page);
return 0;
}
Index: linux-2.6.9/include/linux/gfp.h
===================================================================
--- linux-2.6.9.orig/include/linux/gfp.h 2004-10-18 14:53:44.000000000 -0700
+++ linux-2.6.9/include/linux/gfp.h 2004-12-22 17:23:43.000000000 -0800
@@ -37,6 +37,7 @@
#define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
#define __GFP_COMP 0x4000 /* Add compound page metadata */
+#define __GFP_ZERO 0x8000 /* Return zeroed page on success */
#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/memory.c 2004-12-22 17:23:43.000000000 -0800
@@ -1445,10 +1445,9 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
if (!page)
goto no_mem;
- clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.9/kernel/profile.c
===================================================================
--- linux-2.6.9.orig/kernel/profile.c 2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/kernel/profile.c 2004-12-22 17:23:43.000000000 -0800
@@ -326,17 +326,15 @@
node = cpu_to_node(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
return NOTIFY_BAD;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_free;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
}
break;
@@ -510,16 +508,14 @@
int node = cpu_to_node(cpu);
struct page *page;
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0]
= (struct profile_hit *)page_address(page);
}
Index: linux-2.6.9/mm/shmem.c
===================================================================
--- linux-2.6.9.orig/mm/shmem.c 2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/mm/shmem.c 2004-12-22 17:23:43.000000000 -0800
@@ -369,9 +369,8 @@
}
spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
if (page) {
- clear_highpage(page);
page->nr_swapped = 0;
}
spin_lock(&info->lock);
@@ -910,7 +909,7 @@
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp, &pvma, 0);
+ page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -926,7 +925,7 @@
shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
unsigned long idx)
{
- return alloc_page(gfp);
+ return alloc_page(gfp | __GFP_ZERO);
}
#endif
@@ -1135,7 +1134,6 @@
info->alloced++;
spin_unlock(&info->lock);
- clear_highpage(filepage);
flush_dcache_page(filepage);
SetPageUptodate(filepage);
}
Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c 2004-12-22 17:23:43.000000000 -0800
@@ -77,7 +77,6 @@
struct page *alloc_huge_page(void)
{
struct page *page;
- int i;
spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -88,8 +87,7 @@
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
- clear_highpage(&page[i]);
+ clear_page(page_address(page), HUGETLB_PAGE_ORDER);
return page;
}
Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -61,9 +61,7 @@
pgd_t *pgd = pgd_alloc_one_fast(mm);
if (unlikely(pgd == NULL)) {
- pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
- if (likely(pgd != NULL))
- clear_page(pgd);
+ pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
}
return pgd;
}
@@ -107,10 +105,8 @@
static inline pmd_t*
pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pmd != NULL))
- clear_page(pmd);
return pmd;
}
@@ -141,20 +137,16 @@
static inline struct page *
pte_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
- if (likely(pte != NULL))
- clear_page(page_address(pte));
return pte;
}
static inline pte_t *
pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pte != NULL))
- clear_page(pte);
return pte;
}
Index: linux-2.6.9/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.9.orig/arch/i386/mm/pgtable.c 2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/i386/mm/pgtable.c 2004-12-22 17:23:43.000000000 -0800
@@ -132,10 +132,7 @@
pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
- return pte;
+ return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
}
struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -143,12 +140,10 @@
struct page *pte;
#ifdef CONFIG_HIGHPTE
- pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
#else
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
#endif
- if (pte)
- clear_highpage(pte);
return pte;
}
Index: linux-2.6.9/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.9.orig/drivers/block/pktcdvd.c 2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/drivers/block/pktcdvd.c 2004-12-22 17:23:43.000000000 -0800
@@ -125,22 +125,19 @@
int i;
struct packet_data *pkt;
- pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
+ pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
if (!pkt)
goto no_pkt;
- memset(pkt, 0, sizeof(struct packet_data));
pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
if (!pkt->w_bio)
goto no_bio;
for (i = 0; i < PAGES_PER_PACKET; i++) {
- pkt->pages[i] = alloc_page(GFP_KERNEL);
+ pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
if (!pkt->pages[i])
goto no_page;
}
- for (i = 0; i < PAGES_PER_PACKET; i++)
- clear_page(page_address(pkt->pages[i]));
spin_lock_init(&pkt->lock);
Index: linux-2.6.9/arch/m68k/mm/motorola.c
===================================================================
--- linux-2.6.9.orig/arch/m68k/mm/motorola.c 2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/m68k/mm/motorola.c 2004-12-22 17:23:43.000000000 -0800
@@ -1,4 +1,4 @@
-/*
+/*
* linux/arch/m68k/motorola.c
*
* Routines specific to the Motorola MMU, originally from:
@@ -50,7 +50,7 @@
ptablep = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE);
- clear_page(ptablep);
+ clear_page(ptablep, 0);
__flush_page_to_ram(ptablep);
flush_tlb_kernel_page(ptablep);
nocache_page(ptablep);
@@ -90,7 +90,7 @@
if (((unsigned long)last_pgtable & ~PAGE_MASK) == 0) {
last_pgtable = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE);
- clear_page(last_pgtable);
+ clear_page(last_pgtable, 0);
__flush_page_to_ram(last_pgtable);
flush_tlb_kernel_page(last_pgtable);
nocache_page(last_pgtable);
Index: linux-2.6.9/include/asm-mips/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-mips/pgalloc.h 2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-mips/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -56,9 +56,7 @@
{
pte_t *pte;
- pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);
return pte;
}
Index: linux-2.6.9/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/alpha/mm/init.c 2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/arch/alpha/mm/init.c 2004-12-22 17:23:43.000000000 -0800
@@ -42,10 +42,9 @@
{
pgd_t *ret, *init;
- ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+ ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
init = pgd_offset(&init_mm, 0UL);
if (ret) {
- clear_page(ret);
#ifdef CONFIG_ALPHA_LARGE_VMALLOC
memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
pte_t *
pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
Index: linux-2.6.9/include/asm-parisc/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-parisc/pgalloc.h 2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/include/asm-parisc/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -120,18 +120,14 @@
static inline struct page *
pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
- struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
- if (likely(page != NULL))
- clear_page(page_address(page));
+ struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return page;
}
static inline pte_t *
pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (likely(pte != NULL))
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
Index: linux-2.6.9/arch/sh/mm/pg-sh4.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-sh4.c 2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-sh4.c 2004-12-22 17:23:43.000000000 -0800
@@ -34,7 +34,7 @@
{
__set_bit(PG_mapped, &page->flags);
if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0)
- clear_page(to);
+ clear_page(to, 0);
else {
pgprot_t pgprot = __pgprot(_PAGE_PRESENT |
_PAGE_RW | _PAGE_CACHABLE |
Index: linux-2.6.9/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc64/pgalloc.h 2004-10-18 14:55:28.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -73,10 +73,9 @@
struct page *page;
preempt_enable();
- page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+ page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (page) {
ret = (struct page *)page_address(page);
- clear_page(ret);
page->lru.prev = (void *) 2UL;
preempt_disable();
Index: linux-2.6.9/include/asm-sh/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh/pgalloc.h 2004-10-18 14:54:08.000000000 -0700
+++ linux-2.6.9/include/asm-sh/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -44,9 +44,7 @@
{
pte_t *pte;
- pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
return pte;
}
@@ -56,9 +54,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.9/include/asm-m32r/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-m32r/pgalloc.h 2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/include/asm-m32r/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -23,10 +23,7 @@
*/
static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
{
- pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
- if (pgd)
- clear_page(pgd);
+ pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
return pgd;
}
@@ -39,10 +36,7 @@
static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
return pte;
}
@@ -50,10 +44,8 @@
static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
unsigned long address)
{
- struct page *pte = alloc_page(GFP_KERNEL);
+ struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);
- if (pte)
- clear_page(page_address(pte));
return pte;
}
Index: linux-2.6.9/arch/um/kernel/mem.c
===================================================================
--- linux-2.6.9.orig/arch/um/kernel/mem.c 2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/arch/um/kernel/mem.c 2004-12-22 17:23:43.000000000 -0800
@@ -307,9 +307,7 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
@@ -317,9 +315,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_highpage(pte);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.9/arch/ppc64/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/ppc64/mm/init.c 2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/ppc64/mm/init.c 2004-12-22 17:23:43.000000000 -0800
@@ -761,7 +761,7 @@
void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
{
- clear_page(page);
+ clear_page(page, 0);
if (cur_cpu_spec->cpu_features & CPU_FTR_COHERENT_ICACHE)
return;
Index: linux-2.6.9/include/asm-sh64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh64/pgalloc.h 2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -112,9 +112,7 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);
return pte;
}
@@ -123,9 +121,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
@@ -150,9 +146,7 @@
static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
pmd_t *pmd;
- pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pmd)
- clear_page(pmd);
+ pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pmd;
}
Index: linux-2.6.9/include/asm-cris/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-cris/pgalloc.h 2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/include/asm-cris/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -24,18 +24,14 @@
extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.9/arch/ppc/mm/pgtable.c
===================================================================
--- linux-2.6.9.orig/arch/ppc/mm/pgtable.c 2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/ppc/mm/pgtable.c 2004-12-22 17:23:43.000000000 -0800
@@ -85,8 +85,7 @@
{
pgd_t *ret;
- if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
- clear_pages(ret, PGDIR_ORDER);
+ ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
return ret;
}
@@ -102,7 +101,7 @@
extern void *early_get_page(void);
if (mem_init_done) {
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
struct page *ptepage = virt_to_page(pte);
ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
}
} else
pte = (pte_t *)early_get_page();
- if (pte)
- clear_page(pte);
return pte;
}
Index: linux-2.6.9/arch/ppc/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/ppc/mm/init.c 2004-10-18 14:53:43.000000000 -0700
+++ linux-2.6.9/arch/ppc/mm/init.c 2004-12-22 17:23:43.000000000 -0800
@@ -595,7 +595,7 @@
}
void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
{
- clear_page(page);
+ clear_page(page, 0);
clear_bit(PG_arch_1, &pg->flags);
}
Index: linux-2.6.9/fs/afs/file.c
===================================================================
--- linux-2.6.9.orig/fs/afs/file.c 2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/fs/afs/file.c 2004-12-22 17:23:43.000000000 -0800
@@ -172,7 +172,7 @@
(size_t) PAGE_SIZE);
desc.buffer = kmap(page);
- clear_page(desc.buffer);
+ clear_page(desc.buffer, 0);
/* read the contents of the file from the server into the
* page */
Index: linux-2.6.9/include/asm-alpha/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-alpha/pgalloc.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-alpha/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -40,9 +40,7 @@
static inline pmd_t *
pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
- pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (ret)
- clear_page(ret);
+ pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return ret;
}
Index: linux-2.6.9/include/linux/highmem.h
===================================================================
--- linux-2.6.9.orig/include/linux/highmem.h 2004-10-18 14:54:54.000000000 -0700
+++ linux-2.6.9/include/linux/highmem.h 2004-12-22 17:23:43.000000000 -0800
@@ -47,7 +47,7 @@
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
- clear_page(kaddr);
+ clear_page(kaddr, 0);
kunmap_atomic(kaddr, KM_USER0);
}
Index: linux-2.6.9/arch/sh64/mm/ioremap.c
===================================================================
--- linux-2.6.9.orig/arch/sh64/mm/ioremap.c 2004-10-18 14:54:32.000000000 -0700
+++ linux-2.6.9/arch/sh64/mm/ioremap.c 2004-12-22 17:23:43.000000000 -0800
@@ -399,7 +399,7 @@
if (pte_none(*ptep) || !pte_present(*ptep))
return;
- clear_page((void *)ptep);
+ clear_page((void *)ptep, 0);
pte_clear(ptep);
}
Index: linux-2.6.9/include/asm-m68k/motorola_pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-m68k/motorola_pgalloc.h 2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/include/asm-m68k/motorola_pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -12,9 +12,8 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
- clear_page(pte);
__flush_page_to_ram(pte);
flush_tlb_kernel_page(pte);
nocache_page(pte);
@@ -31,7 +30,7 @@
static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
- struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
pte_t *pte;
if(!page)
@@ -39,7 +38,6 @@
pte = kmap(page);
if (pte) {
- clear_page(pte);
__flush_page_to_ram(pte);
flush_tlb_kernel_page(pte);
nocache_page(pte);
Index: linux-2.6.9/arch/sh/mm/pg-sh7705.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-sh7705.c 2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/arch/sh/mm/pg-sh7705.c 2004-12-22 17:23:43.000000000 -0800
@@ -78,13 +78,13 @@
__set_bit(PG_mapped, &page->flags);
if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0) {
- clear_page(to);
+ clear_page(to, 0);
__flush_wback_region(to, PAGE_SIZE);
} else {
__flush_purge_virtual_region(to,
(void *)(address & 0xfffff000),
PAGE_SIZE);
- clear_page(to);
+ clear_page(to, 0);
__flush_wback_region(to, PAGE_SIZE);
}
}
Index: linux-2.6.9/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/sparc64/mm/init.c 2004-12-22 16:48:15.000000000 -0800
+++ linux-2.6.9/arch/sparc64/mm/init.c 2004-12-22 17:23:43.000000000 -0800
@@ -1687,13 +1687,12 @@
* Set up the zero page, mark it reserved, so that page count
* is not manipulated when freeing the page from user ptes.
*/
- mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+ mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
if (mem_map_zero == NULL) {
prom_printf("paging_init: Cannot alloc zero page.\n");
prom_halt();
}
SetPageReserved(mem_map_zero);
- clear_page(page_address(mem_map_zero));
codepages = (((unsigned long) _etext) - ((unsigned long) _start));
codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.9/include/asm-arm/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-arm/pgalloc.h 2004-10-18 14:55:27.000000000 -0700
+++ linux-2.6.9/include/asm-arm/pgalloc.h 2004-12-22 17:23:43.000000000 -0800
@@ -50,9 +50,8 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
- clear_page(pte);
clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
pte += PTRS_PER_PTE;
}
@@ -65,10 +64,9 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
if (pte) {
void *page = page_address(pte);
- clear_page(page);
clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
}
^ permalink raw reply [flat|nested] 89+ messages in thread
* Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
@ 2004-12-23 19:33 ` Christoph Lameter
2004-12-24 8:33 ` Pavel Machek
` (2 more replies)
2004-12-23 19:34 ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
` (2 subsequent siblings)
3 siblings, 3 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:33 UTC (permalink / raw)
To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
o Extend clear_page to take an order parameter for all architectures.
Known to work:
ia64
i386
Trivial modification expected to simply work:
arm
cris
h8300
m68k
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um
Modification made but it would be good to have some feedback from the arch maintainers:
x86_64
s390
alpha
sparc64
sh
mips
m32r
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.9/include/asm-ia64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/page.h 2004-10-18 14:53:21.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -56,7 +56,7 @@
# ifdef __KERNEL__
# define STRICT_MM_TYPECHECKS
-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
extern void copy_page (void *to, void *from);
/*
@@ -65,7 +65,7 @@
*/
#define clear_user_page(addr, vaddr, page) \
do { \
- clear_page(addr); \
+ clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
Index: linux-2.6.9/include/asm-i386/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/page.h 2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-i386/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -18,7 +18,7 @@
#include <asm/mmx.h>
-#define clear_page(page) mmx_clear_page((void *)(page))
+#define clear_page(page, order) mmx_clear_page((void *)(page),order)
#define copy_page(to,from) mmx_copy_page(to,from)
#else
@@ -28,12 +28,12 @@
* Maybe the K6-III ?
*/
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#endif
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.9/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/page.h 2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-x86_64/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -32,10 +32,10 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-void clear_page(void *);
+void clear_page(void *, int);
void copy_page(void *, void *);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.9/include/asm-sparc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc/page.h 2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-sparc/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -28,10 +28,10 @@
#ifndef __ASSEMBLY__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
sparc_flush_page_to_ram(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.9/include/asm-s390/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-s390/page.h 2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/include/asm-s390/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -22,12 +22,12 @@
#ifndef __s390x__
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
register_pair rp;
rp.subreg.even = (unsigned long) page;
- rp.subreg.odd = (unsigned long) 4096;
+ rp.subreg.odd = (unsigned long) 4096 << order;
asm volatile (" slr 1,1\n"
" mvcl %0,0"
: "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@
#else /* __s390x__ */
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
- asm volatile (" lgr 2,%0\n"
+ int nr = 1 << order;
+
+ while (nr-- >0) {
+ asm volatile (" lgr 2,%0\n"
" lghi 3,4096\n"
" slgr 1,1\n"
" mvcl 2,0"
: : "a" ((void *) (page))
: "memory", "cc", "1", "2", "3" );
+ page += PAGE_SIZE;
+ }
}
static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@
#endif /* __s390x__ */
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/* Pure 2^n version of get_order */
Index: linux-2.6.9/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.9.orig/arch/i386/lib/mmx.c 2004-10-18 14:54:23.000000000 -0700
+++ linux-2.6.9/arch/i386/lib/mmx.c 2004-12-23 07:44:14.000000000 -0800
@@ -128,7 +128,7 @@
* other MMX using processors do not.
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -138,7 +138,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/64;i++)
+ for(i=0;i<((4096/64) << order);i++)
{
__asm__ __volatile__ (
" movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
* Generic MMX implementation without K7 specific streaming
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -267,7 +267,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/128;i++)
+ for(i=0;i<((4096/128) << order);i++)
{
__asm__ __volatile__ (
" movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
* Favour MMX for page clear and copy.
*/
-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
{
int d0, d1;
__asm__ __volatile__( \
"cld\n\t" \
"rep ; stosl" \
: "=&c" (d0), "=&D" (d1)
- :"a" (0),"1" (page),"0" (1024)
+ :"a" (0),"1" (page),"0" (1024 << order)
:"memory");
}
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
{
if(unlikely(in_interrupt()))
- slow_zero_page(page);
+ slow_clear_page(page, order);
else
- fast_clear_page(page);
+ fast_clear_page(page, order);
}
static void slow_copy_page(void *to, void *from)
Index: linux-2.6.9/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/mmx.h 2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/mmx.h 2004-12-23 07:44:14.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.9/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/ia64/lib/clear_page.S 2004-10-18 14:53:10.000000000 -0700
+++ linux-2.6.9/arch/ia64/lib/clear_page.S 2004-12-23 07:44:14.000000000 -0800
@@ -7,6 +7,7 @@
* 1/06/01 davidm Tuned for Itanium.
* 2/12/02 kchen Tuned for both Itanium and McKinley
* 3/08/02 davidm Some more tweaking
+ * 12/10/04 clameter Make it work on pages of order size
*/
#include <linux/config.h>
@@ -29,27 +30,33 @@
#define dst4 r11
#define dst_last r31
+#define totsize r14
GLOBAL_ENTRY(clear_page)
.prologue
- .regstk 1,0,0,0
- mov r16 = PAGE_SIZE/L3_LINE_SIZE-1 // main loop count, -1=repeat/until
+ .regstk 2,0,0,0
+ mov r16 = PAGE_SIZE/L3_LINE_SIZE // main loop count
+ mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+ ;;
.body
+ adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
- adds dst1 = 16, in0
adds dst2 = 32, in0
+ shl r16 = r16, in1
+ shl totsize = totsize, in1
;;
.fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless
br.cloop.sptk.few .fetch
+ add r16 = -1,r16
+ add dst_last = totsize, dst_fetch
+ adds dst4 = 64, in0
;;
- addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
mov ar.lc = r16 // one L3 line per iteration
- adds dst4 = 64, in0
+ adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
;;
#ifdef CONFIG_ITANIUM
// Optimized for Itanium
Index: linux-2.6.9/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/x86_64/lib/clear_page.S 2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/arch/x86_64/lib/clear_page.S 2004-12-23 07:44:14.000000000 -0800
@@ -7,6 +7,9 @@
clear_page:
xorl %eax,%eax
movl $4096/64,%ecx
+	xchgl %esi,%ecx		# shift count must be in %cl
+	shll %cl,%esi
+	xchgl %esi,%ecx
.p2align 4
.Lloop:
decl %ecx
@@ -42,6 +43,9 @@
.section .altinstr_replacement,"ax"
clear_page_c:
movl $4096/8,%ecx
+	xchgl %esi,%ecx		# shift count must be in %cl
+	shll %cl,%esi
+	xchgl %esi,%ecx
xorl %eax,%eax
rep
stosq
Index: linux-2.6.9/include/asm-sh/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh/page.h 2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-sh/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -36,12 +36,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
extern void (*copy_page)(void *to, void *from);
extern void clear_page_slow(void *to);
extern void copy_page_slow(void *to, void *from);
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- >0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
#if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
struct page;
extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
extern void __clear_user_page(void *to, void *orig_to);
extern void __copy_user_page(void *to, void *from, void *orig_to);
#elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#elif defined(CONFIG_CPU_SH4)
struct page;
Index: linux-2.6.9/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/mmx.h 2004-10-18 14:54:27.000000000 -0700
+++ linux-2.6.9/include/asm-i386/mmx.h 2004-12-23 07:44:14.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.9/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/alpha/lib/clear_page.S 2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/arch/alpha/lib/clear_page.S 2004-12-23 07:44:14.000000000 -0800
@@ -6,11 +6,10 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
-
lda $0,128
nop
unop
@@ -36,4 +35,4 @@
unop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.9/include/asm-sh64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh64/page.h 2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -50,12 +50,20 @@
extern void sh64_page_clear(void *page);
extern void sh64_page_copy(void *from, void *to);
-#define clear_page(page) sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+	int nr = 1 << order;
+	while (nr-- > 0) {
+		sh64_page_clear(page);
+		page += PAGE_SIZE;
+	}
+}
+
#define copy_page(to,from) sh64_page_copy(from, to)
#if defined(CONFIG_DCACHE_DISABLED)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#else
Index: linux-2.6.9/include/asm-h8300/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-h8300/page.h 2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/include/asm-h8300/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.9/include/asm-arm/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-arm/page.h 2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-arm/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -128,7 +128,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
extern void copy_page(void *to, const void *from);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.9/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ppc64/page.h 2004-12-22 16:48:20.000000000 -0800
+++ linux-2.6.9/include/asm-ppc64/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -102,12 +102,12 @@
#define REGION_MASK (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
#define REGION_STRIDE (1UL << REGION_SHIFT)
-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
{
unsigned long lines, line_size;
line_size = systemcfg->dCacheL1LineSize;
- lines = naca->dCacheL1LinesPerPage;
+ lines = naca->dCacheL1LinesPerPage << order;
__asm__ __volatile__(
"mtctr %1 # clear_page\n\
Index: linux-2.6.9/include/asm-m32r/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-m32r/page.h 2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-m32r/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -11,10 +11,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+
extern void copy_page(void *to, void *from);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.9/include/asm-alpha/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-alpha/page.h 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/include/asm-alpha/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -15,8 +15,20 @@
#define STRICT_MM_TYPECHECKS
-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr--)
+ {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
Index: linux-2.6.9/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.9.orig/arch/mips/mm/pg-sb1.c 2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/arch/mips/mm/pg-sb1.c 2004-12-23 07:44:14.000000000 -0800
@@ -42,7 +42,7 @@
#ifdef CONFIG_SIBYTE_DMA_PAGEOPS
static inline void clear_page_cpu(void *page)
#else
-void clear_page(void *page)
+void _clear_page(void *page)
#endif
{
unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
}
-void clear_page(void *page)
+void _clear_page(void *page)
{
int cpu = smp_processor_id();
/* if the page is above Kseg0, use old way */
if (KSEGX(page) != CAC_BASE)
return clear_page_cpu(page);
-
page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@
#endif
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(copy_page);
Index: linux-2.6.9/include/asm-m68k/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-m68k/page.h 2004-10-18 14:55:36.000000000 -0700
+++ linux-2.6.9/include/asm-m68k/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -50,7 +50,7 @@
);
}
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
unsigned long tmp;
unsigned long *sp = page;
@@ -69,16 +69,16 @@
"dbra %1,1b\n\t"
: "=a" (sp), "=d" (tmp)
: "a" (page), "0" (sp),
- "1" ((PAGE_SIZE - 16) / 16 - 1));
+ "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
}
#else
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
#endif
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.9/include/asm-mips/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-mips/page.h 2004-12-22 16:48:19.000000000 -0800
+++ linux-2.6.9/include/asm-mips/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -39,7 +39,18 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- >0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
extern void copy_page(void * to, void * from);
extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
{
extern void (*flush_data_cache_page)(unsigned long addr);
- clear_page(addr);
+ clear_page(addr, 0);
if (pages_do_alias((unsigned long) addr, vaddr))
flush_data_cache_page((unsigned long)addr);
}
Index: linux-2.6.9/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-m68knommu/page.h 2004-10-18 14:54:07.000000000 -0700
+++ linux-2.6.9/include/asm-m68knommu/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.9/include/asm-cris/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-cris/page.h 2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/include/asm-cris/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -15,10 +15,10 @@
#ifdef __KERNEL__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.9/include/asm-v850/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-v850/page.h 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/include/asm-v850/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -37,11 +37,11 @@
#define STRICT_MM_TYPECHECKS
-#define clear_page(page) memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset ((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to, from) memcpy ((void *)(to), (void *)from, PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.9/include/asm-parisc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-parisc/page.h 2004-10-18 14:53:43.000000000 -0700
+++ linux-2.6.9/include/asm-parisc/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -13,7 +13,7 @@
#include <asm/types.h>
#include <asm/cache.h>
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) copy_user_page_asm((void *)(to), (void *)(from))
struct page;
Index: linux-2.6.9/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.9.orig/arch/arm/mm/copypage-v6.c 2004-12-23 07:44:04.000000000 -0800
+++ linux-2.6.9/arch/arm/mm/copypage-v6.c 2004-12-23 07:44:14.000000000 -0800
@@ -47,7 +47,7 @@
*/
void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
{
- clear_page(kaddr);
+ _clear_page(kaddr);
}
/*
@@ -116,7 +116,7 @@
set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
flush_tlb_kernel_page(to);
- clear_page((void *)to);
+ _clear_page((void *)to);
spin_unlock(&v6_lock);
}
Index: linux-2.6.9/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.9.orig/arch/m32r/mm/page.S 2004-10-18 14:54:31.000000000 -0700
+++ linux-2.6.9/arch/m32r/mm/page.S 2004-12-23 07:44:14.000000000 -0800
@@ -51,7 +51,7 @@
jmp r14
.text
- .global clear_page
+ .global _clear_page
/*
* clear_page (to)
*
@@ -60,7 +60,7 @@
* 16 * 256
*/
.align 4
-clear_page:
+_clear_page:
ldi r2, #255
ldi r4, #0
ld r3, @r0 /* cache line allocate */
Index: linux-2.6.9/include/asm-ppc/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-ppc/page.h 2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/include/asm-ppc/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -85,7 +85,7 @@
struct page;
extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define clear_page clear_pages
extern void copy_page(void *to, void *from);
extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.9/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/alpha/kernel/alpha_ksyms.c 2004-12-22 16:48:13.000000000 -0800
+++ linux-2.6.9/arch/alpha/kernel/alpha_ksyms.c 2004-12-23 07:44:14.000000000 -0800
@@ -88,7 +88,7 @@
EXPORT_SYMBOL(__memsetw);
EXPORT_SYMBOL(__constant_c_memset);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(__direct_map_base);
EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.9/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.9.orig/arch/alpha/lib/ev6-clear_page.S 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/arch/alpha/lib/ev6-clear_page.S 2004-12-23 07:44:14.000000000 -0800
@@ -6,9 +6,9 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
lda $0,128
@@ -51,4 +51,4 @@
nop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.9/arch/sh/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/init.c 2004-10-18 14:54:55.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/init.c 2004-12-23 07:44:14.000000000 -0800
@@ -57,7 +57,7 @@
#endif
void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);
void show_mem(void)
{
@@ -255,7 +255,7 @@
* later in the boot process if a better method is available.
*/
copy_page = copy_page_slow;
- clear_page = clear_page_slow;
+ _clear_page = clear_page_slow;
/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.9/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-dma.c 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-dma.c 2004-12-23 07:44:14.000000000 -0800
@@ -78,7 +78,7 @@
return ret;
copy_page = copy_page_dma;
- clear_page = clear_page_dma;
+ _clear_page = clear_page_dma;
return ret;
}
Index: linux-2.6.9/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/pg-nommu.c 2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/pg-nommu.c 2004-12-23 07:44:14.000000000 -0800
@@ -27,7 +27,7 @@
static int __init pg_nommu_init(void)
{
copy_page = copy_page_nommu;
- clear_page = clear_page_nommu;
+ _clear_page = clear_page_nommu;
return 0;
}
Index: linux-2.6.9/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.9.orig/arch/mips/mm/pg-r4k.c 2004-12-22 16:48:14.000000000 -0800
+++ linux-2.6.9/arch/mips/mm/pg-r4k.c 2004-12-23 07:44:14.000000000 -0800
@@ -39,9 +39,9 @@
static unsigned int clear_page_array[0x130 / 4];
-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
/*
* Maximum sizes:
Index: linux-2.6.9/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.9.orig/arch/m32r/kernel/m32r_ksyms.c 2004-10-18 14:53:45.000000000 -0700
+++ linux-2.6.9/arch/m32r/kernel/m32r_ksyms.c 2004-12-23 07:44:14.000000000 -0800
@@ -102,7 +102,7 @@
EXPORT_SYMBOL(memcmp);
EXPORT_SYMBOL(memscan);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(strcat);
EXPORT_SYMBOL(strchr);
Index: linux-2.6.9/include/asm-arm26/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-arm26/page.h 2004-10-18 14:54:39.000000000 -0700
+++ linux-2.6.9/include/asm-arm26/page.h 2004-12-23 07:44:14.000000000 -0800
@@ -25,7 +25,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
#define copy_page(to, from) __copy_user_page(to, from, 0);
#undef STRICT_MM_TYPECHECKS
* Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
@ 2004-12-23 19:34 ` Christoph Lameter
2004-12-23 19:35 ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
2004-12-23 20:08 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
3 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:34 UTC (permalink / raw)
To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
o Add page zeroing
o Add scrub daemon
o Add ability to view the number of zeroed pages in /proc/meminfo
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.9/mm/page_alloc.c
===================================================================
--- linux-2.6.9.orig/mm/page_alloc.c 2004-12-22 13:31:02.000000000 -0800
+++ linux-2.6.9/mm/page_alloc.c 2004-12-22 14:24:56.000000000 -0800
@@ -12,6 +12,7 @@
* Zone balancing, Kanoj Sarcar, SGI, Jan 2000
* Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
* (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Support for page zeroing, Christoph Lameter, SGI, Dec 2004
*/
#include <linux/config.h>
@@ -32,6 +33,7 @@
#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/nodemask.h>
+#include <linux/scrub.h>
#include <asm/tlbflush.h>
@@ -179,7 +181,7 @@
* -- wli
*/
-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
struct zone *zone, struct free_area *area, unsigned int order)
{
unsigned long page_idx, index, mask;
@@ -192,11 +194,10 @@
BUG();
index = page_idx >> (1 + order);
- zone->free_pages += 1 << order;
while (order < MAX_ORDER-1) {
struct page *buddy1, *buddy2;
- BUG_ON(area >= zone->free_area + MAX_ORDER);
+ BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
if (!__test_and_change_bit(index, area->map))
/*
* the buddy page is still allocated.
@@ -216,6 +217,7 @@
page_idx &= mask;
}
list_add(&(base + page_idx)->lru, &area->free_list);
+ return order;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -258,7 +260,7 @@
int ret = 0;
base = zone->zone_mem_map;
- area = zone->free_area + order;
+ area = zone->free_area[NOT_ZEROED] + order;
spin_lock_irqsave(&zone->lock, flags);
zone->all_unreclaimable = 0;
zone->pages_scanned = 0;
@@ -266,7 +268,10 @@
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
- __free_pages_bulk(page, base, zone, area, order);
+ zone->free_pages += 1 << order;
+ if (__free_pages_bulk(page, base, zone, area, order)
+ >= sysctl_scrub_start)
+ wakeup_kscrubd(zone);
ret++;
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -288,6 +293,21 @@
free_pages_bulk(page_zone(page), 1, &list, order);
}
+void end_zero_page(struct page *page)
+{
+ unsigned long flags;
+ int order = page->index;
+ struct zone * zone = page_zone(page);
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ zone->zero_pages += 1 << order;
+ __free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
#define MARK_USED(index, order, area) \
__change_bit((index) >> (1+(order)), (area)->map)
@@ -366,25 +386,46 @@
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+ list_del(&page->lru);
+ if (order != MAX_ORDER-1)
+ MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+ unsigned long flags;
+ struct page *page = NULL;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ if (!list_empty(&area->free_list)) {
+ page = list_entry(area->free_list.next, struct page, lru);
+
+ rmpage(page, zone, area, order);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
- unsigned int index;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ area = zone->free_area[zero] + current_order;
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- index = page - zone->zone_mem_map;
- if (current_order != MAX_ORDER-1)
- MARK_USED(index, current_order, area);
+ rmpage(page, zone, area, current_order);
zone->free_pages -= 1UL << order;
- return expand(zone, page, index, order, current_order, area);
+ if (zero)
+ zone->zero_pages -= 1UL << order;
+ return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
}
return NULL;
@@ -396,7 +437,7 @@
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list, int zero)
{
unsigned long flags;
int i;
@@ -405,7 +446,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, zero);
if (page == NULL)
break;
allocated++;
@@ -546,7 +587,9 @@
{
unsigned long flags;
struct page *page = NULL;
- int cold = !!(gfp_flags & __GFP_COLD);
+ int nr_pages = 1 << order;
+ int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+ int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;
if (order == 0) {
struct per_cpu_pages *pcp;
@@ -555,7 +598,7 @@
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list, zero);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -567,19 +610,30 @@
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+
+ page = __rmqueue(zone, order, zero);
+
+ /*
+ * If we failed to obtain a zero and/or unzeroed page
+ * then we may still be able to obtain the other
+ * type of page.
+ */
+ if (!page) {
+ page = __rmqueue(zone, order, !zero);
+ zero = 0;
+ }
+
spin_unlock_irqrestore(&zone->lock, flags);
}
if (page != NULL) {
BUG_ON(bad_range(zone, page));
- mod_page_state_zone(zone, pgalloc, 1 << order);
- prep_new_page(page, order);
+ mod_page_state_zone(zone, pgalloc, nr_pages);
- if (gfp_flags & __GFP_ZERO) {
+ if ((gfp_flags & __GFP_ZERO) && !zero) {
#ifdef CONFIG_HIGHMEM
if (PageHighMem(page)) {
- int n = 1 << order;
+ int n = nr_pages;
while (n-- >0)
clear_highpage(page + n);
@@ -587,6 +641,7 @@
#endif
clear_page(page_address(page), order);
}
+ prep_new_page(page, order);
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
@@ -974,7 +1029,7 @@
}
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat)
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
{
struct zone *zones = pgdat->node_zones;
int i;
@@ -982,27 +1037,31 @@
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
*active += zones[i].nr_active;
*inactive += zones[i].nr_inactive;
*free += zones[i].free_pages;
+ *zero += zones[i].zero_pages;
}
}
void get_zone_counts(unsigned long *active,
- unsigned long *inactive, unsigned long *free)
+ unsigned long *inactive, unsigned long *free, unsigned long *zero)
{
struct pglist_data *pgdat;
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for_each_pgdat(pgdat) {
- unsigned long l, m, n;
- __get_zone_counts(&l, &m, &n, pgdat);
+ unsigned long l, m, n, o;
+ __get_zone_counts(&l, &m, &n, &o, pgdat);
*active += l;
*inactive += m;
*free += n;
+ *zero += o;
}
}
@@ -1039,6 +1098,7 @@
#define K(x) ((x) << (PAGE_SHIFT-10))
+const char *temp[3] = { "hot", "cold", "zero" };
/*
* Show free area list (used inside shift_scroll-lock stuff)
* We also calculate the percentage fragmentation. We do this by counting the
@@ -1051,6 +1111,7 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
struct zone *zone;
for_each_zone(zone) {
@@ -1071,10 +1132,10 @@
pageset = zone->pageset + cpu;
- for (temperature = 0; temperature < 2; temperature++)
+ for (temperature = 0; temperature < 3; temperature++)
printk("cpu %d %s: low %d, high %d, batch %d\n",
cpu,
- temperature ? "cold" : "hot",
+ temp[temperature],
pageset->pcp[temperature].low,
pageset->pcp[temperature].high,
pageset->pcp[temperature].batch);
@@ -1082,20 +1143,21 @@
}
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
printk("\nFree pages: %11ukB (%ukB HighMem)\n",
K(nr_free_pages()),
K(nr_free_highpages()));
printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
- "unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+ "unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
active,
inactive,
ps.nr_dirty,
ps.nr_writeback,
ps.nr_unstable,
nr_free_pages(),
+ zero,
ps.nr_slab,
ps.nr_mapped,
ps.nr_page_table_pages);
@@ -1146,7 +1208,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
nr = 0;
- list_for_each(elem, &zone->free_area[order].free_list)
+ list_for_each(elem, &zone->free_area[NOT_ZEROED][order].free_list)
++nr;
total += nr << order;
printk("%lu*%lukB ", nr, K(1UL) << order);
@@ -1470,14 +1532,18 @@
for (order = 0; ; order++) {
unsigned long bitmap_size;
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
if (order == MAX_ORDER-1) {
- zone->free_area[order].map = NULL;
+ zone->free_area[NOT_ZEROED][order].map = NULL;
+ zone->free_area[ZEROED][order].map = NULL;
break;
}
bitmap_size = pages_to_bitmap_size(order, size);
- zone->free_area[order].map =
+ zone->free_area[NOT_ZEROED][order].map =
+ (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+ zone->free_area[ZEROED][order].map =
(unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
}
}
@@ -1503,6 +1569,7 @@
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->kscrubd_wait);
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
@@ -1525,6 +1592,7 @@
spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->zero_pages = 0;
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
@@ -1558,6 +1626,13 @@
pcp->high = 2 * batch;
pcp->batch = 1 * batch;
INIT_LIST_HEAD(&pcp->list);
+
+ pcp = &zone->pageset[cpu].pcp[2]; /* zero pages */
+ pcp->count = 0;
+ pcp->low = 0;
+ pcp->high = 2 * batch;
+ pcp->batch = 1 * batch;
+ INIT_LIST_HEAD(&pcp->list);
}
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
zone_names[j], realsize, batch);
@@ -1687,7 +1762,7 @@
unsigned long nr_bufs = 0;
struct list_head *elem;
- list_for_each(elem, &(zone->free_area[order].free_list))
+ list_for_each(elem, &(zone->free_area[NOT_ZEROED][order].free_list))
++nr_bufs;
seq_printf(m, "%6lu ", nr_bufs);
}
Index: linux-2.6.9/include/linux/mmzone.h
===================================================================
--- linux-2.6.9.orig/include/linux/mmzone.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/mmzone.h 2004-12-22 14:24:56.000000000 -0800
@@ -51,7 +51,7 @@
};
struct per_cpu_pageset {
- struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
+ struct per_cpu_pages pcp[3]; /* 0: hot. 1: cold 2: cold zeroed pages */
#ifdef CONFIG_NUMA
unsigned long numa_hit; /* allocated in intended node */
unsigned long numa_miss; /* allocated in non intended node */
@@ -107,10 +107,14 @@
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
+#define NOT_ZEROED 0
+#define ZEROED 1
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
+ unsigned long zero_pages;
/*
* protection[] is a pre-calculated number of extra pages that must be
* available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
* free areas of different sizes
*/
spinlock_t lock;
- struct free_area free_area[MAX_ORDER];
+ struct free_area free_area[2][MAX_ORDER];
ZONE_PADDING(_pad1_)
@@ -265,6 +269,9 @@
struct pglist_data *pgdat_next;
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
+
+ wait_queue_head_t kscrubd_wait;
+ struct task_struct *kscrubd;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
extern struct pglist_data *pgdat_list;
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat);
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
void get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free);
+ unsigned long *free, unsigned long *zero);
void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone);
Index: linux-2.6.9/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.9.orig/fs/proc/proc_misc.c 2004-12-17 14:40:15.000000000 -0800
+++ linux-2.6.9/fs/proc/proc_misc.c 2004-12-22 14:24:56.000000000 -0800
@@ -158,13 +158,14 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
unsigned long vmtot;
unsigned long committed;
unsigned long allowed;
struct vmalloc_info vmi;
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
/*
* display in kilobytes.
@@ -187,6 +188,7 @@
len = sprintf(page,
"MemTotal: %8lu kB\n"
"MemFree: %8lu kB\n"
+ "MemZero: %8lu kB\n"
"Buffers: %8lu kB\n"
"Cached: %8lu kB\n"
"SwapCached: %8lu kB\n"
@@ -210,6 +212,7 @@
"VmallocChunk: %8lu kB\n",
K(i.totalram),
K(i.freeram),
+ K(zero),
K(i.bufferram),
K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
K(total_swapcache_pages),
Index: linux-2.6.9/mm/readahead.c
===================================================================
--- linux-2.6.9.orig/mm/readahead.c 2004-10-18 14:53:11.000000000 -0700
+++ linux-2.6.9/mm/readahead.c 2004-12-22 14:24:56.000000000 -0800
@@ -570,7 +570,8 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
Index: linux-2.6.9/drivers/base/node.c
===================================================================
--- linux-2.6.9.orig/drivers/base/node.c 2004-10-18 14:53:22.000000000 -0700
+++ linux-2.6.9/drivers/base/node.c 2004-12-22 14:24:56.000000000 -0800
@@ -41,13 +41,15 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
si_meminfo_node(&i, nid);
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));
n = sprintf(buf, "\n"
"Node %d MemTotal: %8lu kB\n"
"Node %d MemFree: %8lu kB\n"
+ "Node %d MemZero: %8lu kB\n"
"Node %d MemUsed: %8lu kB\n"
"Node %d Active: %8lu kB\n"
"Node %d Inactive: %8lu kB\n"
@@ -57,6 +59,7 @@
"Node %d LowFree: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
+ nid, K(zero),
nid, K(i.totalram - i.freeram),
nid, K(active),
nid, K(inactive),
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h 2004-12-22 14:24:56.000000000 -0800
@@ -715,6 +715,7 @@
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
+#define PF_KSCRUBD 0x00800000 /* I am kscrubd */
#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.9/mm/Makefile
===================================================================
--- linux-2.6.9.orig/mm/Makefile 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/Makefile 2004-12-22 14:24:56.000000000 -0800
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o
+ vmalloc.o scrubd.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.9/mm/scrubd.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/mm/scrubd.c 2004-12-22 14:26:35.000000000 -0800
@@ -0,0 +1,146 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = MAX_ORDER; /* Off */
+unsigned int sysctl_scrub_stop = 2; /* Minimum order of page to zero */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec(table, write, file, buffer, length, ppos);
+ if (sysctl_scrub_start < MAX_ORDER) {
+ struct zone *zone;
+
+ for_each_zone(zone)
+ wakeup_kscrubd(zone);
+ }
+ return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+ int order;
+
+ for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+ struct free_area *area = z->free_area[NOT_ZEROED] + order;
+ if (!list_empty(&area->free_list)) {
+ struct page *page = scrubd_rmpage(z, area, order);
+ struct list_head *l;
+
+ if (!page)
+ continue;
+
+ page->index = order;
+
+ list_for_each(l, &zero_drivers) {
+ struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+ unsigned long size = PAGE_SIZE << order;
+
+ if (driver->start(page_address(page), size) == 0) {
+
+ unsigned ticks = (size*HZ)/driver->rate;
+ if (ticks) {
+ /* Wait the minimum time of the transfer */
+ current->state = TASK_INTERRUPTIBLE;
+ schedule_timeout(ticks);
+ }
+ /* Then keep on checking until transfer is complete */
+ while (!driver->check())
+ schedule();
+ goto out;
+ }
+ }
+
+ /* Unable to find a zeroing device that would
+ * deal with this page so just do it on our own.
+ * This will likely thrash the cpu caches.
+ */
+ cond_resched();
+ clear_page(page_address(page), order);
+out:
+ end_zero_page(page);
+ cond_resched();
+ return 1 << order;
+ }
+ }
+ return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+ int i;
+ unsigned long pages_zeroed;
+
+ if (system_state != SYSTEM_RUNNING)
+ return;
+
+ do {
+ pages_zeroed = 0;
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ pages_zeroed += zero_highest_order_page(zone);
+ }
+ } while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t*)p;
+ struct task_struct *tsk = current;
+ DEFINE_WAIT(wait);
+ cpumask_t cpumask;
+
+ daemonize("kscrubd%d", pgdat->node_id);
+ cpumask = node_to_cpumask(pgdat->node_id);
+ if (!cpus_empty(cpumask))
+ set_cpus_allowed(tsk, cpumask);
+
+ tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+ for ( ; ; ) {
+ if (current->flags & PF_FREEZE)
+ refrigerator(PF_FREEZE);
+ prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+ schedule();
+ finish_wait(&pgdat->kscrubd_wait, &wait);
+
+ scrub_pgdat(pgdat);
+ }
+ return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+ pg_data_t *pgdat;
+ for_each_pgdat(pgdat)
+ pgdat->kscrubd
+ = find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+ return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.9/include/linux/scrub.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/scrub.h 2004-12-22 14:24:56.000000000 -0800
@@ -0,0 +1,48 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+ int (*start)(void *, unsigned long); /* Start bzero transfer */
+ int (*check)(void); /* Check if bzero is complete */
+ unsigned long rate; /* zeroing rate in bytes/sec */
+ struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+ list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+ list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static void inline wakeup_kscrubd(struct zone *zone)
+{
+ if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+ return;
+ wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.9/kernel/sysctl.c
===================================================================
--- linux-2.6.9.orig/kernel/sysctl.c 2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c 2004-12-22 14:24:56.000000000 -0800
@@ -40,6 +40,7 @@
#include <linux/times.h>
#include <linux/limits.h>
#include <linux/dcache.h>
+#include <linux/scrub.h>
#include <linux/syscalls.h>
#include <asm/uaccess.h>
@@ -816,6 +817,24 @@
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = VM_SCRUB_START,
+ .procname = "scrub_start",
+ .data = &sysctl_scrub_start,
+ .maxlen = sizeof(sysctl_scrub_start),
+ .mode = 0644,
+ .proc_handler = &scrub_start_handler,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_STOP,
+ .procname = "scrub_stop",
+ .data = &sysctl_scrub_stop,
+ .maxlen = sizeof(sysctl_scrub_stop),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
{ .ctl_name = 0 }
};
Index: linux-2.6.9/include/linux/sysctl.h
===================================================================
--- linux-2.6.9.orig/include/linux/sysctl.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h 2004-12-22 14:24:56.000000000 -0800
@@ -168,6 +168,8 @@
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_SCRUB_START=30, /* percentage * 10 at which to start scrubd */
+ VM_SCRUB_STOP=31, /* percentage * 10 at which to stop scrubd */
};
^ permalink raw reply [flat|nested] 89+ messages in thread
* Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
2004-12-23 19:34 ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
@ 2004-12-23 19:35 ` Christoph Lameter
2004-12-23 20:08 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
3 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:35 UTC (permalink / raw)
To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
o Zeroing driver implemented with the Block Transfer Engine in the Altix SN2 SHub
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.9/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/bte.c 2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/bte.c 2004-12-22 12:48:23.000000000 -0800
@@ -4,6 +4,8 @@
* for more details.
*
* Copyright (c) 2000-2003 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
*/
#include <linux/config.h>
@@ -20,6 +22,8 @@
#include <linux/bootmem.h>
#include <linux/string.h>
#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>
#include <asm/sn/bte.h>
@@ -30,7 +34,7 @@
/* two interfaces on two btes */
#define MAX_INTERFACES_TO_TRY 4
-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
{
nodepda_t *tmp_nodepda;
@@ -132,7 +136,6 @@
if (bte == NULL) {
continue;
}
-
if (spin_trylock(&bte->spinlock)) {
if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
(BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
}
} while (1);
- if (notification == NULL) {
+ if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
/* User does not want to be notified. */
bte->most_rcnt_na = &bte->notify;
} else {
@@ -192,6 +195,8 @@
itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);
+ if (mode & BTE_NOTIFY_AND_GET_POINTER)
+ *(u64 volatile **)(notification) = &bte->notify;
spin_unlock_irqrestore(&bte->spinlock, irq_flags);
if (notification != NULL) {
@@ -449,5 +454,37 @@
mynodepda->bte_if[i].cleanup_active = 0;
mynodepda->bte_if[i].bh_error = 0;
}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+static int bte_check_bzero(void)
+{
+ int node = get_nasid();
+
+ return *(bte_zero_notify[node]) != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+ int node = get_nasid();
+
+ /* Check limitations.
+ 1. System must be running (weird things happen during bootup)
+ 2. Size >64KB. Smaller requests cause too much bte traffic
+ */
+ if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+ return EINVAL;
+
+ return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+}
+
+static struct zero_driver bte_bzero = {
+ .start = bte_start_bzero,
+ .check = bte_check_bzero,
+ .rate = 500000000 /* 500 MB /sec */
+};
+void sn_bte_bzero_init(void) {
+ register_zero_driver(&bte_bzero);
}
Index: linux-2.6.9/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/sn/kernel/setup.c 2004-12-17 14:40:10.000000000 -0800
+++ linux-2.6.9/arch/ia64/sn/kernel/setup.c 2004-12-22 12:28:00.000000000 -0800
@@ -243,6 +243,7 @@
int pxm;
int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
extern void sn_cpu_init(void);
+ extern void sn_bte_bzero_init(void);
/*
* If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
screen_info = sn_screen_info;
sn_timer_init();
+ sn_bte_bzero_init();
}
/**
Index: linux-2.6.9/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/sn/bte.h 2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/asm-ia64/sn/bte.h 2004-12-22 12:28:00.000000000 -0800
@@ -48,6 +48,8 @@
#define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
/* Use a reserved bit to let the caller specify a wait for any BTE */
#define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
/* Use the BTE on the node with the destination memory */
#define BTE_USE_DEST (BTE_WACQUIRE << 1)
/* Use any available BTE interface on any node for the transfer */
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
@ 2004-12-23 19:49 ` Arjan van de Ven
2004-12-23 20:57 ` Matt Mackall
` (2 subsequent siblings)
4 siblings, 0 replies; 89+ messages in thread
From: Arjan van de Ven @ 2004-12-23 19:49 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.
eh why will all cachelines be loaded? Surely you can avoid the write-
allocate behavior for this case.....
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
` (2 preceding siblings ...)
2004-12-23 19:35 ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
@ 2004-12-23 20:08 ` Brian Gerst
2004-12-24 16:24 ` Christoph Lameter
3 siblings, 1 reply; 89+ messages in thread
From: Brian Gerst @ 2004-12-23 20:08 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
Christoph Lameter wrote:
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.
>
> o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set
>
> o Replace all page zeroing after allocating pages by request for
> zeroed pages.
>
> o requires arch updates to clear_page in order to function properly.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> @@ -125,22 +125,19 @@
> int i;
> struct packet_data *pkt;
>
> - pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL);
> + pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO);
> if (!pkt)
> goto no_pkt;
> - memset(pkt, 0, sizeof(struct packet_data));
>
> pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE);
> if (!pkt->w_bio)
This part is wrong. kmalloc() uses the slab allocator instead of
getting a full page.
--
Brian Gerst
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
2004-12-23 19:49 ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
@ 2004-12-23 20:57 ` Matt Mackall
2004-12-23 21:01 ` Paul Mackerras
2004-12-23 21:11 ` Paul Mackerras
4 siblings, 0 replies; 89+ messages in thread
From: Matt Mackall @ 2004-12-23 20:57 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
On Thu, Dec 23, 2004 at 11:29:10AM -0800, Christoph Lameter wrote:
> 2. Hardware support for offloading zeroing from the cpu. This avoids
> the invalidation of the cpu caches by extensive zeroing operations.
I'm wondering if it would be possible to use typical video cards for
hardware zeroing. We could set aside a page's worth of zeros in video
memory and then use the card's DMA engines to clear pages on the host.
This could be done in fbdev drivers, which would register a zeroer
with the core.
--
Mathematics is the supreme nostalgia of our time.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
` (2 preceding siblings ...)
2004-12-23 20:57 ` Matt Mackall
@ 2004-12-23 21:01 ` Paul Mackerras
2004-12-23 21:11 ` Paul Mackerras
4 siblings, 0 replies; 89+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:01 UTC (permalink / raw)
To: Christoph Lameter
Christoph Lameter writes:
> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.
On ppc64 we avoid having to zero newly-allocated page table pages by
using a slab cache for them, with a constructor function that zeroes
them. Page table pages naturally end up being full of zeroes when
they are freed, since ptep_get_and_clear, pmd_clear or pgd_clear has
been used on every non-zero entry by that stage. Thus there is no
extra work required either when allocating them or freeing them.
I don't see any point in your patches for systems which don't have
some magic hardware for zeroing pages. Your patch seems like a lot of
extra code that only benefits a very small number of machines.
Paul.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
` (3 preceding siblings ...)
2004-12-23 21:01 ` Paul Mackerras
@ 2004-12-23 21:11 ` Paul Mackerras
2004-12-23 21:37 ` Andrew Morton
2004-12-23 21:48 ` Linus Torvalds
4 siblings, 2 replies; 89+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:11 UTC (permalink / raw)
To: Christoph Lameter
Christoph Lameter writes:
> The most expensive operation in the page fault handler is (apart from SMP
> locking overhead) the zeroing of the page.
Re-reading this I see that you mean the zeroing of the page that is
mapped into the process address space, not the page table pages. So
ignore my previous reply.
Do you have any statistics on how often a page fault needs to supply a
page of zeroes versus supplying a copy of an existing page, for real
applications?
In any case, unless you have magic page-zeroing hardware, I am still
inclined to think that zeroing the page at the time of the fault is
the most efficient, since that means the page will be hot in the cache
for the process to use. If you zero it earlier using CPU stores, it
can only cause more overall memory traffic, as far as I can see.
I did some measurements once on my G5 powermac (running a ppc64 linux
kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
page. This is real-life elapsed time in the kernel, not just some
cache-hot benchmark measurement. Thus I don't think your patch will
gain us anything on ppc64.
Paul.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 21:11 ` Paul Mackerras
@ 2004-12-23 21:37 ` Andrew Morton
2004-12-23 23:00 ` Paul Mackerras
2004-12-23 21:48 ` Linus Torvalds
1 sibling, 1 reply; 89+ messages in thread
From: Andrew Morton @ 2004-12-23 21:37 UTC (permalink / raw)
To: Paul Mackerras; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel
Paul Mackerras <paulus@samba.org> wrote:
>
> Christoph Lameter writes:
>
> > The most expensive operation in the page fault handler is (apart from SMP
> > locking overhead) the zeroing of the page.
>
> Re-reading this I see that you mean the zeroing of the page that is
> mapped into the process address space, not the page table pages. So
> ignore my previous reply.
>
> Do you have any statistics on how often a page fault needs to supply a
> page of zeroes versus supplying a copy of an existing page, for real
> applications?
When the workload is a gcc run, the pagefault handler dominates the system
time. That's the page zeroing.
> In any case, unless you have magic page-zeroing hardware, I am still
> inclined to think that zeroing the page at the time of the fault is
> the most efficient, since that means the page will be hot in the cache
> for the process to use. If you zero it earlier using CPU stores, it
> can only cause more overall memory traffic, as far as I can see.
x86's movnta instructions provide a way of initialising memory without
trashing the caches, and they have pretty good bandwidth, I believe. We should
wire that up to these patches and see if it speeds things up.
> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.
40GB/s. Is that straight into L1 or does the measurement include writeback?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 21:11 ` Paul Mackerras
2004-12-23 21:37 ` Andrew Morton
@ 2004-12-23 21:48 ` Linus Torvalds
2004-12-23 22:34 ` Zwane Mwaikambo
` (2 more replies)
1 sibling, 3 replies; 89+ messages in thread
From: Linus Torvalds @ 2004-12-23 21:48 UTC (permalink / raw)
To: Paul Mackerras
Cc: Christoph Lameter, Andrew Morton, linux-ia64, torvalds, linux-mm,
Kernel Mailing List
On Fri, 24 Dec 2004, Paul Mackerras wrote:
>
> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page. This is real-life elapsed time in the kernel, not just some
> cache-hot benchmark measurement. Thus I don't think your patch will
> gain us anything on ppc64.
Well, the thing is, if we really _know_ the machine is idle (and not just
waiting for something like disk IO), it might be a good idea to just
pre-zero everything we can.
The question to me is whether we can have a good enough heuristic to
notice that it triggers often enough to matter, but seldom enough that it
really won't disturb anybody.
And "disturb" very much includes things like laptop battery life,
scheduling latencies, memory bus traffic _and_ cache contents.
And I really don't see a very good heuristic. Maybe it might literally be
something like "five-second load average goes down to zero" (we've got
fixed-point arithmetic with eleven fractional bits, so we can tune just
how close to "zero" we want to get). The load average is system-wide and
takes disk load (which tends to imply latency-critical work) into account,
so that might actually work out reasonably well as a "the system really is
quiescent".
So if we make the "what load is considered low" tunable, a system
administrator can use that to make it more aggressive. And indeed, you
might have a cron-job that says "be more aggressive at clearing pages
between 2AM and 4AM in the morning" or something - if you have so much
memory that it actually matters if you clear the memory just occasionally.
And the tunable load-average check has another advantage: if you want to
benchmark it, you can first set it to true zero (basically never), and run
the benchmark, and then you can set it to something very aggressive ("clear
pages every five seconds regardless of load") and re-run.
Does this sound sane? Christoph - can you try making the "scrub daemon" do
that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to
them), do a "scrub-load" thing that takes a scaled integer, and compares it
with "avenrun[0]" in kernel/timer.c: calc_load() when the average is
updated every five seconds..
Personally, at least for a desktop usage, I think that the load average
would work wonderfully well. I know my machines are often at basically
zero load, and then having low-latency zero-pages when I sit down sounds
like a good idea. Whether there is _enough_ free memory around for a
5-second thing to work out well, I have no idea..
Linus
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 21:48 ` Linus Torvalds
@ 2004-12-23 22:34 ` Zwane Mwaikambo
2004-12-24 9:14 ` Arjan van de Ven
2004-12-24 16:17 ` Christoph Lameter
2 siblings, 0 replies; 89+ messages in thread
From: Zwane Mwaikambo @ 2004-12-23 22:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
linux-mm, Kernel Mailing List
On Thu, 23 Dec 2004, Linus Torvalds wrote:
> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..
Isn't the basic premise very similar to the following paper:
http://www.usenix.org/publications/library/proceedings/osdi99/full_papers/dougan/dougan_html/dougan.html
In fact I thought ppc32 did something akin to this.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 21:37 ` Andrew Morton
@ 2004-12-23 23:00 ` Paul Mackerras
0 siblings, 0 replies; 89+ messages in thread
From: Paul Mackerras @ 2004-12-23 23:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel
Andrew Morton writes:
> When the workload is a gcc run, the pagefault handler dominates the system
> time. That's the page zeroing.
For a program which uses a lot of heap and doesn't fork, that sounds
reasonable.
> x86's movnta instructions provide a way of initialising memory without
> trashing the caches and it has pretty good bandwidth, I believe. We should
> wire that up to these patches and see if it speeds things up.
Yes. I don't know the movnta instruction, but surely, whatever scheme
is used, there has to be a snoop for every cache line's worth of
memory that is zeroed.
The other point is that having the page hot in the cache may well be a
benefit to the program. Using any sort of cache-bypassing zeroing
might not actually make things faster, when the user time as well as
the system time is taken into account.
> > I did some measurements once on my G5 powermac (running a ppc64 linux
> > kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> > page.
>
> 40GB/s. Is that straight into L1 or does the measurement include writeback?
It is the average elapsed time in clear_page, so it would include the
writeback of any cache lines displaced by the zeroing, but not the
writeback of the newly-zeroed cache lines (which we hope will be
modified by the program before they get written back anyway).
This is using the dcbz (data cache block zero) instruction, which
establishes a cache line in modified state with zero contents without
any memory traffic other than a cache line kill transaction sent to
the other CPUs and possible writeback of a dirty cache line displaced
by the newly-zeroed cache line. The new cache line is established in
the L2 cache, because the L1 is write-through on the G5, and all
stores and dcbz instructions have to go to the L2 cache.
Thus, on the G5 (and POWER4, which is similar) I don't think there
will be much if any benefit from having pre-zeroed cache-cold pages.
We can establish the zero lines in cache much faster using dcbz than
we can by reading them in from main memory. If the program uses only
a few cache lines out of each new page, then reading them from memory
might be faster, but that seems unlikely.
Paul.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-23 19:33 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
@ 2004-12-24 8:33 ` Pavel Machek
2004-12-24 16:18 ` Christoph Lameter
2004-12-24 17:05 ` David S. Miller
2005-01-01 10:24 ` Geert Uytterhoeven
2 siblings, 1 reply; 89+ messages in thread
From: Pavel Machek @ 2004-12-24 8:33 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
Hi!
> o Extend clear_page to take an order parameter for all architectures.
>
I believe you should leave clear_page() as is, and introduce
clear_pages() with two arguments.
Pavel
> -extern void clear_page (void *page);
> +extern void clear_page (void *page, int order);
> extern void copy_page (void *to, void *from);
>
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 21:48 ` Linus Torvalds
2004-12-23 22:34 ` Zwane Mwaikambo
@ 2004-12-24 9:14 ` Arjan van de Ven
2004-12-24 18:21 ` Linus Torvalds
2004-12-24 16:17 ` Christoph Lameter
2 siblings, 1 reply; 89+ messages in thread
From: Arjan van de Ven @ 2004-12-24 9:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
linux-mm, Kernel Mailing List
> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..
Problem is: will it buy you anything if you use the page again
anyway, since such pages will be cold in cache by then? So for sure some of
it is only shifting latency from the kernel side to the userspace side, but
readprofile doesn't measure the latter, so it *looks* better...
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 21:48 ` Linus Torvalds
2004-12-23 22:34 ` Zwane Mwaikambo
2004-12-24 9:14 ` Arjan van de Ven
@ 2004-12-24 16:17 ` Christoph Lameter
2 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Andrew Morton, linux-ia64, linux-mm, Kernel Mailing List
On Thu, 23 Dec 2004, Linus Torvalds wrote:
> So if we make the "what load is considered low" tunable, a system
> administrator can use that to make it more aggressive. And indeed, you
> might have a cron-job that says "be more aggressive at clearing pages
> between 2AM and 4AM in the morning" or something - if you have so much
> memory that it actually matters if you clear the memory just occasionally.
>
> And the tunable load-average check has another advantage: if you want to
> benchmark it, you can first set it to true zero (basically never), and run
> the benchmark, and then you can set it to something very aggressive ("clear
> pages every five seconds regardless of load") and re-run.
>
> Does this sound sane? Christoph - can you try making the "scrub daemon" do
> that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to
> them), do a "scrub-load" thing that takes a scaled integer, and compares it
> with "avenrun[0]" in kernel/timer.c: calc_load() when the average is
> updated every five seconds..
Sure, V3 will have that. So far the impact of zeroing is quite minimal
on IA64 (even without using hardware); the big zeroing happens immediately
after activation anyway. I have not seen any measurable effect on
benchmarks, even with 4G allocations on a 6G machine.
> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..
Each CPU can do a couple of gigabytes of zeroing per second, and the
zeroing targets local RAM. On my 6G machine with 8 CPUs it takes only
a fraction of a second to zero all RAM.
Merry Christmas, I am off till next year. SGI mandatory holiday
shutdown so all addicts have to go cold turkey ;-)
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-24 8:33 ` Pavel Machek
@ 2004-12-24 16:18 ` Christoph Lameter
2004-12-24 16:27 ` Pavel Machek
0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:18 UTC (permalink / raw)
To: Pavel Machek; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
On Fri, 24 Dec 2004, Pavel Machek wrote:
> Hi!
>
> > o Extend clear_page to take an order parameter for all architectures.
> >
>
> I believe you should leave clear_page() as is, and introduce
> clear_pages() with two arguments.
Did that in V1 and Andi Kleen complained about it.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal
2004-12-23 20:08 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
@ 2004-12-24 16:24 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:24 UTC (permalink / raw)
To: Brian Gerst; +Cc: linux-ia64, linux-mm, linux-kernel
On Thu, 23 Dec 2004, Brian Gerst wrote:
> This part is wrong. kmalloc() uses the slab allocator instead of
> getting a full page.
Thanks for finding that. V3 will have that fixed.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-24 16:18 ` Christoph Lameter
@ 2004-12-24 16:27 ` Pavel Machek
2004-12-24 17:02 ` David S. Miller
0 siblings, 1 reply; 89+ messages in thread
From: Pavel Machek @ 2004-12-24 16:27 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
Hi!
> > > o Extend clear_page to take an order parameter for all architectures.
> > >
> >
> > I believe you should leave clear_page() as is, and introduce
> > clear_pages() with two arguments.
>
> Did that in V1 and Andi Kleen complained about it.
I do not know what Andi said, but having clear_page clearing two
page*s* seems wrong to me.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-24 16:27 ` Pavel Machek
@ 2004-12-24 17:02 ` David S. Miller
0 siblings, 0 replies; 89+ messages in thread
From: David S. Miller @ 2004-12-24 17:02 UTC (permalink / raw)
To: Pavel Machek; +Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel
On Fri, 24 Dec 2004 17:27:45 +0100
Pavel Machek <pavel@ucw.cz> wrote:
> I do not know what Andi said, but having clear_page clearing two
> page*s* seems wrong to me.
It's represented by a single top-level page struct regardless
of its order, so in that sense it's indeed a single page
no matter its order.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-23 19:33 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
2004-12-24 8:33 ` Pavel Machek
@ 2004-12-24 17:05 ` David S. Miller
2004-12-27 22:48 ` David S. Miller
2005-01-03 17:52 ` Christoph Lameter
2005-01-01 10:24 ` Geert Uytterhoeven
2 siblings, 2 replies; 89+ messages in thread
From: David S. Miller @ 2004-12-24 17:05 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> Modification made but it would be good to have some feedback from the arch maintainers:
>
...
> sparc64
I don't see any sparc64 bits in this patch, else I'd
review them :-)
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-24 9:14 ` Arjan van de Ven
@ 2004-12-24 18:21 ` Linus Torvalds
2004-12-24 18:57 ` Arjan van de Ven
2004-12-27 22:50 ` David S. Miller
0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2004-12-24 18:21 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
linux-mm, Kernel Mailing List
On Fri, 24 Dec 2004, Arjan van de Ven wrote:
>
> problem is.. will it buy you anything if you use the page again
> anyway... since such pages will be cold cached now. So for sure some of
> it is only shifting latency from kernel side to userspace side, but
> readprofile doesn't measure the later so it *looks* better...
Absolutely. I would want to see some real benchmarks before we do this.
Not just some microbenchmark of "how many page faults can we take without
_using_ the page at all".
I agree 100% with you that we shouldn't shift the costs around. Having a
hice hot-spot that we know about is a good thing, and it means that
performance profiles show what the time is really spent on. Often getting
rid of the hotspot just smears out the work over a wider area, making
other optimizations (like trying to make the memory footprint _smaller_
and removing the work entirely that way) totally impossible because now
the performance profile just has a constant background noise and you can't
tell what the real problem is.
Linus
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Increase page fault rate by prezeroing V1 [0/3]: Overview
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
` (3 preceding siblings ...)
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-24 18:31 ` Andrea Arcangeli
2005-01-03 17:54 ` Christoph Lameter
4 siblings, 1 reply; 89+ messages in thread
From: Andrea Arcangeli @ 2004-12-24 18:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
torvalds, linux-mm, linux-kernel
Did you notice I already implemented full PG_zero caching here with
prezeroing on top of it?
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2-no-zerolist-reserve-1
I was about to push this in SP1, but it was a bit late.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-24 18:21 ` Linus Torvalds
@ 2004-12-24 18:57 ` Arjan van de Ven
2004-12-27 22:50 ` David S. Miller
1 sibling, 0 replies; 89+ messages in thread
From: Arjan van de Ven @ 2004-12-24 18:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
linux-mm, Kernel Mailing List
On Fri, 2004-12-24 at 10:21 -0800, Linus Torvalds wrote:
>
> On Fri, 24 Dec 2004, Arjan van de Ven wrote:
> >
> > problem is.. will it buy you anything if you use the page again
> > anyway... since such pages will be cold cached now. So for sure some of
> > it is only shifting latency from kernel side to userspace side, but
> > readprofile doesn't measure the later so it *looks* better...
>
> Absolutely. I would want to see some real benchmarks before we do this.
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".
>
> I agree 100% with you that we shouldn't shift the costs around. Having a
> nice hot-spot that we know about is a good thing, and it means that
> performance profiles show what the time is really spent on. Often getting
> rid of the hotspot just smears out the work over a wider area, making
> other optimizations (like trying to make the memory footprint _smaller_
> and removing the work entirely that way) totally impossible because now
> the performance profile just has a constant background noise and you can't
> tell what the real problem is.
I suspect it's even worse.
Think about it; you can spew 4k of zeroes into your L1 cache really fast
(assuming your cpu is smart enough to avoid write-allocate for rep
stosl; not sure which cpus are). I suspect you can do that faster than a
cache miss or two. And at that point the page is cache hot... so reads
don't miss either.
All this makes me wonder if there is any scenario where this thing will
be a gain, other than CPUs that aren't smart enough to avoid the
write-allocate.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-24 17:05 ` David S. Miller
@ 2004-12-27 22:48 ` David S. Miller
2005-01-03 17:52 ` Christoph Lameter
1 sibling, 0 replies; 89+ messages in thread
From: David S. Miller @ 2004-12-27 22:48 UTC (permalink / raw)
To: David S. Miller
Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel
On Fri, 24 Dec 2004 09:05:39 -0800
"David S. Miller" <davem@davemloft.net> wrote:
> On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Modification made but it would be good to have some feedback from the arch maintainers:
> >
> ...
> > sparc64
>
> I don't see any sparc64 bits in this patch, else I'd
> review them :-)
So I found time to implement the missing sparc64 clear_page()
changes, here they are:
===== arch/sparc64/lib/clear_page.S 1.1 vs edited =====
--- 1.1/arch/sparc64/lib/clear_page.S 2004-08-08 19:54:07 -07:00
+++ edited/arch/sparc64/lib/clear_page.S 2004-12-24 08:53:29 -08:00
@@ -28,9 +28,12 @@
.text
.globl _clear_page
-_clear_page: /* %o0=dest */
+_clear_page: /* %o0=dest, %o1=order */
+ sethi %hi(PAGE_SIZE/64), %o2
+ clr %o4
+ or %o2, %lo(PAGE_SIZE/64), %o2
ba,pt %xcc, clear_page_common
- clr %o4
+ sllx %o2, %o1, %o1
/* This thing is pretty important, it shows up
* on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@ clear_user_page: /* %o0=dest, %o1=vaddr
flush %g6
wrpr %o4, 0x0, %pstate
+ sethi %hi(PAGE_SIZE/64), %o1
mov 1, %o4
+ or %o1, %lo(PAGE_SIZE/64), %o1
clear_page_common:
VISEntryHalf
membar #StoreLoad | #StoreStore | #LoadStore
fzero %f0
- sethi %hi(PAGE_SIZE/64), %o1
mov %o0, %g1 ! remember vaddr for tlbflush
fzero %f2
- or %o1, %lo(PAGE_SIZE/64), %o1
faddd %f0, %f2, %f4
fmuld %f0, %f2, %f6
faddd %f0, %f2, %f8
===== include/asm-sparc64/page.h 1.19 vs edited =====
--- 1.19/include/asm-sparc64/page.h 2004-07-27 12:54:49 -07:00
+++ edited/include/asm-sparc64/page.h 2004-12-24 08:52:17 -08:00
@@ -14,8 +14,8 @@
#ifndef __ASSEMBLY__
-extern void _clear_page(void *page);
-#define clear_page(X) _clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y) _clear_page((void *)(X),(Y))
struct page;
extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
#define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-24 18:21 ` Linus Torvalds
2004-12-24 18:57 ` Arjan van de Ven
@ 2004-12-27 22:50 ` David S. Miller
2004-12-28 11:53 ` Marcelo Tosatti
1 sibling, 1 reply; 89+ messages in thread
From: David S. Miller @ 2004-12-27 22:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: arjan, paulus, clameter, akpm, linux-ia64, linux-mm, linux-kernel
On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:
> Absolutely. I would want to see some real benchmarks before we do this.
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".
Here's my small contribution. I did three "make -j3 vmlinux" timed
runs, one running a kernel without the pre-zeroing stuff applied,
one with it applied. It did shave a few seconds off the build
consistently. Here is the before:
real 8m35.248s
user 15m54.132s
sys 1m1.098s
real 8m32.202s
user 15m54.329s
sys 1m0.229s
real 8m31.932s
user 15m54.160s
sys 1m0.245s
and here is the after:
real 8m29.375s
user 15m43.296s
sys 0m59.549s
real 8m28.213s
user 15m39.819s
sys 0m58.790s
real 8m26.140s
user 15m44.145s
sys 0m58.872s
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-27 22:50 ` David S. Miller
@ 2004-12-28 11:53 ` Marcelo Tosatti
0 siblings, 0 replies; 89+ messages in thread
From: Marcelo Tosatti @ 2004-12-28 11:53 UTC (permalink / raw)
To: David S. Miller
Cc: Linus Torvalds, arjan, paulus, clameter, akpm, linux-ia64,
linux-mm, linux-kernel
On Mon, Dec 27, 2004 at 02:50:57PM -0800, David S. Miller wrote:
> On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
>
> > Absolutely. I would want to see some real benchmarks before we do this.
> > Not just some microbenchmark of "how many page faults can we take without
> > _using_ the page at all".
>
> Here's my small contribution. I did three "make -j3 vmlinux" timed
> runs, one running a kernel without the pre-zeroing stuff applied,
> one with it applied. It did shave a few seconds off the build
> consistently. Here is the before:
>
> real 8m35.248s
> user 15m54.132s
> sys 1m1.098s
>
> real 8m32.202s
> user 15m54.329s
> sys 1m0.229s
>
> real 8m31.932s
> user 15m54.160s
> sys 1m0.245s
>
> and here is the after:
>
> real 8m29.375s
> user 15m43.296s
> sys 0m59.549s
>
> real 8m28.213s
> user 15m39.819s
> sys 0m58.790s
>
> real 8m26.140s
> user 15m44.145s
> sys 0m58.872s
Christoph and other SGI fellows,
Get your patch into STP, once its there we can do some wider x86 benchmarking
easily.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
@ 2005-01-01 2:22 ` Nick Piggin
2005-01-01 2:55 ` pmarques
0 siblings, 1 reply; 89+ messages in thread
From: Nick Piggin @ 2005-01-01 2:22 UTC (permalink / raw)
To: Christoph Lameter
Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
linux-mm, linux-kernel
Christoph Lameter wrote:
> o Add page zeroing
> o Add scrub daemon
> o Add ability to view amount of zeroed information in /proc/meminfo
>
I quite like how you're handling the page zeroing now. It seems
less intrusive and cleaner in its interface to the page allocator.
I think this is pretty close to what I'd be happy with if we decide
to go with zeroing.
Just one small comment - there is a patch in the -mm tree that may
be of use to you; mm-keep-count-of-free-areas.patch is used later
by kswapd to handle and account higher order free areas properly.
You may be able to use it to better implement triggers/watermarks
for the scrub daemon.
Also...
> +
> +/*
> + * zero_highest_order_page takes a page off the freelist
> + * and then hands it off to block zeroing agents.
> + * The cleared pages are added to the back of
> + * the freelist where the page allocator may pick them up.
> + */
> +int zero_highest_order_page(struct zone *z)
> +{
> + int order;
> +
> + for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
> + struct free_area *area = z->free_area[NOT_ZEROED] + order;
> + if (!list_empty(&area->free_list)) {
> + struct page *page = scrubd_rmpage(z, area, order);
> + struct list_head *l;
> +
> + if (!page)
> + continue;
> +
> + page->index = order;
> +
> + list_for_each(l, &zero_drivers) {
> + struct zero_driver *driver = list_entry(l, struct zero_driver, list);
> + unsigned long size = PAGE_SIZE << order;
> +
> + if (driver->start(page_address(page), size) == 0) {
> +
> + unsigned ticks = (size*HZ)/driver->rate;
> + if (ticks) {
> + /* Wait the minimum time of the transfer */
> + current->state = TASK_INTERRUPTIBLE;
> + schedule_timeout(ticks);
> + }
> + /* Then keep on checking until transfer is complete */
> + while (!driver->check())
> + schedule();
> + goto out;
> + }
Would you be better off to just have a driver->zero_me(...) call, with this
logic pushed into those like your BTE which need it? I'm thinking this would
help flexibility if you had say a BTE-thingy that did an interrupt on
completion, or if it was done synchronously by the CPU with cache bypassing
stores.
Also, would there be any use in passing a batch of pages to the zeroing driver?
That may improve performance on some implementations, but could also cut down
the inefficiency in your timeout mechanism due to timer quantization (I guess
probably not much if you are only zeroing quite large areas).
BTW, that while loop is basically a busy-wait. Not a critical problem, but you
may want to renice scrubd to the lowest scheduling priority to be a bit nicer?
(I think you'd want to do that anyway). And put a cpu_relax() call in there?
Just some suggestions.
Nick
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd
2005-01-01 2:22 ` Nick Piggin
@ 2005-01-01 2:55 ` pmarques
0 siblings, 0 replies; 89+ messages in thread
From: pmarques @ 2005-01-01 2:55 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Lameter, Luck, Tony, Robin Holt, Adam Litke,
linux-ia64, torvalds, linux-mm, linux-kernel
Quoting Nick Piggin <nickpiggin@yahoo.com.au>:
> [...]
> Would you be better off to just have a driver->zero_me(...) call, with this
> logic pushed into those like your BTE which need it? I'm thinking this would
> help flexibility if you had say a BTE-thingy that did an interrupt on
> completion, or if it was done synchronously by the CPU with cache bypassing
> stores.
It seems that people in this discussion are assuming that PCs don't have
hardware to do this at all.
While there is no _official_ hardware, a bt878 with the brightness setting all
the way down, at 1024 pixels per line, 32 bits per pixel, would be able to zero
a full physical page in under 60 microseconds (one PAL scanline). It could even
zero a _list_ of pages passed to it and generate an interrupt at the end.
This is just an example, and there might be some problems in the implementation
details that make it impossible to work, but there might also be more hardware
out there that could perform similar functions (graphics cards?).
This might not be worth the bother *at all*, but I can imagine some weird
conversation between two sysadmins:
"My server is wasting a lot of time handling page faults"
"Why don't you install a video acquisition board with a bt878 chip? It did
wonders on my server"
"Yes, I've also heard that a radeon graphics card can really accelerate kernel
compiles"
Well, just my 0.02 euro :)
--
Paulo Marques - www.grupopie.com
"A journey of a thousand miles begins with a single step."
Lao-tzu, The Way of Lao-tzu
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-23 19:33 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
2004-12-24 8:33 ` Pavel Machek
2004-12-24 17:05 ` David S. Miller
@ 2005-01-01 10:24 ` Geert Uytterhoeven
2005-01-04 23:12 ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
2 siblings, 1 reply; 89+ messages in thread
From: Geert Uytterhoeven @ 2005-01-01 10:24 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
On Thu, 23 Dec 2004, Christoph Lameter wrote:
> o Extend clear_page to take an order parameter for all architectures.
> Index: linux-2.6.9/include/asm-m68k/page.h
> ===================================================================
> --- linux-2.6.9.orig/include/asm-m68k/page.h 2004-10-18 14:55:36.000000000 -0700
> +++ linux-2.6.9/include/asm-m68k/page.h 2004-12-23 07:44:14.000000000 -0800
> @@ -50,7 +50,7 @@
> );
> }
>
> -static inline void clear_page(void *page)
> +static inline void clear_page(void *page, int order)
> {
> unsigned long tmp;
> unsigned long *sp = page;
> @@ -69,16 +69,16 @@
> "dbra %1,1b\n\t"
> : "=a" (sp), "=d" (tmp)
> : "a" (page), "0" (sp),
> - "1" ((PAGE_SIZE - 16) / 16 - 1));
> + "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
> }
>
> #else
> -#define clear_page(page) memset((page), 0, PAGE_SIZE)
> +#define clear_page(page, 0) memset((page), 0, PAGE_SIZE << (order))
^
order
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches
2004-12-24 17:05 ` David S. Miller
2004-12-27 22:48 ` David S. Miller
@ 2005-01-03 17:52 ` Christoph Lameter
1 sibling, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-03 17:52 UTC (permalink / raw)
To: David S. Miller; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel
On Fri, 24 Dec 2004, David S. Miller wrote:
> On Thu, 23 Dec 2004 11:33:59 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Modification made but it would be good to have some feedback from the arch maintainers:
> >
> ...
> > sparc64
>
> I don't see any sparc64 bits in this patch, else I'd
> review them :-)
>
Sorry here it is:
Index: linux-2.6.9/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc64/page.h 2004-10-18 14:53:51.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/page.h 2005-01-03 09:50:16.000000000 -0800
@@ -15,7 +15,17 @@
#ifndef __ASSEMBLY__
extern void _clear_page(void *page);
-#define clear_page(X) _clear_page((void *)(X))
+
+static void inline clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
struct page;
extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
#define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
* Re: Increase page fault rate by prezeroing V1 [0/3]: Overview
2004-12-24 18:31 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
@ 2005-01-03 17:54 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-03 17:54 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Nick Piggin, Luck, Tony, Robin Holt, Adam Litke, linux-ia64,
torvalds, linux-mm, linux-kernel
On Fri, 24 Dec 2004, Andrea Arcangeli wrote:
> Did you notice I already implemented full PG_zero caching here with
> prezeroing on top of it?
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2-no-zerolist-reserve-1
>
> I was about to push this in SP1, but it was a bit late.
Yes but this did not do the trick and the interface to get zeroed pages is
a bit difficult to handle.
* Prezeroing V3 [0/4]: Discussion and i386 performance tests
2005-01-01 10:24 ` Geert Uytterhoeven
@ 2005-01-04 23:12 ` Christoph Lameter
2005-01-04 23:13 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
` (3 more replies)
0 siblings, 4 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:12 UTC (permalink / raw)
Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
Change from V2 to V3:
o Updates for clear_page on various platforms
o Performance measurements on i386 (2x PIII-450 384M RAM)
o Port patches to 2.6.10-bk7
o Add scrub_load so that a high load prevents scrubd from running
(So that people may feel better about this approach. Set by
default to 999, so it's off. The typical result of not running kscrubd
under high loads is to slow the system down even further, since zeroing
large consecutive areas of memory is more efficient than zeroing page-size
chunks. Memory subsystems are typically optimized for linear accesses
and reach their peak performance if large areas of memory are written to.)
o Various fixes
The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running on
SMP systems with a high number of cpus. Single-threaded performance shows
only minor increases; only the performance of multi-threaded applications
increases significantly.
The most expensive operation in the page fault handler (apart from SMP
locking overhead) is the zeroing of the page, which is also done in the page
fault handler. This zeroing means that all cachelines of the faulted page (on
Altix that means all 128 cachelines of 128 bytes each) must be loaded and later
written back. This patch makes it possible to avoid loading all cachelines
if only a part of the cachelines of that page is needed immediately after
the fault. Doing so will only be effective for sparsely accessed memory,
which is typical for anonymous memory and pte maps. Prezeroed pages will
only be used for those purposes. Unzeroed pages will be used as usual for
file mapping, page caching, etc.
Others have also thought that prezeroing could be a benefit and have tried
to provide zeroed pages to the page fault handler:
http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2
However, these attempts tried to zero pages that are likely to be used
soon (and that may have recently been accessed). Elements of these pages
are thus already in the cpu caches. Approaches like that only shift
processing somewhere else and do not bring any performance benefit.
Prezeroing only makes sense for pages that are not currently needed and that
are not in the cpu caches. Pages that have recently been touched and that
will soon be touched again are better hot-zeroed, since the zeroing will
largely be done to cachelines already in the cpu caches.
The patch makes prezeroing very effective by:
1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many future order-0 pages being zeroed in one step.
For that purpose the existing clear_page function is extended and made to
take an additional argument specifying the order of the page to be cleared.
2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.
The scrub daemon is invoked when an unzeroed page of a certain order has
been generated, so that it's worth running it. If no higher order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay off the
processor as much as possible.
The result is a significant increase in page fault performance even for
single-threaded applications (i386, 2x PIII-450, 384M RAM, allocating 256M in
each run):
w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 1 1 0.006s 0.389s 0.039s 157455.320 157070.694
0 1 2 0.007s 0.607s 0.032s 101476.689 190350.885
w/patch
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 1 1 0.008s 0.083s 0.009s 672151.422 664045.899
0 1 2 0.005s 0.129s 0.008s 459629.796 741857.373
The performance can only be upheld if enough zeroed pages are available.
In a heavy memory-intensive benchmark the system may run out of these very
fast, but the efficient algorithm for page zeroing still makes this a winner
(2-way system with 384MB RAM, no hardware zeroing support). In the following
measurement the test is repeated 10 times, allocating 256M each in rapid
succession, which depletes the pool of zeroed pages quickly:
w/o patch:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 10 1 0.058s 3.913s 3.097s 157335.774 157076.932
0 10 2 0.063s 6.139s 3.027s 100756.788 190572.486
w/patch
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
0 10 1 0.059s 1.828s 1.089s 330913.517 330225.515
0 10 2 0.082s 1.951s 1.094s 307172.100 320680.232
Note that prezeroing of pages makes no sense if the application touches all
cache lines of an allocated page (for that reason prezeroing has no influence
on benchmarks like lmbench): the extensive caching of modern cpus means that
the zeroes written to a hot-zeroed page will be overwritten by the application
in the cpu cache, and thus the zeros will never make it to memory! The test
program used above only touches one 128-byte cache line of a 16k page (ia64).
Sparsely populated and accessed areas are typical for lots of applications.
Here is another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:
Gb Rep Thr CLine User System Wall flt/cpu/s fault/wsec
1 1 1 1 0.01s 0.12s 0.01s 500813.853 497925.891
1 1 1 2 0.01s 0.11s 0.01s 493453.103 472877.725
1 1 1 4 0.02s 0.10s 0.01s 479351.658 471507.415
1 1 1 8 0.01s 0.13s 0.01s 424742.054 416725.013
1 1 1 16 0.05s 0.12s 0.01s 347715.359 336983.834
1 1 1 32 0.12s 0.13s 0.02s 258112.286 256246.731
1 1 1 64 0.24s 0.14s 0.03s 169896.381 168189.283
1 1 1 128 0.49s 0.14s 0.06s 102300.257 101674.435
The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective
if the whole page is not immediately used after the page fault.
The patch is composed of 4 parts:
[1/4] Introduce __GFP_ZERO
Modifies the page allocator to be able to take the __GFP_ZERO flag
and return zeroed memory on request. Modifies locations throughout
the linux sources that retrieve a page and then zero it to request
a zeroed page instead.
[2/4] Architecture specific clear_page updates
Adds second order argument to clear_page and updates all arches.
Note: The first two patches may be used alone if no zeroing engine is wanted.
[3/4] Page Zeroing
Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. scrubd is disabled by default but can be enabled
by writing an order number to /proc/sys/vm/scrub_start. If a page
is coalesced of that order or higher then the scrub daemon will
start zeroing until all pages of order /proc/sys/vm/scrub_stop and
higher are zeroed and then go back to sleep.
In an SMP environment the scrub daemon typically runs
on the most idle cpu. Thus a single-threaded application running
on one cpu may have the other cpu zeroing pages for it, etc. The scrub
daemon is hardly noticeable and usually finishes zeroing quickly since
most processors are optimized for linear memory filling.
[4/4] SGI Altix Block Transfer Engine Support
Implements a driver to shift the zeroing off the cpu into hardware.
With hardware support there will be minimal impact of zeroing
on the performance of the system.
* Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-04 23:12 ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
@ 2005-01-04 23:13 ` Christoph Lameter
2005-01-04 23:45 ` Dave Hansen
` (2 more replies)
2005-01-04 23:14 ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
` (2 subsequent siblings)
3 siblings, 3 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:13 UTC (permalink / raw)
To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
requesting zeroed pages from the page allocator.
o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set
o Replaces all page zeroing after allocating pages by requests for
zeroed pages.
o Requires arch updates to clear_page in order to function properly.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-04 12:16:49.000000000 -0800
@@ -584,6 +584,18 @@
BUG_ON(bad_range(zone, page));
mod_page_state_zone(zone, pgalloc, 1 << order);
prep_new_page(page, order);
+
+ if (gfp_flags & __GFP_ZERO) {
+#ifdef CONFIG_HIGHMEM
+ if (PageHighMem(page)) {
+ int n = 1 << order;
+
+ while (n-- >0)
+ clear_highpage(page + n);
+ } else
+#endif
+ clear_page(page_address(page), order);
+ }
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
@@ -796,12 +808,9 @@
*/
BUG_ON(gfp_mask & __GFP_HIGHMEM);
- page = alloc_pages(gfp_mask, 0);
- if (page) {
- void *address = page_address(page);
- clear_page(address);
- return (unsigned long) address;
- }
+ page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+ if (page)
+ return (unsigned long) page_address(page);
return 0;
}
Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h 2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h 2005-01-04 12:16:49.000000000 -0800
@@ -37,6 +37,7 @@
#define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
#define __GFP_COMP 0x4000 /* Add compound page metadata */
+#define __GFP_ZERO 0x8000 /* Return zeroed page on success */
#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-04 12:16:49.000000000 -0800
@@ -1650,10 +1650,9 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
if (!page)
goto no_mem;
- clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/kernel/profile.c
===================================================================
--- linux-2.6.10.orig/kernel/profile.c 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/kernel/profile.c 2005-01-04 12:16:49.000000000 -0800
@@ -326,17 +326,15 @@
node = cpu_to_node(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
return NOTIFY_BAD;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_free;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
}
break;
@@ -510,16 +508,14 @@
int node = cpu_to_node(cpu);
struct page *page;
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0]
= (struct profile_hit *)page_address(page);
}
Index: linux-2.6.10/mm/shmem.c
===================================================================
--- linux-2.6.10.orig/mm/shmem.c 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/mm/shmem.c 2005-01-04 12:16:49.000000000 -0800
@@ -369,9 +369,8 @@
}
spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
if (page) {
- clear_highpage(page);
page->nr_swapped = 0;
}
spin_lock(&info->lock);
@@ -910,7 +909,7 @@
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp, &pvma, 0);
+ page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -926,7 +925,7 @@
shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
unsigned long idx)
{
- return alloc_page(gfp);
+ return alloc_page(gfp | __GFP_ZERO);
}
#endif
@@ -1135,7 +1134,6 @@
info->alloced++;
spin_unlock(&info->lock);
- clear_highpage(filepage);
flush_dcache_page(filepage);
SetPageUptodate(filepage);
}
Index: linux-2.6.10/mm/hugetlb.c
===================================================================
--- linux-2.6.10.orig/mm/hugetlb.c 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c 2005-01-04 12:16:49.000000000 -0800
@@ -77,7 +77,6 @@
struct page *alloc_huge_page(void)
{
struct page *page;
- int i;
spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -88,8 +87,7 @@
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
- clear_highpage(&page[i]);
+ clear_page(page_address(page), HUGETLB_PAGE_ORDER);
return page;
}
Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -61,9 +61,7 @@
pgd_t *pgd = pgd_alloc_one_fast(mm);
if (unlikely(pgd == NULL)) {
- pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
- if (likely(pgd != NULL))
- clear_page(pgd);
+ pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
}
return pgd;
}
@@ -106,10 +104,8 @@
static inline pmd_t*
pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pmd != NULL))
- clear_page(pmd);
return pmd;
}
@@ -140,20 +136,16 @@
static inline struct page *
pte_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
- if (likely(pte != NULL))
- clear_page(page_address(pte));
return pte;
}
static inline pte_t *
pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pte != NULL))
- clear_page(pte);
return pte;
}
Index: linux-2.6.10/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/i386/mm/pgtable.c 2005-01-04 12:16:39.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/pgtable.c 2005-01-04 12:16:49.000000000 -0800
@@ -140,10 +140,7 @@
pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
- return pte;
+ return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
}
struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -151,12 +148,10 @@
struct page *pte;
#ifdef CONFIG_HIGHPTE
- pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
#else
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
#endif
- if (pte)
- clear_highpage(pte);
return pte;
}
Index: linux-2.6.10/arch/m68k/mm/motorola.c
===================================================================
--- linux-2.6.10.orig/arch/m68k/mm/motorola.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/m68k/mm/motorola.c 2005-01-04 12:16:49.000000000 -0800
@@ -1,4 +1,4 @@
-/*
+/*
* linux/arch/m68k/motorola.c
*
* Routines specific to the Motorola MMU, originally from:
@@ -50,7 +50,7 @@
ptablep = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE);
- clear_page(ptablep);
+ clear_page(ptablep, 0);
__flush_page_to_ram(ptablep);
flush_tlb_kernel_page(ptablep);
nocache_page(ptablep);
@@ -90,7 +90,7 @@
if (((unsigned long)last_pgtable & ~PAGE_MASK) == 0) {
last_pgtable = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE);
- clear_page(last_pgtable);
+ clear_page(last_pgtable, 0);
__flush_page_to_ram(last_pgtable);
flush_tlb_kernel_page(last_pgtable);
nocache_page(last_pgtable);
Index: linux-2.6.10/include/asm-mips/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/pgalloc.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-mips/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -56,9 +56,7 @@
{
pte_t *pte;
- pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);
return pte;
}
Index: linux-2.6.10/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/mm/init.c 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/arch/alpha/mm/init.c 2005-01-04 12:16:49.000000000 -0800
@@ -42,10 +42,9 @@
{
pgd_t *ret, *init;
- ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+ ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
init = pgd_offset(&init_mm, 0UL);
if (ret) {
- clear_page(ret);
#ifdef CONFIG_ALPHA_LARGE_VMALLOC
memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
pte_t *
pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
Index: linux-2.6.10/include/asm-parisc/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/pgalloc.h 2004-12-24 13:35:39.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -120,18 +120,14 @@
static inline struct page *
pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
- struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
- if (likely(page != NULL))
- clear_page(page_address(page));
+ struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return page;
}
static inline pte_t *
pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (likely(pte != NULL))
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
Index: linux-2.6.10/arch/sh/mm/pg-sh4.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-sh4.c 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-sh4.c 2005-01-04 12:16:49.000000000 -0800
@@ -34,7 +34,7 @@
{
__set_bit(PG_mapped, &page->flags);
if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0)
- clear_page(to);
+ clear_page(to, 0);
else {
pgprot_t pgprot = __pgprot(_PAGE_PRESENT |
_PAGE_RW | _PAGE_CACHABLE |
Index: linux-2.6.10/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/pgalloc.h 2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -73,10 +73,9 @@
struct page *page;
preempt_enable();
- page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+ page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (page) {
ret = (struct page *)page_address(page);
- clear_page(ret);
page->lru.prev = (void *) 2UL;
preempt_disable();
Index: linux-2.6.10/include/asm-sh/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/pgalloc.h 2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-sh/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -44,9 +44,7 @@
{
pte_t *pte;
- pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
return pte;
}
@@ -56,9 +54,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.10/include/asm-m32r/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/pgalloc.h 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -23,10 +23,7 @@
*/
static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
{
- pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
- if (pgd)
- clear_page(pgd);
+ pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
return pgd;
}
@@ -39,10 +36,7 @@
static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
return pte;
}
@@ -50,10 +44,8 @@
static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
unsigned long address)
{
- struct page *pte = alloc_page(GFP_KERNEL);
+ struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);
- if (pte)
- clear_page(page_address(pte));
return pte;
}
Index: linux-2.6.10/arch/um/kernel/mem.c
===================================================================
--- linux-2.6.10.orig/arch/um/kernel/mem.c 2005-01-04 12:16:40.000000000 -0800
+++ linux-2.6.10/arch/um/kernel/mem.c 2005-01-04 12:16:49.000000000 -0800
@@ -327,9 +327,7 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
@@ -337,9 +335,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_highpage(pte);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.10/arch/ppc64/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/ppc64/mm/init.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ppc64/mm/init.c 2005-01-04 12:16:49.000000000 -0800
@@ -761,7 +761,7 @@
void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
{
- clear_page(page);
+ clear_page(page, 0);
if (cur_cpu_spec->cpu_features & CPU_FTR_COHERENT_ICACHE)
return;
Index: linux-2.6.10/include/asm-sh64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/pgalloc.h 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -112,9 +112,7 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);
return pte;
}
@@ -123,9 +121,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
@@ -150,9 +146,7 @@
static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
pmd_t *pmd;
- pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pmd)
- clear_page(pmd);
+ pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pmd;
}
Index: linux-2.6.10/include/asm-cris/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/pgalloc.h 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-cris/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -24,18 +24,14 @@
extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.10/arch/ppc/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/ppc/mm/pgtable.c 2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/pgtable.c 2005-01-04 12:16:49.000000000 -0800
@@ -85,8 +85,7 @@
{
pgd_t *ret;
- if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
- clear_pages(ret, PGDIR_ORDER);
+ ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
return ret;
}
@@ -102,7 +101,7 @@
extern void *early_get_page(void);
if (mem_init_done) {
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
struct page *ptepage = virt_to_page(pte);
ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
}
} else
pte = (pte_t *)early_get_page();
- if (pte)
- clear_page(pte);
return pte;
}
Index: linux-2.6.10/arch/ppc/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/ppc/mm/init.c 2005-01-04 12:16:40.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/init.c 2005-01-04 12:16:49.000000000 -0800
@@ -594,7 +594,7 @@
}
void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
{
- clear_page(page);
+ clear_page(page, 0);
clear_bit(PG_arch_1, &pg->flags);
}
Index: linux-2.6.10/fs/afs/file.c
===================================================================
--- linux-2.6.10.orig/fs/afs/file.c 2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c 2005-01-04 12:16:49.000000000 -0800
@@ -172,7 +172,7 @@
(size_t) PAGE_SIZE);
desc.buffer = kmap(page);
- clear_page(desc.buffer);
+ clear_page(desc.buffer, 0);
/* read the contents of the file from the server into the
* page */
Index: linux-2.6.10/include/asm-alpha/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/pgalloc.h 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -40,9 +40,7 @@
static inline pmd_t *
pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
- pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (ret)
- clear_page(ret);
+ pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return ret;
}
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h 2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h 2005-01-04 12:16:49.000000000 -0800
@@ -45,7 +45,7 @@
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
- clear_page(kaddr);
+ clear_page(kaddr, 0);
kunmap_atomic(kaddr, KM_USER0);
}
Index: linux-2.6.10/arch/sh64/mm/ioremap.c
===================================================================
--- linux-2.6.10.orig/arch/sh64/mm/ioremap.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh64/mm/ioremap.c 2005-01-04 12:16:49.000000000 -0800
@@ -399,7 +399,7 @@
if (pte_none(*ptep) || !pte_present(*ptep))
return;
- clear_page((void *)ptep);
+ clear_page((void *)ptep, 0);
pte_clear(ptep);
}
Index: linux-2.6.10/include/asm-m68k/motorola_pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/motorola_pgalloc.h 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/motorola_pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -12,9 +12,8 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
- clear_page(pte);
__flush_page_to_ram(pte);
flush_tlb_kernel_page(pte);
nocache_page(pte);
@@ -31,7 +30,7 @@
static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
- struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
pte_t *pte;
if(!page)
@@ -39,7 +38,6 @@
pte = kmap(page);
if (pte) {
- clear_page(pte);
__flush_page_to_ram(pte);
flush_tlb_kernel_page(pte);
nocache_page(pte);
Index: linux-2.6.10/arch/sh/mm/pg-sh7705.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-sh7705.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-sh7705.c 2005-01-04 12:16:49.000000000 -0800
@@ -78,13 +78,13 @@
__set_bit(PG_mapped, &page->flags);
if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0) {
- clear_page(to);
+ clear_page(to, 0);
__flush_wback_region(to, PAGE_SIZE);
} else {
__flush_purge_virtual_region(to,
(void *)(address & 0xfffff000),
PAGE_SIZE);
- clear_page(to);
+ clear_page(to, 0);
__flush_wback_region(to, PAGE_SIZE);
}
}
Index: linux-2.6.10/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/mm/init.c 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/init.c 2005-01-04 12:16:49.000000000 -0800
@@ -1687,13 +1687,12 @@
* Set up the zero page, mark it reserved, so that page count
* is not manipulated when freeing the page from user ptes.
*/
- mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+ mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
if (mem_map_zero == NULL) {
prom_printf("paging_init: Cannot alloc zero page.\n");
prom_halt();
}
SetPageReserved(mem_map_zero);
- clear_page(page_address(mem_map_zero));
codepages = (((unsigned long) _etext) - ((unsigned long) _start));
codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.10/include/asm-arm/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/pgalloc.h 2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-arm/pgalloc.h 2005-01-04 12:16:49.000000000 -0800
@@ -50,9 +50,8 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
- clear_page(pte);
clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
pte += PTRS_PER_PTE;
}
@@ -65,10 +64,9 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
if (pte) {
void *page = page_address(pte);
- clear_page(page);
clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
}
Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c 2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c 2005-01-04 12:16:49.000000000 -0800
@@ -657,7 +657,7 @@
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
} else {
- clear_page(lp->fd_buf);
+ clear_page(lp->fd_buf, 0);
#ifdef __mips__
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
Index: linux-2.6.10/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.10.orig/drivers/block/pktcdvd.c 2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/drivers/block/pktcdvd.c 2005-01-04 12:16:49.000000000 -0800
@@ -135,12 +135,10 @@
goto no_bio;
for (i = 0; i < PAGES_PER_PACKET; i++) {
- pkt->pages[i] = alloc_page(GFP_KERNEL);
+ pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO);
if (!pkt->pages[i])
goto no_page;
}
- for (i = 0; i < PAGES_PER_PACKET; i++)
- clear_page(page_address(pkt->pages[i]));
spin_lock_init(&pkt->lock);
^ permalink raw reply [flat|nested] 89+ messages in thread
* Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
2005-01-04 23:12 ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
2005-01-04 23:13 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-04 23:14 ` Christoph Lameter
2005-01-05 23:25 ` Christoph Lameter
2005-01-04 23:15 ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
2005-01-04 23:16 ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
3 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:14 UTC (permalink / raw)
To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
o Extend clear_page to take an order parameter for all architectures.
Architecture support:
---------------------
Known to work:
ia64
i386
sparc64
m68k
Trivially modified; expected to simply work:
arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um
Modifications made, but feedback from the arch maintainers would be welcome:
x86_64
s390
alpha
sh
mips
m32r
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -56,7 +56,7 @@
# ifdef __KERNEL__
# define STRICT_MM_TYPECHECKS
-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
extern void copy_page (void *to, void *from);
/*
@@ -65,7 +65,7 @@
*/
#define clear_user_page(addr, vaddr, page) \
do { \
- clear_page(addr); \
+ clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h 2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -18,7 +18,7 @@
#include <asm/mmx.h>
-#define clear_page(page) mmx_clear_page((void *)(page))
+#define clear_page(page, order) mmx_clear_page((void *)(page),order)
#define copy_page(to,from) mmx_copy_page(to,from)
#else
@@ -28,12 +28,12 @@
* Maybe the K6-III ?
*/
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#endif
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h 2005-01-04 12:16:41.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -32,10 +32,10 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-void clear_page(void *);
+void clear_page(void *, int);
void copy_page(void *, void *);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -28,10 +28,10 @@
#ifndef __ASSEMBLY__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
sparc_flush_page_to_ram(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -22,12 +22,12 @@
#ifndef __s390x__
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
register_pair rp;
rp.subreg.even = (unsigned long) page;
- rp.subreg.odd = (unsigned long) 4096;
+ rp.subreg.odd = (unsigned long) 4096 << order;
asm volatile (" slr 1,1\n"
" mvcl %0,0"
: "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@
#else /* __s390x__ */
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
- asm volatile (" lgr 2,%0\n"
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ asm volatile (" lgr 2,%0\n"
" lghi 3,4096\n"
" slgr 1,1\n"
" mvcl 2,0"
: : "a" ((void *) (page))
: "memory", "cc", "1", "2", "3" );
+ page += PAGE_SIZE;
+ }
}
static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@
#endif /* __s390x__ */
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/* Pure 2^n version of get_order */
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c 2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c 2005-01-04 12:34:03.000000000 -0800
@@ -128,7 +128,7 @@
* other MMX using processors do not.
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -138,7 +138,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/64;i++)
+ for(i=0;i<((4096/64) << order);i++)
{
__asm__ __volatile__ (
" movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
* Generic MMX implementation without K7 specific streaming
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -267,7 +267,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/128;i++)
+ for(i=0;i<((4096/128) << order);i++)
{
__asm__ __volatile__ (
" movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
* Favour MMX for page clear and copy.
*/
-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
{
int d0, d1;
__asm__ __volatile__( \
"cld\n\t" \
"rep ; stosl" \
: "=&c" (d0), "=&D" (d1)
- :"a" (0),"1" (page),"0" (1024)
+ :"a" (0),"1" (page),"0" (1024 << order)
:"memory");
}
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
{
if(unlikely(in_interrupt()))
- slow_zero_page(page);
+ slow_clear_page(page, order);
else
- fast_clear_page(page);
+ fast_clear_page(page, order);
}
static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h 2005-01-04 12:34:03.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S 2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S 2005-01-04 12:34:03.000000000 -0800
@@ -7,6 +7,7 @@
* 1/06/01 davidm Tuned for Itanium.
* 2/12/02 kchen Tuned for both Itanium and McKinley
* 3/08/02 davidm Some more tweaking
+ * 12/10/04 clameter Make it work on pages of order size
*/
#include <linux/config.h>
@@ -29,27 +30,33 @@
#define dst4 r11
#define dst_last r31
+#define totsize r14
GLOBAL_ENTRY(clear_page)
.prologue
- .regstk 1,0,0,0
- mov r16 = PAGE_SIZE/L3_LINE_SIZE-1 // main loop count, -1=repeat/until
+ .regstk 2,0,0,0
+ mov r16 = PAGE_SIZE/L3_LINE_SIZE // main loop count
+ mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+ ;;
.body
+ adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
- adds dst1 = 16, in0
adds dst2 = 32, in0
+ shl r16 = r16, in1
+ shl totsize = totsize, in1
;;
.fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless
br.cloop.sptk.few .fetch
+ add r16 = -1,r16
+ add dst_last = totsize, dst_fetch
+ adds dst4 = 64, in0
;;
- addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
mov ar.lc = r16 // one L3 line per iteration
- adds dst4 = 64, in0
+ adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
;;
#ifdef CONFIG_ITANIUM
// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S 2005-01-04 12:34:03.000000000 -0800
@@ -7,6 +7,7 @@
clear_page:
xorl %eax,%eax
movl $4096/64,%ecx
+ xchgl %esi,%ecx
+ shll %cl,%esi
+ xchgl %esi,%ecx
.p2align 4
.Lloop:
decl %ecx
@@ -42,6 +43,7 @@
.section .altinstr_replacement,"ax"
clear_page_c:
movl $4096/8,%ecx
+ xchgl %esi,%ecx
+ shll %cl,%esi
+ xchgl %esi,%ecx
xorl %eax,%eax
rep
stosq
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -36,12 +36,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
extern void (*copy_page)(void *to, void *from);
extern void clear_page_slow(void *to);
extern void copy_page_slow(void *to, void *from);
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
#if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
struct page;
extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
extern void __clear_user_page(void *to, void *orig_to);
extern void __copy_user_page(void *to, void *from, void *orig_to);
#elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#elif defined(CONFIG_CPU_SH4)
struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h 2005-01-04 12:34:03.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S 2005-01-04 12:34:03.000000000 -0800
@@ -6,11 +6,10 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
-
lda $0,128
nop
unop
@@ -36,4 +35,4 @@
unop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -50,12 +50,20 @@
extern void sh64_page_clear(void *page);
extern void sh64_page_copy(void *from, void *to);
-#define clear_page(page) sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ sh64_page_clear(page);
+ page += PAGE_SIZE;
+ }
+}
+
#define copy_page(to,from) sh64_page_copy(from, to)
#if defined(CONFIG_DCACHE_DISABLED)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -128,7 +128,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
extern void copy_page(void *to, const void *from);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h 2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -102,12 +102,12 @@
#define REGION_MASK (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
#define REGION_STRIDE (1UL << REGION_SHIFT)
-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
{
unsigned long lines, line_size;
line_size = systemcfg->dCacheL1LineSize;
- lines = naca->dCacheL1LinesPerPage;
+ lines = naca->dCacheL1LinesPerPage << order;
__asm__ __volatile__(
"mtctr %1 # clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -11,10 +11,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+
extern void copy_page(void *to, void *from);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -15,8 +15,20 @@
#define STRICT_MM_TYPECHECKS
-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c 2005-01-04 12:34:03.000000000 -0800
@@ -42,7 +42,7 @@
#ifdef CONFIG_SIBYTE_DMA_PAGEOPS
static inline void clear_page_cpu(void *page)
#else
-void clear_page(void *page)
+void _clear_page(void *page)
#endif
{
unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
}
-void clear_page(void *page)
+void _clear_page(void *page)
{
int cpu = smp_processor_id();
/* if the page is above Kseg0, use old way */
if (KSEGX(page) != CAC_BASE)
return clear_page_cpu(page);
-
page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@
#endif
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h 2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -50,7 +50,7 @@
);
}
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
unsigned long tmp;
unsigned long *sp = page;
@@ -69,16 +69,16 @@
"dbra %1,1b\n\t"
: "=a" (sp), "=d" (tmp)
: "a" (page), "0" (sp),
- "1" ((PAGE_SIZE - 16) / 16 - 1));
+ "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
}
#else
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
#endif
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -39,7 +39,18 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
extern void copy_page(void * to, void * from);
extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
{
extern void (*flush_data_cache_page)(unsigned long addr);
- clear_page(addr);
+ clear_page(addr, 0);
if (pages_do_alias((unsigned long) addr, vaddr))
flush_data_cache_page((unsigned long)addr);
}
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -15,10 +15,10 @@
#ifdef __KERNEL__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -37,11 +37,11 @@
#define STRICT_MM_TYPECHECKS
-#define clear_page(page) memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset ((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to, from) memcpy ((void *)(to), (void *)from, PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h 2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -13,7 +13,7 @@
#include <asm/types.h>
#include <asm/cache.h>
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) copy_user_page_asm((void *)(to), (void *)(from))
struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c 2005-01-04 12:34:03.000000000 -0800
@@ -47,7 +47,7 @@
*/
void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
{
- clear_page(kaddr);
+ clear_page(kaddr, 0);
}
/*
@@ -116,7 +116,7 @@
set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
flush_tlb_kernel_page(to);
- clear_page((void *)to);
+ clear_page((void *)to, 0);
spin_unlock(&v6_lock);
}
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S 2005-01-04 12:34:03.000000000 -0800
@@ -51,7 +51,7 @@
jmp r14
.text
- .global clear_page
+ .global _clear_page
/*
* clear_page (to)
*
@@ -60,7 +60,7 @@
* 16 * 256
*/
.align 4
-clear_page:
+_clear_page:
ldi r2, #255
ldi r4, #0
ld r3, @r0 /* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -85,7 +85,7 @@
struct page;
extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define clear_page clear_pages
extern void copy_page(void *to, void *from);
extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c 2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c 2005-01-04 12:34:03.000000000 -0800
@@ -88,7 +88,7 @@
EXPORT_SYMBOL(__memsetw);
EXPORT_SYMBOL(__constant_c_memset);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(__direct_map_base);
EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S 2005-01-04 12:34:03.000000000 -0800
@@ -6,9 +6,9 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
lda $0,128
@@ -51,4 +51,4 @@
nop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c 2005-01-04 12:34:03.000000000 -0800
@@ -57,7 +57,7 @@
#endif
void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);
void show_mem(void)
{
@@ -255,7 +255,7 @@
* later in the boot process if a better method is available.
*/
copy_page = copy_page_slow;
- clear_page = clear_page_slow;
+ _clear_page = clear_page_slow;
/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c 2005-01-04 12:34:03.000000000 -0800
@@ -78,7 +78,7 @@
return ret;
copy_page = copy_page_dma;
- clear_page = clear_page_dma;
+ _clear_page = clear_page_dma;
return ret;
}
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c 2005-01-04 12:34:03.000000000 -0800
@@ -27,7 +27,7 @@
static int __init pg_nommu_init(void)
{
copy_page = copy_page_nommu;
- clear_page = clear_page_nommu;
+ _clear_page = clear_page_nommu;
return 0;
}
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c 2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c 2005-01-04 12:34:03.000000000 -0800
@@ -39,9 +39,9 @@
static unsigned int clear_page_array[0x130 / 4];
-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
/*
* Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c 2005-01-04 12:34:03.000000000 -0800
@@ -102,7 +102,7 @@
EXPORT_SYMBOL(memcmp);
EXPORT_SYMBOL(memscan);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(strcat);
EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h 2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -25,7 +25,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
#define copy_page(to, from) __copy_user_page(to, from, 0);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h 2005-01-04 12:34:03.000000000 -0800
@@ -14,8 +14,8 @@
#ifndef __ASSEMBLY__
-extern void _clear_page(void *page);
-#define clear_page(X) _clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y) _clear_page((void *)(X),(Y))
struct page;
extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
#define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S 2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S 2005-01-04 12:34:03.000000000 -0800
@@ -28,9 +28,12 @@
.text
.globl _clear_page
-_clear_page: /* %o0=dest */
+_clear_page: /* %o0=dest, %o1=order */
+ sethi %hi(PAGE_SIZE/64), %o2
+ clr %o4
+ or %o2, %lo(PAGE_SIZE/64), %o2
ba,pt %xcc, clear_page_common
- clr %o4
+ sllx %o2, %o1, %o1
/* This thing is pretty important, it shows up
* on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
flush %g6
wrpr %o4, 0x0, %pstate
+ sethi %hi(PAGE_SIZE/64), %o1
mov 1, %o4
+ or %o1, %lo(PAGE_SIZE/64), %o1
clear_page_common:
VISEntryHalf
membar #StoreLoad | #StoreStore | #LoadStore
fzero %f0
- sethi %hi(PAGE_SIZE/64), %o1
mov %o0, %g1 ! remember vaddr for tlbflush
fzero %f2
- or %o1, %lo(PAGE_SIZE/64), %o1
faddd %f0, %f2, %f4
fmuld %f0, %f2, %f6
faddd %f0, %f2, %f8
^ permalink raw reply [flat|nested] 89+ messages in thread
* Prezeroing V3 [3/4]: Page zeroing through kscrubd
2005-01-04 23:12 ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
2005-01-04 23:13 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-04 23:14 ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
@ 2005-01-04 23:15 ` Christoph Lameter
2005-01-04 23:16 ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
3 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:15 UTC (permalink / raw)
To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
o Add page zeroing
o Add scrub daemon
o Add ability to view the amount of zeroed memory in /proc/meminfo
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-04 14:17:02.000000000 -0800
@@ -12,6 +12,7 @@
* Zone balancing, Kanoj Sarcar, SGI, Jan 2000
* Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
* (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Support for page zeroing, Christoph Lameter, SGI, Dec 2004
*/
#include <linux/config.h>
@@ -33,6 +34,7 @@
#include <linux/cpu.h>
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
+#include <linux/scrub.h>
#include <asm/tlbflush.h>
@@ -180,7 +182,7 @@
* -- wli
*/
-static inline void __free_pages_bulk (struct page *page, struct page *base,
+static inline int __free_pages_bulk (struct page *page, struct page *base,
struct zone *zone, struct free_area *area, unsigned int order)
{
unsigned long page_idx, index, mask;
@@ -193,11 +195,10 @@
BUG();
index = page_idx >> (1 + order);
- zone->free_pages += 1 << order;
while (order < MAX_ORDER-1) {
struct page *buddy1, *buddy2;
- BUG_ON(area >= zone->free_area + MAX_ORDER);
+ BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER);
if (!__test_and_change_bit(index, area->map))
/*
* the buddy page is still allocated.
@@ -219,6 +220,7 @@
}
list_add(&(base + page_idx)->lru, &area->free_list);
area->nr_free++;
+ return order;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -261,7 +263,7 @@
int ret = 0;
base = zone->zone_mem_map;
- area = zone->free_area + order;
+ area = zone->free_area[NOT_ZEROED] + order;
spin_lock_irqsave(&zone->lock, flags);
zone->all_unreclaimable = 0;
zone->pages_scanned = 0;
@@ -269,7 +271,10 @@
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
- __free_pages_bulk(page, base, zone, area, order);
+ zone->free_pages += 1 << order;
+ if (__free_pages_bulk(page, base, zone, area, order)
+ >= sysctl_scrub_start)
+ wakeup_kscrubd(zone);
ret++;
}
spin_unlock_irqrestore(&zone->lock, flags);
@@ -291,6 +296,21 @@
free_pages_bulk(page_zone(page), 1, &list, order);
}
+void end_zero_page(struct page *page)
+{
+ unsigned long flags;
+ int order = page->index;
+ struct zone * zone = page_zone(page);
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ zone->zero_pages += 1 << order;
+ __free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order);
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+
#define MARK_USED(index, order, area) \
__change_bit((index) >> (1+(order)), (area)->map)
@@ -370,26 +390,47 @@
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static inline void rmpage(struct page *page, struct zone *zone, struct free_area *area, int order)
+{
+ list_del(&page->lru);
+ area->nr_free--;
+ if (order != MAX_ORDER-1)
+ MARK_USED(page - zone->zone_mem_map, order, area);
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order)
+{
+ unsigned long flags;
+ struct page *page = NULL;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ if (!list_empty(&area->free_list)) {
+ page = list_entry(area->free_list.next, struct page, lru);
+
+ rmpage(page, zone, area, order);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
{
struct free_area * area;
unsigned int current_order;
struct page *page;
- unsigned int index;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ area = zone->free_area[zero] + current_order;
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- area->nr_free--;
- index = page - zone->zone_mem_map;
- if (current_order != MAX_ORDER-1)
- MARK_USED(index, current_order, area);
+ rmpage(page, zone, area, current_order);
zone->free_pages -= 1UL << order;
- return expand(zone, page, index, order, current_order, area);
+ if (zero)
+ zone->zero_pages -= 1UL << order;
+ return expand(zone, page, page - zone->zone_mem_map, order, current_order, area);
}
return NULL;
@@ -401,7 +442,7 @@
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list, int zero)
{
unsigned long flags;
int i;
@@ -410,7 +451,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, zero);
if (page == NULL)
break;
allocated++;
@@ -457,7 +498,7 @@
ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
+ list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
unsigned long start_pfn, i;
start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -555,7 +596,9 @@
{
unsigned long flags;
struct page *page = NULL;
- int cold = !!(gfp_flags & __GFP_COLD);
+ int nr_pages = 1 << order;
+ int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+ int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;
if (order == 0) {
struct per_cpu_pages *pcp;
@@ -564,7 +607,7 @@
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list, zero);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -576,19 +619,30 @@
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+
+ page = __rmqueue(zone, order, zero);
+
+ /*
+ * If we failed to obtain a zero and/or unzeroed page
+ * then we may still be able to obtain the other
+ * type of page.
+ */
+ if (!page) {
+ page = __rmqueue(zone, order, !zero);
+ zero = 0;
+ }
+
spin_unlock_irqrestore(&zone->lock, flags);
}
if (page != NULL) {
BUG_ON(bad_range(zone, page));
- mod_page_state_zone(zone, pgalloc, 1 << order);
- prep_new_page(page, order);
+ mod_page_state_zone(zone, pgalloc, nr_pages);
- if (gfp_flags & __GFP_ZERO) {
+ if ((gfp_flags & __GFP_ZERO) && !zero) {
#ifdef CONFIG_HIGHMEM
if (PageHighMem(page)) {
- int n = 1 << order;
+ int n = nr_pages;
while (n-- >0)
clear_highpage(page + n);
@@ -596,6 +650,7 @@
#endif
clear_page(page_address(page), order);
}
+ prep_new_page(page, order);
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
@@ -622,7 +677,7 @@
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
+ free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free) << o;
/* Require fewer higher order pages to be free */
min >>= 1;
@@ -1000,7 +1055,7 @@
}
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat)
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
{
struct zone *zones = pgdat->node_zones;
int i;
@@ -1008,27 +1063,31 @@
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
*active += zones[i].nr_active;
*inactive += zones[i].nr_inactive;
*free += zones[i].free_pages;
+ *zero += zones[i].zero_pages;
}
}
void get_zone_counts(unsigned long *active,
- unsigned long *inactive, unsigned long *free)
+ unsigned long *inactive, unsigned long *free, unsigned long *zero)
{
struct pglist_data *pgdat;
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for_each_pgdat(pgdat) {
- unsigned long l, m, n;
- __get_zone_counts(&l, &m, &n, pgdat);
+ unsigned long l, m, n,o;
+ __get_zone_counts(&l, &m, &n, &o, pgdat);
*active += l;
*inactive += m;
*free += n;
+ *zero += o;
}
}
@@ -1065,6 +1124,7 @@
#define K(x) ((x) << (PAGE_SHIFT-10))
+const char *temp[3] = { "hot", "cold", "zero" };
/*
* Show free area list (used inside shift_scroll-lock stuff)
* We also calculate the percentage fragmentation. We do this by counting the
@@ -1077,6 +1137,7 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
struct zone *zone;
for_each_zone(zone) {
@@ -1097,10 +1158,10 @@
pageset = zone->pageset + cpu;
- for (temperature = 0; temperature < 2; temperature++)
+ for (temperature = 0; temperature < 3; temperature++)
printk("cpu %d %s: low %d, high %d, batch %d\n",
cpu,
- temperature ? "cold" : "hot",
+ temp[temperature],
pageset->pcp[temperature].low,
pageset->pcp[temperature].high,
pageset->pcp[temperature].batch);
@@ -1108,20 +1169,21 @@
}
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
printk("\nFree pages: %11ukB (%ukB HighMem)\n",
K(nr_free_pages()),
K(nr_free_highpages()));
printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
- "unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+ "unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
active,
inactive,
ps.nr_dirty,
ps.nr_writeback,
ps.nr_unstable,
nr_free_pages(),
+ zero,
ps.nr_slab,
ps.nr_mapped,
ps.nr_page_table_pages);
@@ -1170,7 +1232,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
- nr = zone->free_area[order].nr_free;
+ nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
total += nr << order;
printk("%lu*%lukB ", nr, K(1UL) << order);
}
@@ -1493,16 +1555,21 @@
for (order = 0; ; order++) {
unsigned long bitmap_size;
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
if (order == MAX_ORDER-1) {
- zone->free_area[order].map = NULL;
+ zone->free_area[NOT_ZEROED][order].map = NULL;
+ zone->free_area[ZEROED][order].map = NULL;
break;
}
bitmap_size = pages_to_bitmap_size(order, size);
- zone->free_area[order].map =
+ zone->free_area[NOT_ZEROED][order].map =
+ (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
+ zone->free_area[ZEROED][order].map =
(unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
- zone->free_area[order].nr_free = 0;
+ zone->free_area[NOT_ZEROED][order].nr_free = 0;
+ zone->free_area[ZEROED][order].nr_free = 0;
}
}
@@ -1527,6 +1594,7 @@
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->kscrubd_wait);
pgdat->kswapd_max_order = 0;
for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1550,6 +1618,7 @@
spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->zero_pages = 0;
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
@@ -1583,6 +1652,13 @@
pcp->high = 2 * batch;
pcp->batch = 1 * batch;
INIT_LIST_HEAD(&pcp->list);
+
+ pcp = &zone->pageset[cpu].pcp[2]; /* zero pages */
+ pcp->count = 0;
+ pcp->low = 0;
+ pcp->high = 2 * batch;
+ pcp->batch = 1 * batch;
+ INIT_LIST_HEAD(&pcp->list);
}
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
zone_names[j], realsize, batch);
@@ -1708,7 +1784,7 @@
spin_lock_irqsave(&zone->lock, flags);
seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
Index: linux-2.6.10/include/linux/mmzone.h
===================================================================
--- linux-2.6.10.orig/include/linux/mmzone.h 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h 2005-01-04 14:17:02.000000000 -0800
@@ -52,7 +52,7 @@
};
struct per_cpu_pageset {
- struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
+ struct per_cpu_pages pcp[3]; /* 0: hot. 1: cold 2: cold zeroed pages */
#ifdef CONFIG_NUMA
unsigned long numa_hit; /* allocated in intended node */
unsigned long numa_miss; /* allocated in non intended node */
@@ -108,10 +108,14 @@
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
+#define NOT_ZEROED 0
+#define ZEROED 1
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
+ unsigned long zero_pages;
/*
* protection[] is a pre-calculated number of extra pages that must be
* available in a zone in order for __alloc_pages() to allocate memory
@@ -132,7 +136,7 @@
* free areas of different sizes
*/
spinlock_t lock;
- struct free_area free_area[MAX_ORDER];
+ struct free_area free_area[2][MAX_ORDER];
ZONE_PADDING(_pad1_)
@@ -267,6 +271,9 @@
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
+
+ wait_queue_head_t kscrubd_wait;
+ struct task_struct *kscrubd;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -276,9 +283,9 @@
extern struct pglist_data *pgdat_list;
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat);
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
void get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free);
+ unsigned long *free, unsigned long *zero);
void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone, int order);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c 2005-01-04 14:17:02.000000000 -0800
@@ -158,13 +158,14 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
unsigned long vmtot;
unsigned long committed;
unsigned long allowed;
struct vmalloc_info vmi;
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
/*
* display in kilobytes.
@@ -187,6 +188,7 @@
len = sprintf(page,
"MemTotal: %8lu kB\n"
"MemFree: %8lu kB\n"
+ "MemZero: %8lu kB\n"
"Buffers: %8lu kB\n"
"Cached: %8lu kB\n"
"SwapCached: %8lu kB\n"
@@ -210,6 +212,7 @@
"VmallocChunk: %8lu kB\n",
K(i.totalram),
K(i.freeram),
+ K(zero),
K(i.bufferram),
K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
===================================================================
--- linux-2.6.10.orig/mm/readahead.c 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/readahead.c 2005-01-04 14:17:02.000000000 -0800
@@ -573,7 +573,8 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
Index: linux-2.6.10/drivers/base/node.c
===================================================================
--- linux-2.6.10.orig/drivers/base/node.c 2005-01-04 14:17:00.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c 2005-01-04 14:17:02.000000000 -0800
@@ -41,13 +41,15 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
si_meminfo_node(&i, nid);
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));
n = sprintf(buf, "\n"
"Node %d MemTotal: %8lu kB\n"
"Node %d MemFree: %8lu kB\n"
+ "Node %d MemZero: %8lu kB\n"
"Node %d MemUsed: %8lu kB\n"
"Node %d Active: %8lu kB\n"
"Node %d Inactive: %8lu kB\n"
@@ -57,6 +59,7 @@
"Node %d LowFree: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
+ nid, K(zero),
nid, K(i.totalram - i.freeram),
nid, K(active),
nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-01-04 14:17:02.000000000 -0800
@@ -715,6 +715,7 @@
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
+#define PF_KSCRUBD 0x00800000 /* I am kscrubd */
#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
===================================================================
--- linux-2.6.10.orig/mm/Makefile 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/mm/Makefile 2005-01-04 14:17:02.000000000 -0800
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o
+ vmalloc.o scrubd.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o prio_tree.o \
Index: linux-2.6.10/mm/scrubd.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c 2005-01-04 14:58:46.000000000 -0800
@@ -0,0 +1,147 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 7; /* if a page of this order is coalesced then run kscrubd */
+unsigned int sysctl_scrub_stop = 2; /* Minimum order of page to zero */
+unsigned int sysctl_scrub_load = 999; /* Do not run scrubd if load > */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec(table, write, file, buffer, length, ppos);
+ if (sysctl_scrub_start < MAX_ORDER) {
+ struct zone *zone;
+
+ for_each_zone(zone)
+ wakeup_kscrubd(zone);
+ }
+ return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+ int order;
+
+ for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+ struct free_area *area = z->free_area[NOT_ZEROED] + order;
+ if (!list_empty(&area->free_list)) {
+ struct page *page = scrubd_rmpage(z, area, order);
+ struct list_head *l;
+
+ if (!page)
+ continue;
+
+ page->index = order;
+
+ list_for_each(l, &zero_drivers) {
+ struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+ unsigned long size = PAGE_SIZE << order;
+
+ if (driver->start(page_address(page), size) == 0) {
+
+ unsigned ticks = (size*HZ)/driver->rate;
+ if (ticks) {
+ /* Wait the minimum time of the transfer */
+ current->state = TASK_INTERRUPTIBLE;
+ schedule_timeout(ticks);
+ }
+ /* Then keep on checking until transfer is complete */
+ while (!driver->check())
+ schedule();
+ goto out;
+ }
+ }
+
+ /* Unable to find a zeroing device that would
+ * deal with this page so just do it on our own.
+ * This will likely thrash the cpu caches.
+ */
+ cond_resched();
+ clear_page(page_address(page), order);
+out:
+ end_zero_page(page);
+ cond_resched();
+ return 1 << order;
+ }
+ }
+ return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+ int i;
+ unsigned long pages_zeroed;
+
+ if (system_state != SYSTEM_RUNNING)
+ return;
+
+ do {
+ pages_zeroed = 0;
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ pages_zeroed += zero_highest_order_page(zone);
+ }
+ } while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t*)p;
+ struct task_struct *tsk = current;
+ DEFINE_WAIT(wait);
+ cpumask_t cpumask;
+
+ daemonize("kscrubd%d", pgdat->node_id);
+ cpumask = node_to_cpumask(pgdat->node_id);
+ if (!cpus_empty(cpumask))
+ set_cpus_allowed(tsk, cpumask);
+
+ tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+ for ( ; ; ) {
+ if (current->flags & PF_FREEZE)
+ refrigerator(PF_FREEZE);
+ prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+ schedule();
+ finish_wait(&pgdat->kscrubd_wait, &wait);
+
+ scrub_pgdat(pgdat);
+ }
+ return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+ pg_data_t *pgdat;
+ for_each_pgdat(pgdat)
+ pgdat->kscrubd
+ = find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+ return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h 2005-01-04 14:17:02.000000000 -0800
@@ -0,0 +1,51 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+ int (*start)(void *, unsigned long); /* Start bzero transfer */
+ int (*check)(void); /* Check if bzero is complete */
+ unsigned long rate; /* zeroing rate in bytes/sec */
+ struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+ list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+ list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+ if (avenrun[0] >= (unsigned long)sysctl_scrub_load << FSHIFT)
+ return;
+ if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+ return;
+ wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c 2005-01-04 14:17:02.000000000 -0800
@@ -40,6 +40,7 @@
#include <linux/times.h>
#include <linux/limits.h>
#include <linux/dcache.h>
+#include <linux/scrub.h>
#include <linux/syscalls.h>
#include <asm/uaccess.h>
@@ -826,6 +827,33 @@
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = VM_SCRUB_START,
+ .procname = "scrub_start",
+ .data = &sysctl_scrub_start,
+ .maxlen = sizeof(sysctl_scrub_start),
+ .mode = 0644,
+ .proc_handler = &scrub_start_handler,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_STOP,
+ .procname = "scrub_stop",
+ .data = &sysctl_scrub_stop,
+ .maxlen = sizeof(sysctl_scrub_stop),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_LOAD,
+ .procname = "scrub_load",
+ .data = &sysctl_scrub_load,
+ .maxlen = sizeof(sysctl_scrub_load),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
{ .ctl_name = 0 }
};
Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h 2005-01-04 14:17:02.000000000 -0800
@@ -169,6 +169,9 @@
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_SCRUB_START=30, /* page order at which coalescing wakes kscrubd */
+ VM_SCRUB_STOP=31, /* minimum order of pages to zero */
+ VM_SCRUB_LOAD=32, /* load average above which kscrubd stays idle */
};
^ permalink raw reply [flat|nested] 89+ messages in thread
* Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix
2005-01-04 23:12 ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
` (2 preceding siblings ...)
2005-01-04 23:15 ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
@ 2005-01-04 23:16 ` Christoph Lameter
2005-01-05 2:16 ` Andi Kleen
3 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-04 23:16 UTC (permalink / raw)
To: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
o Zeroing driver implemented with the Block Transfer Engine in the Altix
SN2 SHub.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/bte.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/bte.c 2005-01-03 13:36:07.000000000 -0800
@@ -4,6 +4,8 @@
* for more details.
*
* Copyright (c) 2000-2003 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
*/
#include <linux/config.h>
@@ -20,6 +22,8 @@
#include <linux/bootmem.h>
#include <linux/string.h>
#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>
#include <asm/sn/bte.h>
@@ -30,7 +34,7 @@
/* two interfaces on two btes */
#define MAX_INTERFACES_TO_TRY 4
-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
{
nodepda_t *tmp_nodepda;
@@ -132,7 +136,6 @@
if (bte == NULL) {
continue;
}
-
if (spin_trylock(&bte->spinlock)) {
if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
(BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
}
} while (1);
- if (notification == NULL) {
+ if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
/* User does not want to be notified. */
bte->most_rcnt_na = &bte->notify;
} else {
@@ -192,6 +195,8 @@
itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);
+ if (mode & BTE_NOTIFY_AND_GET_POINTER)
+ *(u64 volatile **)(notification) = &bte->notify;
spin_unlock_irqrestore(&bte->spinlock, irq_flags);
if (notification != NULL) {
@@ -449,5 +454,37 @@
mynodepda->bte_if[i].cleanup_active = 0;
mynodepda->bte_if[i].bh_error = 0;
}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+static int bte_check_bzero(void)
+{
+ int node = get_nasid();
+
+ return *(bte_zero_notify[node]) != BTE_WORD_BUSY;
+}
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+ int node = get_nasid();
+
+ /* Check limitations.
+ 1. System must be running (weird things happen during bootup)
+ 2. Size >64KB. Smaller requests cause too much bte traffic
+ */
+ if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+ return EINVAL;
+
+ return bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+}
+
+static struct zero_driver bte_bzero = {
+ .start = bte_start_bzero,
+ .check = bte_check_bzero,
+ .rate = 500000000 /* 500 MB/sec */
+};
+void sn_bte_bzero_init(void) {
+ register_zero_driver(&bte_bzero);
}
Index: linux-2.6.10/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/setup.c 2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/setup.c 2005-01-03 13:36:07.000000000 -0800
@@ -243,6 +243,7 @@
int pxm;
int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
extern void sn_cpu_init(void);
+ extern void sn_bte_bzero_init(void);
/*
* If the generic code has enabled vga console support - lets
@@ -333,6 +334,7 @@
screen_info = sn_screen_info;
sn_timer_init();
+ sn_bte_bzero_init();
}
/**
Index: linux-2.6.10/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/sn/bte.h 2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/sn/bte.h 2005-01-03 13:36:07.000000000 -0800
@@ -48,6 +48,8 @@
#define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
/* Use a reserved bit to let the caller specify a wait for any BTE */
#define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
/* Use the BTE on the node with the destination memory */
#define BTE_USE_DEST (BTE_WACQUIRE << 1)
/* Use any available BTE interface on any node for the transfer */
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-04 23:13 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-04 23:45 ` Dave Hansen
2005-01-05 1:16 ` Christoph Lameter
2005-01-05 0:34 ` Linus Torvalds
2005-01-08 21:12 ` Hugh Dickins
2 siblings, 1 reply; 89+ messages in thread
From: Dave Hansen @ 2005-01-04 23:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
On Tue, 2005-01-04 at 15:13 -0800, Christoph Lameter wrote:
> + if (gfp_flags & __GFP_ZERO) {
> +#ifdef CONFIG_HIGHMEM
> + if (PageHighMem(page)) {
> + int n = 1 << order;
> +
> + while (n-- >0)
> + clear_highpage(page + n);
> + } else
> +#endif
> + clear_page(page_address(page), order);
> + }
> if (order && (gfp_flags & __GFP_COMP))
> prep_compound_page(page, order);
That #ifdef can probably die. The compiler should get that all by
itself:
> #ifdef CONFIG_HIGHMEM
> #define PageHighMem(page) test_bit(PG_highmem, &(page)->flags)
> #else
> #define PageHighMem(page) 0 /* needed to optimize away at compile time */
> #endif
-- Dave
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-04 23:13 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-04 23:45 ` Dave Hansen
@ 2005-01-05 0:34 ` Linus Torvalds
2005-01-05 0:47 ` Andrew Morton
2005-01-08 21:12 ` Hugh Dickins
2 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2005-01-05 0:34 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, linux-ia64, linux-mm, Linux Kernel Development
On Tue, 4 Jan 2005, Christoph Lameter wrote:
>
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.
Ok, let's start merging this slowly, and in particular, this 1/4 one looks
pretty much like a cleanup regardless of whatever else happens, so let's
just do it. However, for it to really be a cleanup, how about making
_this_ part:
> +
> + if (gfp_flags & __GFP_ZERO) {
> +#ifdef CONFIG_HIGHMEM
> + if (PageHighMem(page)) {
> + int n = 1 << order;
> +
> + while (n-- >0)
> + clear_highpage(page + n);
> + } else
> +#endif
> + clear_page(page_address(page), order);
> + }
Match the existing previous part:
> if (order && (gfp_flags & __GFP_COMP))
> prep_compound_page(page, order);
and just split it up into a "prep_zero_page(page, order)"? I dislike
#ifdef's in the middle of deep functions. In the middle of a _trivial_
function it's much more palatable.
At that point at least part 1 ends up being a nice clean patch on its own,
and should even shrink the code-size a bit. IOW, it not only is a cleanup,
there is even a technical argument for it (even without worrying about the
next stages).
Hmm?
Linus
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-05 0:34 ` Linus Torvalds
@ 2005-01-05 0:47 ` Andrew Morton
2005-01-05 1:15 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Andrew Morton @ 2005-01-05 0:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: clameter, linux-ia64, linux-mm, linux-kernel
Linus Torvalds <torvalds@osdl.org> wrote:
>
> On Tue, 4 Jan 2005, Christoph Lameter wrote:
> >
> > This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> > to request zeroed pages from the page allocator.
>
> Ok, let's start merging this slowly
One week hence, please. Things like the no-bitmaps-for-the-buddy-allocator
have been well tested and should go in first.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-05 0:47 ` Andrew Morton
@ 2005-01-05 1:15 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-05 1:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linus Torvalds, linux-ia64, linux-mm, linux-kernel
On Tue, 4 Jan 2005, Andrew Morton wrote:
> > Ok, let's start merging this slowly
>
> One week hence, please. Things like the no-bitmaps-for-the-buddy-allocator
> have been well tested and should go in first.
The first two patches are basically cleanup-type stuff and will not affect
the page allocator in a significant way. On the other hand, they touch many
files and are thus difficult to maintain.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-04 23:45 ` Dave Hansen
@ 2005-01-05 1:16 ` Christoph Lameter
2005-01-05 1:26 ` Linus Torvalds
0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-05 1:16 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, linux-ia64, Linus Torvalds, linux-mm,
Linux Kernel Development
On Tue, 4 Jan 2005, Dave Hansen wrote:
> That #ifdef can probably die. The compiler should get that all by
> itself:
>
> > #ifdef CONFIG_HIGHMEM
> > #define PageHighMem(page) test_bit(PG_highmem, &(page)->flags)
> > #else
> > #define PageHighMem(page) 0 /* needed to optimize away at compile time */
> > #endif
Ahh. Great. Do I need to submit a corrected patch that removes those two
lines or is it fine as is?
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-05 1:16 ` Christoph Lameter
@ 2005-01-05 1:26 ` Linus Torvalds
2005-01-05 23:11 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2005-01-05 1:26 UTC (permalink / raw)
To: Christoph Lameter
Cc: Dave Hansen, Andrew Morton, linux-ia64, linux-mm,
Linux Kernel Development
On Tue, 4 Jan 2005, Christoph Lameter wrote:
>
> Ahh. Great. Do I need to submit a corrected patch that removes those two
> lines or is it fine as is?
Please do split it up into a function of its own. It's going to look a lot
prettier as an intermediate phase. I realize that that touches #3 in the
series, but I suspect that one will also just be prettier as a result.
Linus
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix
2005-01-04 23:16 ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
@ 2005-01-05 2:16 ` Andi Kleen
2005-01-05 16:24 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2005-01-05 2:16 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
> + /* Check limitations.
> + 1. System must be running (weird things happen during bootup)
> + 2. Size >64KB. Smaller requests cause too much bte traffic
> + */
> + if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
> + return EINVAL;
surely return -EINVAL;
Also have you thought about doing a similar driver for x86/x86-64 using
cache bypassing stores?
-Andi
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix
2005-01-05 2:16 ` Andi Kleen
@ 2005-01-05 16:24 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-05 16:24 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel
On Wed, 5 Jan 2005, Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
>
> > + /* Check limitations.
> > + 1. System must be running (weird things happen during bootup)
> > + 2. Size >64KB. Smaller requests cause too much bte traffic
> > + */
> > + if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
> > + return EINVAL;
>
> surely return -EINVAL;
Anything will do as long as it's != 0. But yeah, that would more closely
follow convention.
> Also have you thought about doing a similar driver for x86/x86-64 using
> cache bypassing stores?
As you know we do ia64 and I am no expert on x86_64. But the interface for
hardware zeroing is designed for purposes like that.
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-05 1:26 ` Linus Torvalds
@ 2005-01-05 23:11 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-05 23:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Hansen, Andrew Morton, linux-ia64, linux-mm,
Linux Kernel Development
On Tue, 4 Jan 2005, Linus Torvalds wrote:
> Please do split it up into a function of its own. It's going to look a lot
> prettier as an intermediate phase. I realize that that touches #3 in the
> series, but I suspect that one will also just be prettier as a result.
Here is the first patch redone as you wanted. I also removed all
dependencies on the second patch. This should be able to get in
on its own.
I will send the revised second patch dealing with updating clear_page
later and keep back the last two patches until the bitmap thing has been
changed in the buddy allocator.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
requesting zeroed pages from the page allocator.
- Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set
- Replaces explicit page zeroing after allocation with allocations that
pass __GFP_ZERO
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-05 09:32:52.000000000 -0800
@@ -549,6 +549,12 @@
* we cheat by calling it from here, in the order > 0 path. Saves a branch
* or two.
*/
+static inline void prep_zero_page(struct page *page, int order) {
+ int i;
+
+ for (i = 0; i < (1 << order); i++)
+ clear_highpage(page + i);
+}
static struct page *
buffered_rmqueue(struct zone *zone, int order, int gfp_flags)
@@ -584,6 +590,10 @@
BUG_ON(bad_range(zone, page));
mod_page_state_zone(zone, pgalloc, 1 << order);
prep_new_page(page, order);
+
+ if (gfp_flags & __GFP_ZERO)
+ prep_zero_page(page, order);
+
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
}
@@ -796,12 +806,9 @@
*/
BUG_ON(gfp_mask & __GFP_HIGHMEM);
- page = alloc_pages(gfp_mask, 0);
- if (page) {
- void *address = page_address(page);
- clear_page(address);
- return (unsigned long) address;
- }
+ page = alloc_pages(gfp_mask | __GFP_ZERO, 0);
+ if (page)
+ return (unsigned long) page_address(page);
return 0;
}
Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h 2004-12-24 13:34:27.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h 2005-01-05 09:30:39.000000000 -0800
@@ -37,6 +37,7 @@
#define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
#define __GFP_COMP 0x4000 /* Add compound page metadata */
+#define __GFP_ZERO 0x8000 /* Return zeroed page on success */
#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
@@ -52,6 +53,7 @@
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
+#define GFP_HIGHZERO (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO)
/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-05 09:30:39.000000000 -0800
@@ -1650,10 +1650,9 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
if (!page)
goto no_mem;
- clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/kernel/profile.c
===================================================================
--- linux-2.6.10.orig/kernel/profile.c 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/kernel/profile.c 2005-01-05 09:30:39.000000000 -0800
@@ -326,17 +326,15 @@
node = cpu_to_node(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
return NOTIFY_BAD;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_free;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0] = page_address(page);
}
break;
@@ -510,16 +508,14 @@
int node = cpu_to_node(cpu);
struct page *page;
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_node(node, GFP_KERNEL, 0);
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
goto out_cleanup;
- clear_highpage(page);
per_cpu(cpu_profile_hits, cpu)[0]
= (struct profile_hit *)page_address(page);
}
Index: linux-2.6.10/mm/shmem.c
===================================================================
--- linux-2.6.10.orig/mm/shmem.c 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/mm/shmem.c 2005-01-05 09:30:39.000000000 -0800
@@ -369,9 +369,8 @@
}
spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
if (page) {
- clear_highpage(page);
page->nr_swapped = 0;
}
spin_lock(&info->lock);
@@ -910,7 +909,7 @@
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp, &pvma, 0);
+ page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -926,7 +925,7 @@
shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
unsigned long idx)
{
- return alloc_page(gfp);
+ return alloc_page(gfp | __GFP_ZERO);
}
#endif
@@ -1135,7 +1134,6 @@
info->alloced++;
spin_unlock(&info->lock);
- clear_highpage(filepage);
flush_dcache_page(filepage);
SetPageUptodate(filepage);
}
Index: linux-2.6.10/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/pgalloc.h 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -61,9 +61,7 @@
pgd_t *pgd = pgd_alloc_one_fast(mm);
if (unlikely(pgd == NULL)) {
- pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
- if (likely(pgd != NULL))
- clear_page(pgd);
+ pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
}
return pgd;
}
@@ -106,10 +104,8 @@
static inline pmd_t*
pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pmd != NULL))
- clear_page(pmd);
return pmd;
}
@@ -140,20 +136,16 @@
static inline struct page *
pte_alloc_one (struct mm_struct *mm, unsigned long addr)
{
- struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
- if (likely(pte != NULL))
- clear_page(page_address(pte));
return pte;
}
static inline pte_t *
pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
- if (likely(pte != NULL))
- clear_page(pte);
return pte;
}
Index: linux-2.6.10/arch/i386/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/i386/mm/pgtable.c 2005-01-04 14:16:59.000000000 -0800
+++ linux-2.6.10/arch/i386/mm/pgtable.c 2005-01-05 09:30:39.000000000 -0800
@@ -140,10 +140,7 @@
pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
- return pte;
+ return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
}
struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -151,12 +148,10 @@
struct page *pte;
#ifdef CONFIG_HIGHPTE
- pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
#else
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
#endif
- if (pte)
- clear_highpage(pte);
return pte;
}
Index: linux-2.6.10/include/asm-mips/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/pgalloc.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-mips/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -56,9 +56,7 @@
{
pte_t *pte;
- pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER);
return pte;
}
Index: linux-2.6.10/arch/alpha/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/mm/init.c 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/arch/alpha/mm/init.c 2005-01-05 09:30:39.000000000 -0800
@@ -42,10 +42,9 @@
{
pgd_t *ret, *init;
- ret = (pgd_t *)__get_free_page(GFP_KERNEL);
+ ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
init = pgd_offset(&init_mm, 0UL);
if (ret) {
- clear_page(ret);
#ifdef CONFIG_ALPHA_LARGE_VMALLOC
memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
(PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t));
@@ -63,9 +62,7 @@
pte_t *
pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
Index: linux-2.6.10/include/asm-parisc/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/pgalloc.h 2004-12-24 13:35:39.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -120,18 +120,14 @@
static inline struct page *
pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
- struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
- if (likely(page != NULL))
- clear_page(page_address(page));
+ struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return page;
}
static inline pte_t *
pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (likely(pte != NULL))
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
Index: linux-2.6.10/include/asm-sparc64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/pgalloc.h 2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -73,10 +73,9 @@
struct page *page;
preempt_enable();
- page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+ page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (page) {
ret = (struct page *)page_address(page);
- clear_page(ret);
page->lru.prev = (void *) 2UL;
preempt_disable();
Index: linux-2.6.10/include/asm-sh/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/pgalloc.h 2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-sh/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -44,9 +44,7 @@
{
pte_t *pte;
- pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
return pte;
}
@@ -56,9 +54,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.10/include/asm-m32r/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/pgalloc.h 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -23,10 +23,7 @@
*/
static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm)
{
- pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
-
- if (pgd)
- clear_page(pgd);
+ pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
return pgd;
}
@@ -39,10 +36,7 @@
static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL);
-
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
return pte;
}
@@ -50,10 +44,8 @@
static __inline__ struct page *pte_alloc_one(struct mm_struct *mm,
unsigned long address)
{
- struct page *pte = alloc_page(GFP_KERNEL);
+ struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO);
- if (pte)
- clear_page(page_address(pte));
return pte;
}
Index: linux-2.6.10/arch/um/kernel/mem.c
===================================================================
--- linux-2.6.10.orig/arch/um/kernel/mem.c 2005-01-04 14:17:00.000000000 -0800
+++ linux-2.6.10/arch/um/kernel/mem.c 2005-01-05 09:30:39.000000000 -0800
@@ -327,9 +327,7 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
@@ -337,9 +335,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_highpage(pte);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.10/include/asm-sh64/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/pgalloc.h 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -112,9 +112,7 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO);
return pte;
}
@@ -123,9 +121,7 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
@@ -150,9 +146,7 @@
static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
pmd_t *pmd;
- pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pmd)
- clear_page(pmd);
+ pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pmd;
}
Index: linux-2.6.10/include/asm-cris/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/pgalloc.h 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-cris/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -24,18 +24,14 @@
extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
{
- pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (pte)
- clear_page(pte);
+ pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return pte;
}
extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
- if (pte)
- clear_page(page_address(pte));
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
return pte;
}
Index: linux-2.6.10/arch/ppc/mm/pgtable.c
===================================================================
--- linux-2.6.10.orig/arch/ppc/mm/pgtable.c 2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/arch/ppc/mm/pgtable.c 2005-01-05 09:30:39.000000000 -0800
@@ -85,8 +85,7 @@
{
pgd_t *ret;
- if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL)
- clear_pages(ret, PGDIR_ORDER);
+ ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER);
return ret;
}
@@ -102,7 +101,7 @@
extern void *early_get_page(void);
if (mem_init_done) {
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
struct page *ptepage = virt_to_page(pte);
ptepage->mapping = (void *) mm;
@@ -110,8 +109,6 @@
}
} else
pte = (pte_t *)early_get_page();
- if (pte)
- clear_page(pte);
return pte;
}
Index: linux-2.6.10/include/asm-alpha/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/pgalloc.h 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -40,9 +40,7 @@
static inline pmd_t *
pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
- pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- if (ret)
- clear_page(ret);
+ pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
return ret;
}
Index: linux-2.6.10/include/asm-m68k/motorola_pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/motorola_pgalloc.h 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/motorola_pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -12,9 +12,8 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
- clear_page(pte);
__flush_page_to_ram(pte);
flush_tlb_kernel_page(pte);
nocache_page(pte);
@@ -31,7 +30,7 @@
static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
- struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
pte_t *pte;
if(!page)
@@ -39,7 +38,6 @@
pte = kmap(page);
if (pte) {
- clear_page(pte);
__flush_page_to_ram(pte);
flush_tlb_kernel_page(pte);
nocache_page(pte);
Index: linux-2.6.10/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sparc64/mm/init.c 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/sparc64/mm/init.c 2005-01-05 09:30:39.000000000 -0800
@@ -1687,13 +1687,12 @@
* Set up the zero page, mark it reserved, so that page count
* is not manipulated when freeing the page from user ptes.
*/
- mem_map_zero = alloc_pages(GFP_KERNEL, 0);
+ mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
if (mem_map_zero == NULL) {
prom_printf("paging_init: Cannot alloc zero page.\n");
prom_halt();
}
SetPageReserved(mem_map_zero);
- clear_page(page_address(mem_map_zero));
codepages = (((unsigned long) _etext) - ((unsigned long) _start));
codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT;
Index: linux-2.6.10/include/asm-arm/pgalloc.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/pgalloc.h 2004-12-24 13:35:29.000000000 -0800
+++ linux-2.6.10/include/asm-arm/pgalloc.h 2005-01-05 09:30:39.000000000 -0800
@@ -50,9 +50,8 @@
{
pte_t *pte;
- pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+ pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
if (pte) {
- clear_page(pte);
clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
pte += PTRS_PER_PTE;
}
@@ -65,10 +64,9 @@
{
struct page *pte;
- pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0);
+ pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
if (pte) {
void *page = page_address(pte);
- clear_page(page);
clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
}
Index: linux-2.6.10/drivers/block/pktcdvd.c
===================================================================
--- linux-2.6.10.orig/drivers/block/pktcdvd.c 2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/drivers/block/pktcdvd.c 2005-01-05 09:30:39.000000000 -0800
@@ -135,12 +135,10 @@
goto no_bio;
for (i = 0; i < PAGES_PER_PACKET; i++) {
- pkt->pages[i] = alloc_page(GFP_KERNEL);
+ pkt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!pkt->pages[i])
goto no_page;
}
- for (i = 0; i < PAGES_PER_PACKET; i++)
- clear_page(page_address(pkt->pages[i]));
spin_lock_init(&pkt->lock);
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
2005-01-04 23:14 ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
@ 2005-01-05 23:25 ` Christoph Lameter
2005-01-06 13:52 ` Andi Kleen
0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-05 23:25 UTC (permalink / raw)
To: Linus Torvalds, linux-ia64, Andrew Morton, linux-mm,
Linux Kernel Development
Here is an updated version that is independent of the first patch and
contains all the necessary modifications to make clear_page take a second
parameter.
Architecture support:
---------------------
Known to work:
ia64
i386
sparc64
m68k
Trivial modification expected to simply work:
arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um
Modification made but it would be good to have some feedback from the arch maintainers:
x86_64
s390
alpha
sh
mips
m32r
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -56,7 +56,7 @@
# ifdef __KERNEL__
# define STRICT_MM_TYPECHECKS
-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
extern void copy_page (void *to, void *from);
/*
@@ -65,7 +65,7 @@
*/
#define clear_user_page(addr, vaddr, page) \
do { \
- clear_page(addr); \
+ clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -18,7 +18,7 @@
#include <asm/mmx.h>
-#define clear_page(page) mmx_clear_page((void *)(page))
+#define clear_page(page, order) mmx_clear_page((void *)(page),order)
#define copy_page(to,from) mmx_copy_page(to,from)
#else
@@ -28,12 +28,12 @@
* Maybe the K6-III ?
*/
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#endif
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h 2005-01-04 14:17:01.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -32,10 +32,10 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-void clear_page(void *);
+void clear_page(void *, int);
void copy_page(void *, void *);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -28,10 +28,10 @@
#ifndef __ASSEMBLY__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
sparc_flush_page_to_ram(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -22,12 +22,12 @@
#ifndef __s390x__
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
register_pair rp;
rp.subreg.even = (unsigned long) page;
- rp.subreg.odd = (unsigned long) 4096;
+ rp.subreg.odd = (unsigned long) 4096 << order;
asm volatile (" slr 1,1\n"
" mvcl %0,0"
: "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@
#else /* __s390x__ */
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
- asm volatile (" lgr 2,%0\n"
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ asm volatile (" lgr 2,%0\n"
" lghi 3,4096\n"
" slgr 1,1\n"
" mvcl 2,0"
: : "a" ((void *) (page))
: "memory", "cc", "1", "2", "3" );
+ page += PAGE_SIZE;
+ }
}
static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@
#endif /* __s390x__ */
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/* Pure 2^n version of get_order */
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c 2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c 2005-01-05 10:09:51.000000000 -0800
@@ -128,7 +128,7 @@
* other MMX using processors do not.
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -138,7 +138,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/64;i++)
+ for(i=0;i<((4096/64) << order);i++)
{
__asm__ __volatile__ (
" movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
* Generic MMX implementation without K7 specific streaming
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -267,7 +267,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/128;i++)
+ for(i=0;i<((4096/128) << order);i++)
{
__asm__ __volatile__ (
" movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
* Favour MMX for page clear and copy.
*/
-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
{
int d0, d1;
__asm__ __volatile__( \
"cld\n\t" \
"rep ; stosl" \
: "=&c" (d0), "=&D" (d1)
- :"a" (0),"1" (page),"0" (1024)
+ :"a" (0),"1" (page),"0" (1024 << order)
:"memory");
}
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
{
if(unlikely(in_interrupt()))
- slow_zero_page(page);
+ slow_clear_page(page, order);
else
- fast_clear_page(page);
+ fast_clear_page(page, order);
}
static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h 2005-01-05 10:09:51.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S 2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S 2005-01-05 10:09:51.000000000 -0800
@@ -7,6 +7,7 @@
* 1/06/01 davidm Tuned for Itanium.
* 2/12/02 kchen Tuned for both Itanium and McKinley
* 3/08/02 davidm Some more tweaking
+ * 12/10/04 clameter Make it work on pages of order size
*/
#include <linux/config.h>
@@ -29,27 +30,33 @@
#define dst4 r11
#define dst_last r31
+#define totsize r14
GLOBAL_ENTRY(clear_page)
.prologue
- .regstk 1,0,0,0
- mov r16 = PAGE_SIZE/L3_LINE_SIZE-1 // main loop count, -1=repeat/until
+ .regstk 2,0,0,0
+ mov r16 = PAGE_SIZE/L3_LINE_SIZE // main loop count
+ mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+ ;;
.body
+ adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
- adds dst1 = 16, in0
adds dst2 = 32, in0
+ shl r16 = r16, in1
+ shl totsize = totsize, in1
;;
.fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless
br.cloop.sptk.few .fetch
+ add r16 = -1,r16
+ add dst_last = totsize, dst_fetch
+ adds dst4 = 64, in0
;;
- addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
mov ar.lc = r16 // one L3 line per iteration
- adds dst4 = 64, in0
+ adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
;;
#ifdef CONFIG_ITANIUM
// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S 2005-01-05 10:09:51.000000000 -0800
@@ -7,6 +7,7 @@
clear_page:
xorl %eax,%eax
movl $4096/64,%ecx
+ shl %esi, %ecx
.p2align 4
.Lloop:
decl %ecx
@@ -42,6 +43,7 @@
.section .altinstr_replacement,"ax"
clear_page_c:
movl $4096/8,%ecx
+ shl %esi, %ecx
xorl %eax,%eax
rep
stosq
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -36,12 +36,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
extern void (*copy_page)(void *to, void *from);
extern void clear_page_slow(void *to);
extern void copy_page_slow(void *to, void *from);
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
#if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
struct page;
extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
extern void __clear_user_page(void *to, void *orig_to);
extern void __copy_user_page(void *to, void *from, void *orig_to);
#elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#elif defined(CONFIG_CPU_SH4)
struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h 2005-01-05 10:09:51.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S 2005-01-05 10:09:51.000000000 -0800
@@ -6,11 +6,10 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
-
lda $0,128
nop
unop
@@ -36,4 +35,4 @@
unop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -50,12 +50,20 @@
extern void sh64_page_clear(void *page);
extern void sh64_page_copy(void *from, void *to);
-#define clear_page(page) sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ sh64_page_clear(page);
+ page += PAGE_SIZE;
+ }
+}
+
#define copy_page(to,from) sh64_page_copy(from, to)
#if defined(CONFIG_DCACHE_DISABLED)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -128,7 +128,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
extern void copy_page(void *to, const void *from);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h 2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -102,12 +102,12 @@
#define REGION_MASK (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
#define REGION_STRIDE (1UL << REGION_SHIFT)
-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, int order)
{
unsigned long lines, line_size;
line_size = systemcfg->dCacheL1LineSize;
- lines = naca->dCacheL1LinesPerPage;
+ lines = naca->dCacheL1LinesPerPage << order;
__asm__ __volatile__(
"mtctr %1 # clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -11,10 +11,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+
extern void copy_page(void *to, void *from);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -15,8 +15,20 @@
#define STRICT_MM_TYPECHECKS
-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr--)
+ {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c 2005-01-05 10:09:51.000000000 -0800
@@ -42,7 +42,7 @@
#ifdef CONFIG_SIBYTE_DMA_PAGEOPS
static inline void clear_page_cpu(void *page)
#else
-void clear_page(void *page)
+void _clear_page(void *page)
#endif
{
unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
}
-void clear_page(void *page)
+void _clear_page(void *page)
{
int cpu = smp_processor_id();
/* if the page is above Kseg0, use old way */
if (KSEGX(page) != CAC_BASE)
return clear_page_cpu(page);
-
page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@
#endif
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h 2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -50,7 +50,7 @@
);
}
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
unsigned long tmp;
unsigned long *sp = page;
@@ -69,16 +69,16 @@
"dbra %1,1b\n\t"
: "=a" (sp), "=d" (tmp)
: "a" (page), "0" (sp),
- "1" ((PAGE_SIZE - 16) / 16 - 1));
+ "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
}
#else
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
#endif
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -39,7 +39,18 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
extern void copy_page(void * to, void * from);
extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
{
extern void (*flush_data_cache_page)(unsigned long addr);
- clear_page(addr);
+ clear_page(addr, 0);
if (pages_do_alias((unsigned long) addr, vaddr))
flush_data_cache_page((unsigned long)addr);
}
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -15,10 +15,10 @@
#ifdef __KERNEL__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -37,11 +37,11 @@
#define STRICT_MM_TYPECHECKS
-#define clear_page(page) memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset ((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to, from) memcpy ((void *)(to), (void *)from, PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h 2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -13,7 +13,7 @@
#include <asm/types.h>
#include <asm/cache.h>
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) copy_user_page_asm((void *)(to), (void *)(from))
struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c 2005-01-05 10:09:51.000000000 -0800
@@ -47,7 +47,7 @@
*/
void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
{
- clear_page(kaddr);
+ _clear_page(kaddr);
}
/*
@@ -116,7 +116,7 @@
set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
flush_tlb_kernel_page(to);
- clear_page((void *)to);
+ _clear_page((void *)to);
spin_unlock(&v6_lock);
}
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S 2005-01-05 10:09:51.000000000 -0800
@@ -51,7 +51,7 @@
jmp r14
.text
- .global clear_page
+ .global _clear_page
/*
* clear_page (to)
*
@@ -60,7 +60,7 @@
* 16 * 256
*/
.align 4
-clear_page:
+_clear_page:
ldi r2, #255
ldi r4, #0
ld r3, @r0 /* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -85,7 +85,7 @@
struct page;
extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define clear_page clear_pages
extern void copy_page(void *to, void *from);
extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c 2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c 2005-01-05 10:09:51.000000000 -0800
@@ -88,7 +88,7 @@
EXPORT_SYMBOL(__memsetw);
EXPORT_SYMBOL(__constant_c_memset);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(__direct_map_base);
EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S 2005-01-05 10:09:51.000000000 -0800
@@ -6,9 +6,9 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
lda $0,128
@@ -51,4 +51,4 @@
nop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c 2005-01-05 10:09:51.000000000 -0800
@@ -57,7 +57,7 @@
#endif
void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);
void show_mem(void)
{
@@ -255,7 +255,7 @@
* later in the boot process if a better method is available.
*/
copy_page = copy_page_slow;
- clear_page = clear_page_slow;
+ _clear_page = clear_page_slow;
/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c 2005-01-05 10:09:51.000000000 -0800
@@ -78,7 +78,7 @@
return ret;
copy_page = copy_page_dma;
- clear_page = clear_page_dma;
+ _clear_page = clear_page_dma;
return ret;
}
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c 2005-01-05 10:09:51.000000000 -0800
@@ -27,7 +27,7 @@
static int __init pg_nommu_init(void)
{
copy_page = copy_page_nommu;
- clear_page = clear_page_nommu;
+ _clear_page = clear_page_nommu;
return 0;
}
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c 2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c 2005-01-05 10:09:51.000000000 -0800
@@ -39,9 +39,9 @@
static unsigned int clear_page_array[0x130 / 4];
-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
/*
* Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c 2005-01-05 10:09:51.000000000 -0800
@@ -102,7 +102,7 @@
EXPORT_SYMBOL(memcmp);
EXPORT_SYMBOL(memscan);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(strcat);
EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h 2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -25,7 +25,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
#define copy_page(to, from) __copy_user_page(to, from, 0);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h 2005-01-05 10:09:51.000000000 -0800
@@ -14,8 +14,8 @@
#ifndef __ASSEMBLY__
-extern void _clear_page(void *page);
-#define clear_page(X) _clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y) _clear_page((void *)(X),(Y))
struct page;
extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
#define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S 2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S 2005-01-05 10:09:51.000000000 -0800
@@ -28,9 +28,12 @@
.text
.globl _clear_page
-_clear_page: /* %o0=dest */
+_clear_page: /* %o0=dest, %o1=order */
+ sethi %hi(PAGE_SIZE/64), %o2
+ clr %o4
+ or %o2, %lo(PAGE_SIZE/64), %o2
ba,pt %xcc, clear_page_common
- clr %o4
+ sllx %o2, %o1, %o1
/* This thing is pretty important, it shows up
* on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
flush %g6
wrpr %o4, 0x0, %pstate
+ sethi %hi(PAGE_SIZE/64), %o1
mov 1, %o4
+ or %o1, %lo(PAGE_SIZE/64), %o1
clear_page_common:
VISEntryHalf
membar #StoreLoad | #StoreStore | #LoadStore
fzero %f0
- sethi %hi(PAGE_SIZE/64), %o1
mov %o0, %g1 ! remember vaddr for tlbflush
fzero %f2
- or %o1, %lo(PAGE_SIZE/64), %o1
faddd %f0, %f2, %f4
fmuld %f0, %f2, %f6
faddd %f0, %f2, %f8
Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c 2005-01-05 09:43:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c 2005-01-05 10:09:51.000000000 -0800
@@ -657,7 +657,7 @@
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
} else {
- clear_page(lp->fd_buf);
+ clear_page(lp->fd_buf, 0);
#ifdef __mips__
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-05 09:32:52.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-05 10:09:51.000000000 -0800
@@ -550,10 +550,14 @@
* or two.
*/
static inline void prep_zero_page(struct page *page, int order) {
- int i;
- for(i = 0; i < 1 << order; i++)
- clear_highpage(page + i);
+ if (PageHighMem(page)) {
+ int i;
+
+ for(i = 0; i < 1 << order; i++)
+ clear_highpage(page + i);
+ } else
+ clear_page(page_address(page), order);
}
static struct page *
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h 2005-01-05 10:09:44.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h 2005-01-05 10:10:08.000000000 -0800
@@ -45,7 +45,7 @@
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
- clear_page(kaddr);
+ clear_page(kaddr, 0);
kunmap_atomic(kaddr, KM_USER0);
}
* Re: Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
2005-01-05 23:25 ` Christoph Lameter
@ 2005-01-06 13:52 ` Andi Kleen
2005-01-06 17:47 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2005-01-06 13:52 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
> Here is an updated version that is independent of the first patch and
> contains all the necessary modifications to make clear_page take a second
> parameter.
I still think the clear_page order addition is completely pointless,
because for > order 0 you probably want a cache bypassing store
in a separate function.
Removing it would also make the patch much less intrusive.
-Andi
* Re: Prezeroing V3 [2/4]: Extension of clear_page to take an order parameter
2005-01-06 13:52 ` Andi Kleen
@ 2005-01-06 17:47 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-06 17:47 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel
On Thu, 6 Jan 2005, Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
>
> > Here is an updated version that is independent of the first patch and
> > contains all the necessary modifications to make clear_page take a second
> > parameter.
>
> I still think the clear_page order addition is completely pointless,
> because for > order 0 you probably want a cache bypassing store
> in a separate function.
I would think that having clear_page avoid loading cache
lines from memory should be a general improvement.
Bypassing the cache may be beneficial for clear_page in general, but I
would like to test that first.
If that is not a win then it may be better to implement cache
bypassing through a zero driver.
> Removing it would also make the patch much less intrusive.
Right. I also thought about that. I will likely offer the clear_page patch
as an optional component in V4. Being able to specify an order with
clear_page also helps in other situations like clearing huge pages.
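As a sketch of the fallback behavior mentioned here, when an architecture has no order-aware clear_page(), an order-N region is simply cleared as 2^N base-page clears. Names and the page size constant below are illustrative, not the patch's actual code:

```c
#include <string.h>

#define DEMO_PAGE_SIZE 4096UL

/* Clear one base page; stands in for the arch clear_page(). */
static void demo_clear_page(void *addr)
{
	memset(addr, 0, DEMO_PAGE_SIZE);
}

/* Fallback: clear an order-N block as 2^N order-0 clears. */
static void demo_clear_pages(void *addr, unsigned int order)
{
	unsigned long i;

	for (i = 0; i < (1UL << order); i++)
		demo_clear_page((char *)addr + i * DEMO_PAGE_SIZE);
}
```

An order-aware clear_page(addr, order) can do the same work in one pass, which is what makes it attractive for scrubd and huge pages.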
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-04 23:13 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-04 23:45 ` Dave Hansen
2005-01-05 0:34 ` Linus Torvalds
@ 2005-01-08 21:12 ` Hugh Dickins
2005-01-08 21:56 ` David S. Miller
2005-01-10 17:16 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2 siblings, 2 replies; 89+ messages in thread
From: Hugh Dickins @ 2005-01-08 21:12 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, David S. Miller, linux-ia64, Linus Torvalds,
linux-mm, Linux Kernel Development
On Tue, 4 Jan 2005, Christoph Lameter wrote:
> This patch introduces __GFP_ZERO as an additional gfp_mask element to allow
> to request zeroed pages from the page allocator.
> ...
> --- linux-2.6.10.orig/mm/memory.c 2005-01-04 12:16:41.000000000 -0800
> +++ linux-2.6.10/mm/memory.c 2005-01-04 12:16:49.000000000 -0800
> @@ -1650,10 +1650,9 @@
>
> if (unlikely(anon_vma_prepare(vma)))
> goto no_mem;
> - page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
> + page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
> if (!page)
> goto no_mem;
> - clear_user_highpage(page, addr);
>
> spin_lock(&mm->page_table_lock);
> page_table = pte_offset_map(pmd, addr);
Christoph, a late comment: doesn't this effectively replace
do_anonymous_page's clear_user_highpage by clear_highpage, which would
be a bad idea (inefficient? or corrupting?) on those few architectures
which actually do something with that user addr?
Hugh
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-08 21:12 ` Hugh Dickins
@ 2005-01-08 21:56 ` David S. Miller
2005-01-21 20:09 ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
` (2 more replies)
2005-01-10 17:16 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
1 sibling, 3 replies; 89+ messages in thread
From: David S. Miller @ 2005-01-08 21:56 UTC (permalink / raw)
To: Hugh Dickins; +Cc: clameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel
On Sat, 8 Jan 2005 21:12:10 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:
> Christoph, a late comment: doesn't this effectively replace
> do_anonymous_page's clear_user_highpage by clear_highpage, which would
> be a bad idea (inefficient? or corrupting?) on those few architectures
> which actually do something with that user addr?
Good catch, it probably does. We really do need to use
the page clearing routines that pass in the user virtual
address when preparing new anonymous pages or else we'll
get cache aliasing problems on sparc, sparc64, and mips
at the very least. That is what the virtual address argument
was added for to begin with.
The other way to deal with this is to make whatever routine
the kscrubd thing invokes do all the cache flushing et al.
magic so that the above works when taking pages from the
pre-zero'd pool (only, if no pre-zero'd pages are available
we still need to invoke clear_user_highpage() with the proper
virtual address).
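To illustrate why the user virtual address matters on virtually indexed caches: the cache "color" of a mapping is derived from the virtual address, and a page cleared through a kernel alias of a different color can leave stale lines visible through the user mapping. A toy color computation follows; all constants are invented for the demo, the real values are per-architecture:

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT   13	/* 8 KiB pages, as on sparc64 */
#define DEMO_CACHE_COLORS 4	/* hypothetical color count */

/* Color of the cache bin a virtually indexed cache selects for a
 * given mapping of a page. Two mappings of the same physical page
 * with different colors index different cache lines. */
static unsigned int demo_page_color(uintptr_t vaddr)
{
	return (unsigned int)((vaddr >> DEMO_PAGE_SHIFT) &
			      (DEMO_CACHE_COLORS - 1));
}
```

clear_user_highpage(page, vaddr) takes the user virtual address precisely so the architecture can flush or pick an alias matching the user mapping's color.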
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-08 21:12 ` Hugh Dickins
2005-01-08 21:56 ` David S. Miller
@ 2005-01-10 17:16 ` Christoph Lameter
2005-01-10 18:13 ` Linus Torvalds
1 sibling, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-10 17:16 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, David S. Miller, linux-ia64, Linus Torvalds,
linux-mm, Linux Kernel Development
On Sat, 8 Jan 2005, Hugh Dickins wrote:
> Christoph, a late comment: doesn't this effectively replace
> do_anonymous_page's clear_user_highpage by clear_highpage, which would
> be a bad idea (inefficient? or corrupting?) on those few architectures
> which actually do something with that user addr?
Yes. Right, my ia64-centric vision got me again. Thanks for all the other
patches that were posted. I hope this is now all cleared up?
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-10 17:16 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
@ 2005-01-10 18:13 ` Linus Torvalds
2005-01-10 20:17 ` Christoph Lameter
2005-01-10 23:53 ` Prezeroing V4 [0/4]: Overview Christoph Lameter
0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2005-01-10 18:13 UTC (permalink / raw)
To: Christoph Lameter
Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
linux-mm, Linux Kernel Development
On Mon, 10 Jan 2005, Christoph Lameter wrote:
>
> Yes. Right my ia64 centric vision got me again. Thanks for all the other
> patches that were posted. I hope this is now all cleared up?
Hmm.. I fixed things up, but I didn't exactly do it like the posted
patches.
Currently the BK tree
- doesn't use __GFP_ZERO with anonymous user-mapped pages (which is what
you wrote this whole thing for ;)
Potential fix: declare a per-architecture "alloc_user_highpage(vaddr)"
that does the proper magic on virtually indexed machines, and on others
it just does a "alloc_page(GFP_HIGHUSER | __GFP_ZERO)".
- verifies that nobody ever asks for a HIGHMEM allocation together with
__GFP_ZERO (nobody does - a quick grep shows that 99% of all uses are
statically clearly fine; there are a few HIGHMEM zero-page users, but
they are all GFP_KERNEL or similar), with just two special cases:
- get_zeroed_page() - which can't use HIGHMEM anyway
- shm.c does "mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO"
and that's fine because while the mapping gfp masks may lack
GFP_FS and GFP_IO, they are always supposed to be ok with
waiting.
- moves "kernel_map_pages()" into "prep_new_page()" to fix the
DEBUG_PAGEALLOC issue (Chris Wright).
So that should take care of the known problems.
Linus
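The invariant in the second bullet above can be sketched as a simple mask check. The flag bit values below are invented for the demo and are not the kernel's real gfp bits:

```c
/* Demo flag bits only; real __GFP_* values differ. */
#define DEMO_GFP_HIGHMEM 0x01u
#define DEMO_GFP_ZERO    0x02u

/* Reject a HIGHMEM allocation combined with __GFP_ZERO,
 * mirroring the check described above. */
static int demo_gfp_zero_ok(unsigned int gfp_mask)
{
	return !((gfp_mask & DEMO_GFP_ZERO) &&
		 (gfp_mask & DEMO_GFP_HIGHMEM));
}
```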
* Re: Prezeroing V3 [1/4]: Allow request for zeroed memory
2005-01-10 18:13 ` Linus Torvalds
@ 2005-01-10 20:17 ` Christoph Lameter
2005-01-10 23:53 ` Prezeroing V4 [0/4]: Overview Christoph Lameter
1 sibling, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-10 20:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
linux-mm, Linux Kernel Development
On Mon, 10 Jan 2005, Linus Torvalds wrote:
> Currently the BK tree
> - doesn't use __GFP_ZERO with anonymous user-mapped pages (which is what
> you wrote this whole thing for ;)
>
> Potential fix: declare a per-architecture "alloc_user_highpage(vaddr)"
> that does the proper magic on virtually indexed machines, and on others
> it just does a "alloc_page(GFP_HIGHUSER | __GFP_ZERO)".
The following patch adds an alloc_zeroed_user_highpage(vma, vaddr). It
also uses zeroed pages on COW. clear_user_highpage is now only used by
that function. Fold it into alloc_zeroed_user_highpage?
This is against the last hour's bitkeeper tree. mm/memory.o compiles fine but
I was not able to build an ia64 kernel due to some pieces that seem to be
missing from the last hour's tree.
Index: linus/include/asm-ia64/page.h
===================================================================
--- linus.orig/include/asm-ia64/page.h 2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-ia64/page.h 2005-01-10 12:05:55.000000000 -0800
@@ -75,6 +75,17 @@
flush_dcache_page(page); \
} while (0)
+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({ \
+ struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+ if (page) \
+ flush_dcache_page(page); \
+ page; \
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
#ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linus/include/asm-h8300/page.h
===================================================================
--- linus.orig/include/asm-h8300/page.h 2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-h8300/page.h 2005-01-10 11:53:17.000000000 -0800
@@ -30,6 +30,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linus/mm/memory.c
===================================================================
--- linus.orig/mm/memory.c 2005-01-10 11:44:39.000000000 -0800
+++ linus/mm/memory.c 2005-01-10 12:05:21.000000000 -0800
@@ -84,20 +84,6 @@
EXPORT_SYMBOL(vmalloc_earlyreserve);
/*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
- if (from == ZERO_PAGE(address)) {
- clear_user_highpage(to, address);
- return;
- }
- copy_user_highpage(to, from, address);
-}
-
-/*
* Note: this doesn't free the actual pages themselves. That
* has been handled earlier when unmapping all the memory regions.
*/
@@ -1329,11 +1315,16 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
- if (!new_page)
- goto no_new_page;
- copy_cow_page(old_page,new_page,address);
-
+ if (old_page == ZERO_PAGE(address)) {
+ new_page = alloc_zeroed_user_highpage(vma, address);
+ if (!new_page)
+ goto no_new_page;
+ } else {
+ new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ if (!new_page)
+ goto no_new_page;
+ copy_user_highpage(new_page, old_page, address);
+ }
/*
* Re-check the pte - we dropped the lock
*/
@@ -1795,10 +1786,9 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ page = alloc_zeroed_user_highpage(vma, addr);
if (!page)
goto no_mem;
- clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
Index: linus/include/asm-m32r/page.h
===================================================================
--- linus.orig/include/asm-m32r/page.h 2004-10-20 12:04:58.000000000 -0700
+++ linus/include/asm-m32r/page.h 2005-01-10 12:08:03.000000000 -0800
@@ -17,6 +17,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linus/include/asm-alpha/page.h
===================================================================
--- linus.orig/include/asm-alpha/page.h 2004-10-20 12:04:57.000000000 -0700
+++ linus/include/asm-alpha/page.h 2005-01-10 11:54:37.000000000 -0800
@@ -18,6 +18,9 @@
extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
Index: linus/include/asm-m68knommu/page.h
===================================================================
--- linus.orig/include/asm-m68knommu/page.h 2005-01-10 09:53:05.000000000 -0800
+++ linus/include/asm-m68knommu/page.h 2005-01-10 11:54:27.000000000 -0800
@@ -30,6 +30,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linus/include/asm-cris/page.h
===================================================================
--- linus.orig/include/asm-cris/page.h 2004-10-20 12:04:57.000000000 -0700
+++ linus/include/asm-cris/page.h 2005-01-10 11:55:06.000000000 -0800
@@ -21,6 +21,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linus/include/linux/highmem.h
===================================================================
--- linus.orig/include/linux/highmem.h 2005-01-06 12:58:48.000000000 -0800
+++ linus/include/linux/highmem.h 2005-01-10 12:08:56.000000000 -0800
@@ -42,6 +42,18 @@
smp_wmb();
}
+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page *alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+ if (page)
+ clear_user_highpage(page, vaddr);
+ return page;
+}
+#endif
+
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
Index: linus/include/asm-i386/page.h
===================================================================
--- linus.orig/include/asm-i386/page.h 2005-01-06 12:58:47.000000000 -0800
+++ linus/include/asm-i386/page.h 2005-01-10 12:09:43.000000000 -0800
@@ -36,6 +36,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linus/include/asm-x86_64/page.h
===================================================================
--- linus.orig/include/asm-x86_64/page.h 2005-01-06 12:58:48.000000000 -0800
+++ linus/include/asm-x86_64/page.h 2005-01-10 11:56:04.000000000 -0800
@@ -38,6 +38,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
*/
Index: linus/include/asm-s390/page.h
===================================================================
--- linus.orig/include/asm-s390/page.h 2004-10-20 12:04:59.000000000 -0700
+++ linus/include/asm-s390/page.h 2005-01-10 11:56:33.000000000 -0800
@@ -106,6 +106,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/* Pure 2^n version of get_order */
extern __inline__ int get_order(unsigned long size)
{
* Prezeroing V4 [0/4]: Overview
2005-01-10 18:13 ` Linus Torvalds
2005-01-10 20:17 ` Christoph Lameter
@ 2005-01-10 23:53 ` Christoph Lameter
2005-01-10 23:54 ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
` (3 more replies)
1 sibling, 4 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
linux-mm, Linux Kernel Development
Changes from V3 to V4:
o Drop the __GFP_ZERO patch since it is in Linus' tree. Include a new patch
that allows archs that need special measures around zeroing of user pages
during a page fault to maintain their special adaptations.
o Use zeroed pages during COW.
o Updates for clear_page for various platforms. Make clear_page an optional
patch and fall back to a series of order-0 clear_page calls if the patch
extending clear_page has not been applied.
o x86_64 asm code fixed up
o Port patches to 2.6.10-bk13 and make it fit the bitmapless buddy allocator
The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running on
SMP systems with a high number of cpus. Single thread performance shows
only minor increases; only the performance of multi-threaded
applications increases significantly.
The most expensive operation in the page fault handler (apart from SMP
locking overhead) is the zeroing of the page, which is also done in the page
fault handler. This zeroing means that all cachelines of the faulted page (on
Altix that means all 128 cachelines of 128 bytes each) must be loaded and later
written back. This patch makes it possible to avoid loading all cachelines
if only a part of the cachelines of that page is needed immediately after
the fault. Doing so is only effective for sparsely accessed memory,
which is typical for anonymous memory and pte maps. Prezeroed pages will
only be used for those purposes. Unzeroed pages will be used as usual for
file mappings, the page cache etc.
The patch makes prezeroing very effective by:
1. Aggregating zeroing operations to only apply to pages of higher order,
which results in many pages that will later become order-0 pages being
zeroed in one step.
For that purpose the existing clear_page function is extended and made to
take an additional argument specifying the order of the page to be cleared.
2. Hardware support for offloading zeroing from the cpu. This avoids
the invalidation of the cpu caches by extensive zeroing operations.
The scrub daemon is invoked when an unzeroed page of a certain order has
been generated, so that it is worth running. If no higher order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay off the
processor as much as possible.
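The wake-up rule described above can be sketched as follows, assuming (as the sysctl description below suggests) that a scrub_start of 0 means scrubbing is disabled; function and parameter names are illustrative:

```c
/* Wake kscrubd only when scrubbing is enabled and a page of at
 * least order scrub_start has been coalesced in the allocator. */
static int demo_should_wake_kscrubd(int scrub_start, int coalesced_order)
{
	return scrub_start != 0 && coalesced_order >= scrub_start;
}
```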
The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective
if the whole page is not immediately used after the page fault.
The patch is composed of 4 parts:
[1/4] GFP_ZERO fixups
Adds alloc_zeroed_user_highpage(vma, vaddr) that may be customized for
each arch by defining __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE. Includes
proper definitions for a large selection of arches; others fall back to
the default function in include/linux/highmem.h (which does not use
prezeroed pages).
[2/4] Page Zeroing
Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. scrubd is disabled by default but can be enabled
by writing an order number to /proc/sys/vm/scrub_start. If a page
of that order or higher is coalesced then the scrub daemon will
start zeroing until all pages of order /proc/sys/vm/scrub_stop and
higher are zeroed, and then go back to sleep.
In an SMP environment the scrub daemon is typically
running on the most idle cpu. Thus a single threaded application running
on one cpu may have the other cpu zeroing pages for it etc. The scrub
daemon is hardly noticeable and usually finishes zeroing quickly since
most processors are optimized for linear memory filling.
The following patches increase performance but may be omitted:
[3/4] SGI Altix Block Transfer Engine Support
Implements a driver to shift the zeroing off the cpu into hardware.
With hardware support the impact of zeroing on the system is reduced
to a minimum.
[4/4] Architecture specific clear_page updates
Adds a second order argument to clear_page and updates all arches.
This allows the zeroing of large areas of memory without repeatedly
invoking clear_page() for the page allocator, scrubd and the huge
page allocator.
* Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
2005-01-10 23:53 ` Prezeroing V4 [0/4]: Overview Christoph Lameter
@ 2005-01-10 23:54 ` Christoph Lameter
2005-01-11 0:41 ` Chris Wright
2005-01-10 23:55 ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
` (2 subsequent siblings)
3 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
linux-mm, Linux Kernel Development
This patch fixes the __GFP_ZERO related code by adding a new function
alloc_zeroed_user_highpage that is then used in the anonymous page fault
handler and in the COW code to allocate pages. The function can be defined
per arch to set up special processing for user pages by defining
__HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -75,6 +75,17 @@
flush_dcache_page(page); \
} while (0)
+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({ \
+ struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+ if (page) \
+ flush_dcache_page(page); \
+ page; \
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
#ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -30,6 +30,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-10 13:54:30.000000000 -0800
@@ -84,20 +84,6 @@
EXPORT_SYMBOL(vmalloc_earlyreserve);
/*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
- if (from == ZERO_PAGE(address)) {
- clear_user_highpage(to, address);
- return;
- }
- copy_user_highpage(to, from, address);
-}
-
-/*
* Note: this doesn't free the actual pages themselves. That
* has been handled earlier when unmapping all the memory regions.
*/
@@ -1329,11 +1315,16 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
- if (!new_page)
- goto no_new_page;
- copy_cow_page(old_page,new_page,address);
-
+ if (old_page == ZERO_PAGE(address)) {
+ new_page = alloc_zeroed_user_highpage(vma, address);
+ if (!new_page)
+ goto no_new_page;
+ } else {
+ new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ if (!new_page)
+ goto no_new_page;
+ copy_user_highpage(new_page, old_page, address);
+ }
/*
* Re-check the pte - we dropped the lock
*/
@@ -1795,7 +1786,7 @@
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
- page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
+ page = alloc_zeroed_user_highpage(vma, addr);
if (!page)
goto no_mem;
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -17,6 +17,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -18,6 +18,9 @@
extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -30,6 +30,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -21,6 +21,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h 2005-01-10 13:53:59.000000000 -0800
@@ -42,6 +42,18 @@
smp_wmb();
}
+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page *alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+ if (page)
+ clear_user_highpage(page, vaddr);
+ return page;
+}
+#endif
+
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -36,6 +36,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -38,6 +38,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h 2005-01-10 13:53:59.000000000 -0800
@@ -106,6 +106,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/* Pure 2^n version of get_order */
extern __inline__ int get_order(unsigned long size)
{
* Prezeroing V4 [2/4]: Zeroing implementation
2005-01-10 23:53 ` Prezeroing V4 [0/4]: Overview Christoph Lameter
2005-01-10 23:54 ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
@ 2005-01-10 23:55 ` Christoph Lameter
2005-01-10 23:55 ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
2005-01-10 23:56 ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
3 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
linux-mm, Linux Kernel Development
o Add page zeroing
o Add scrub daemon
o Add the ability to view the amount of zeroed memory in /proc/meminfo
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-10 14:44:22.000000000 -0800
@@ -12,6 +12,7 @@
* Zone balancing, Kanoj Sarcar, SGI, Jan 2000
* Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
* (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Support for page zeroing, Christoph Lameter, SGI, Dec 2004
*/
#include <linux/config.h>
@@ -33,6 +34,7 @@
#include <linux/cpu.h>
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
+#include <linux/scrub.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -167,16 +169,16 @@
* zone->lock is already acquired when we use these.
* So, we don't need atomic page->flags operations here.
*/
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
return page->private;
}
-static inline void set_page_order(struct page *page, int order) {
- page->private = order;
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+ page->private = order + (zero << 10);
__SetPagePrivate(page);
}
-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
{
__ClearPagePrivate(page);
page->private = 0;
@@ -187,14 +189,15 @@
* we can do coalesce a page and its buddy if
* (a) the buddy is free &&
* (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ * zeroing status.
* for recording page's order, we use page->private and PG_private.
*
*/
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
{
if (PagePrivate(page) &&
- (page_order(page) == order) &&
+ (page_zorder(page) == order + (zero << 10)) &&
!PageReserved(page) &&
page_count(page) == 0)
return 1;
@@ -225,22 +228,20 @@
* -- wli
*/
-static inline void __free_pages_bulk (struct page *page, struct page *base,
- struct zone *zone, unsigned int order)
+static inline int __free_pages_bulk (struct page *page, struct page *base,
+ struct zone *zone, unsigned int order, int zero)
{
unsigned long page_idx;
struct page *coalesced;
- int order_size = 1 << order;
if (unlikely(order))
destroy_compound_page(page, order);
page_idx = page - base;
- BUG_ON(page_idx & (order_size - 1));
+ BUG_ON(page_idx & ((1 << order) - 1));
BUG_ON(bad_range(zone, page));
- zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
struct free_area *area;
struct page *buddy;
@@ -250,20 +251,21 @@
buddy = base + buddy_idx;
if (bad_range(zone, buddy))
break;
- if (!page_is_buddy(buddy, order))
+ if (!page_is_buddy(buddy, order, zero))
break;
/* Move the buddy up one level. */
list_del(&buddy->lru);
- area = zone->free_area + order;
+ area = zone->free_area[zero] + order;
area->nr_free--;
- rmv_page_order(buddy);
+ rmv_page_zorder(buddy);
page_idx &= buddy_idx;
order++;
}
coalesced = base + page_idx;
- set_page_order(coalesced, order);
- list_add(&coalesced->lru, &zone->free_area[order].free_list);
- zone->free_area[order].nr_free++;
+ set_page_zorder(coalesced, order, zero);
+ list_add(&coalesced->lru, &zone->free_area[zero][order].free_list);
+ zone->free_area[zero][order].nr_free++;
+ return order;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -312,8 +314,11 @@
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
- __free_pages_bulk(page, base, zone, order);
+ if (__free_pages_bulk(page, base, zone, order, NOT_ZEROED)
+ >= sysctl_scrub_start)
+ wakeup_kscrubd(zone);
ret++;
+ zone->free_pages += 1UL << order;
}
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
@@ -341,6 +346,18 @@
free_pages_bulk(page_zone(page), 1, &list, order);
}
+void end_zero_page(struct page *page, unsigned int order)
+{
+ unsigned long flags;
+ struct zone * zone = page_zone(page);
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ __free_pages_bulk(page, zone->zone_mem_map, zone, order, ZEROED);
+ zone->zero_pages += 1UL << order;
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
/*
* The order of subdivision here is critical for the IO subsystem.
@@ -358,7 +375,7 @@
*/
static inline struct page *
expand(struct zone *zone, struct page *page,
- int low, int high, struct free_area *area)
+ int low, int high, struct free_area *area, int zero)
{
unsigned long size = 1 << high;
@@ -369,7 +386,7 @@
BUG_ON(bad_range(zone, &page[size]));
list_add(&page[size].lru, &area->free_list);
area->nr_free++;
- set_page_order(&page[size], high);
+ set_page_zorder(&page[size], high, zero);
}
return page;
}
@@ -419,23 +436,44 @@
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static void inline rmpage(struct page *page, struct free_area *area)
+{
+ list_del(&page->lru);
+ rmv_page_zorder(page);
+ area->nr_free--;
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area)
+{
+ unsigned long flags;
+ struct page *page = NULL;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ if (!list_empty(&area->free_list)) {
+ page = list_entry(area->free_list.next, struct page, lru);
+ rmpage(page, area);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
{
- struct free_area * area;
+ struct free_area *area;
unsigned int current_order;
struct page *page;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ area = zone->free_area[zero] + current_order;
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- rmv_page_order(page);
- area->nr_free--;
+ rmpage(page, zone->free_area[zero] + current_order);
zone->free_pages -= 1UL << order;
- return expand(zone, page, order, current_order, area);
+ if (zero)
+ zone->zero_pages -= 1UL << order;
+ return expand(zone, page, order, current_order, area, zero);
}
return NULL;
@@ -447,7 +485,7 @@
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list, int zero)
{
unsigned long flags;
int i;
@@ -456,7 +494,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, zero);
if (page == NULL)
break;
allocated++;
@@ -503,7 +541,7 @@
ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
+ list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
unsigned long start_pfn, i;
start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -595,7 +633,7 @@
* we cheat by calling it from here, in the order > 0 path. Saves a branch
* or two.
*/
-static inline void prep_zero_page(struct page *page, int order)
+void prep_zero_page(struct page *page, unsigned int order)
{
int i;
@@ -608,7 +646,9 @@
{
unsigned long flags;
struct page *page = NULL;
- int cold = !!(gfp_flags & __GFP_COLD);
+ int nr_pages = 1 << order;
+ int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+ int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;
if (order == 0) {
struct per_cpu_pages *pcp;
@@ -617,7 +657,7 @@
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list, zero);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -629,16 +669,25 @@
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, zero);
+ /*
+ * If we failed to obtain a zeroed and/or unzeroed page
+ * then we may still be able to obtain the other
+ * type of page.
+ */
+ if (!page) {
+ page = __rmqueue(zone, order, !zero);
+ zero = 0;
+ }
spin_unlock_irqrestore(&zone->lock, flags);
}
if (page != NULL) {
BUG_ON(bad_range(zone, page));
- mod_page_state_zone(zone, pgalloc, 1 << order);
+ mod_page_state_zone(zone, pgalloc, nr_pages);
prep_new_page(page, order);
- if (gfp_flags & __GFP_ZERO)
+ if ((gfp_flags & __GFP_ZERO) && !zero)
prep_zero_page(page, order);
if (order && (gfp_flags & __GFP_COMP))
@@ -667,7 +716,7 @@
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
+ free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free) << o;
/* Require fewer higher order pages to be free */
min >>= 1;
@@ -1045,7 +1094,7 @@
}
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat)
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
{
struct zone *zones = pgdat->node_zones;
int i;
@@ -1053,27 +1102,31 @@
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
*active += zones[i].nr_active;
*inactive += zones[i].nr_inactive;
*free += zones[i].free_pages;
+ *zero += zones[i].zero_pages;
}
}
void get_zone_counts(unsigned long *active,
- unsigned long *inactive, unsigned long *free)
+ unsigned long *inactive, unsigned long *free, unsigned long *zero)
{
struct pglist_data *pgdat;
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for_each_pgdat(pgdat) {
- unsigned long l, m, n;
- __get_zone_counts(&l, &m, &n, pgdat);
+ unsigned long l, m, n, o;
+ __get_zone_counts(&l, &m, &n, &o, pgdat);
*active += l;
*inactive += m;
*free += n;
+ *zero += o;
}
}
@@ -1110,6 +1163,7 @@
#define K(x) ((x) << (PAGE_SHIFT-10))
+const char *temp[3] = { "hot", "cold", "zero" };
/*
* Show free area list (used inside shift_scroll-lock stuff)
* We also calculate the percentage fragmentation. We do this by counting the
@@ -1122,6 +1176,7 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
struct zone *zone;
for_each_zone(zone) {
@@ -1142,10 +1197,10 @@
pageset = zone->pageset + cpu;
- for (temperature = 0; temperature < 2; temperature++)
+ for (temperature = 0; temperature < 3; temperature++)
printk("cpu %d %s: low %d, high %d, batch %d\n",
cpu,
- temperature ? "cold" : "hot",
+ temp[temperature],
pageset->pcp[temperature].low,
pageset->pcp[temperature].high,
pageset->pcp[temperature].batch);
@@ -1153,20 +1208,21 @@
}
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
printk("\nFree pages: %11ukB (%ukB HighMem)\n",
K(nr_free_pages()),
K(nr_free_highpages()));
printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
- "unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+ "unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
active,
inactive,
ps.nr_dirty,
ps.nr_writeback,
ps.nr_unstable,
nr_free_pages(),
+ zero,
ps.nr_slab,
ps.nr_mapped,
ps.nr_page_table_pages);
@@ -1215,7 +1271,7 @@
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
- nr = zone->free_area[order].nr_free;
+ nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
total += nr << order;
printk("%lu*%lukB ", nr, K(1UL) << order);
}
@@ -1515,8 +1571,10 @@
{
int order;
for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
+ INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
+ zone->free_area[NOT_ZEROED][order].nr_free = 0;
+ zone->free_area[ZEROED][order].nr_free = 0;
}
}
@@ -1541,6 +1599,7 @@
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->kscrubd_wait);
pgdat->kswapd_max_order = 0;
for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1564,6 +1623,7 @@
spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->zero_pages = 0;
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
@@ -1597,6 +1657,13 @@
pcp->high = 2 * batch;
pcp->batch = 1 * batch;
INIT_LIST_HEAD(&pcp->list);
+
+ pcp = &zone->pageset[cpu].pcp[2]; /* zero pages */
+ pcp->count = 0;
+ pcp->low = 0;
+ pcp->high = 2 * batch;
+ pcp->batch = 1 * batch;
+ INIT_LIST_HEAD(&pcp->list);
}
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
zone_names[j], realsize, batch);
@@ -1722,7 +1789,7 @@
spin_lock_irqsave(&zone->lock, flags);
seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
Index: linux-2.6.10/include/linux/mmzone.h
===================================================================
--- linux-2.6.10.orig/include/linux/mmzone.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h 2005-01-10 13:54:50.000000000 -0800
@@ -51,7 +51,7 @@
};
struct per_cpu_pageset {
- struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
+ struct per_cpu_pages pcp[3]; /* 0: hot. 1: cold 2: cold zeroed pages */
#ifdef CONFIG_NUMA
unsigned long numa_hit; /* allocated in intended node */
unsigned long numa_miss; /* allocated in non intended node */
@@ -107,10 +107,14 @@
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
+#define NOT_ZEROED 0
+#define ZEROED 1
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
+ unsigned long zero_pages;
/*
* protection[] is a pre-calculated number of extra pages that must be
* available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@
* free areas of different sizes
*/
spinlock_t lock;
- struct free_area free_area[MAX_ORDER];
+ struct free_area free_area[2][MAX_ORDER];
ZONE_PADDING(_pad1_)
@@ -266,6 +270,9 @@
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
+
+ wait_queue_head_t kscrubd_wait;
+ struct task_struct *kscrubd;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@
extern struct pglist_data *pgdat_list;
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat);
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
void get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free);
+ unsigned long *free, unsigned long *zero);
void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone, int order);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c 2005-01-10 13:48:10.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c 2005-01-10 13:54:50.000000000 -0800
@@ -123,12 +123,13 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
unsigned long committed;
unsigned long allowed;
struct vmalloc_info vmi;
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
/*
* display in kilobytes.
@@ -148,6 +149,7 @@
len = sprintf(page,
"MemTotal: %8lu kB\n"
"MemFree: %8lu kB\n"
+ "MemZero: %8lu kB\n"
"Buffers: %8lu kB\n"
"Cached: %8lu kB\n"
"SwapCached: %8lu kB\n"
@@ -171,6 +173,7 @@
"VmallocChunk: %8lu kB\n",
K(i.totalram),
K(i.freeram),
+ K(zero),
K(i.bufferram),
K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
===================================================================
--- linux-2.6.10.orig/mm/readahead.c 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/readahead.c 2005-01-10 13:54:50.000000000 -0800
@@ -573,7 +573,8 @@
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
Index: linux-2.6.10/drivers/base/node.c
===================================================================
--- linux-2.6.10.orig/drivers/base/node.c 2005-01-10 13:48:08.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c 2005-01-10 13:54:50.000000000 -0800
@@ -42,13 +42,15 @@
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
si_meminfo_node(&i, nid);
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));
n = sprintf(buf, "\n"
"Node %d MemTotal: %8lu kB\n"
"Node %d MemFree: %8lu kB\n"
+ "Node %d MemZero: %8lu kB\n"
"Node %d MemUsed: %8lu kB\n"
"Node %d Active: %8lu kB\n"
"Node %d Inactive: %8lu kB\n"
@@ -58,6 +60,7 @@
"Node %d LowFree: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
+ nid, K(zero),
nid, K(i.totalram - i.freeram),
nid, K(active),
nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-01-10 13:54:50.000000000 -0800
@@ -731,6 +731,7 @@
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
+#define PF_KSCRUBD 0x00800000 /* I am kscrubd */
#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
===================================================================
--- linux-2.6.10.orig/mm/Makefile 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/Makefile 2005-01-10 13:54:50.000000000 -0800
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o
+ vmalloc.o scrubd.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
Index: linux-2.6.10/mm/scrubd.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c 2005-01-10 14:56:20.000000000 -0800
@@ -0,0 +1,134 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 5; /* if a page of this order is coalesced then run kscrubd */
+unsigned int sysctl_scrub_stop = 2; /* Minimum order of page to zero */
+unsigned int sysctl_scrub_load = 999; /* Do not run scrubd if load > */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec(table, write, file, buffer, length, ppos);
+ if (sysctl_scrub_start < MAX_ORDER) {
+ struct zone *zone;
+
+ for_each_zone(zone)
+ wakeup_kscrubd(zone);
+ }
+ return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+ int order;
+
+ for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+ struct free_area *area = z->free_area[NOT_ZEROED] + order;
+ if (!list_empty(&area->free_list)) {
+ struct page *page = scrubd_rmpage(z, area);
+ struct list_head *l;
+ int size = PAGE_SIZE << order;
+
+ if (!page)
+ continue;
+
+ list_for_each(l, &zero_drivers) {
+ struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+
+ if (driver->start(page_address(page), size) == 0)
+ goto done;
+ }
+
+ /* Unable to find a zeroing device that would
+ * deal with this page so just do it on our own.
+ * This will likely thrash the cpu caches.
+ */
+ cond_resched();
+ prep_zero_page(page, order);
+done:
+ end_zero_page(page, order);
+ cond_resched();
+ return 1 << order;
+ }
+ }
+ return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+ int i;
+ unsigned long pages_zeroed;
+
+ if (system_state != SYSTEM_RUNNING)
+ return;
+
+ do {
+ pages_zeroed = 0;
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ pages_zeroed += zero_highest_order_page(zone);
+ }
+ } while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t*)p;
+ struct task_struct *tsk = current;
+ DEFINE_WAIT(wait);
+ cpumask_t cpumask;
+
+ daemonize("kscrubd%d", pgdat->node_id);
+ cpumask = node_to_cpumask(pgdat->node_id);
+ if (!cpus_empty(cpumask))
+ set_cpus_allowed(tsk, cpumask);
+
+ tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+ for ( ; ; ) {
+ if (current->flags & PF_FREEZE)
+ refrigerator(PF_FREEZE);
+ prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+ schedule();
+ finish_wait(&pgdat->kscrubd_wait, &wait);
+
+ scrub_pgdat(pgdat);
+ }
+ return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+ pg_data_t *pgdat;
+ for_each_pgdat(pgdat)
+ pgdat->kscrubd
+ = find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+ return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h 2005-01-10 14:34:25.000000000 -0800
@@ -0,0 +1,49 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+ int (*start)(void *, unsigned long); /* Start bzero transfer */
+ struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+ list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+ list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area);
+
+static void inline wakeup_kscrubd(struct zone *zone)
+{
+ if (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT))
+ return;
+ if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+ return;
+ wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page, unsigned int order);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c 2005-01-10 13:54:50.000000000 -0800
@@ -40,6 +40,7 @@
#include <linux/times.h>
#include <linux/limits.h>
#include <linux/dcache.h>
+#include <linux/scrub.h>
#include <linux/syscalls.h>
#include <asm/uaccess.h>
@@ -827,6 +828,33 @@
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = VM_SCRUB_START,
+ .procname = "scrub_start",
+ .data = &sysctl_scrub_start,
+ .maxlen = sizeof(sysctl_scrub_start),
+ .mode = 0644,
+ .proc_handler = &scrub_start_handler,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_STOP,
+ .procname = "scrub_stop",
+ .data = &sysctl_scrub_stop,
+ .maxlen = sizeof(sysctl_scrub_stop),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_LOAD,
+ .procname = "scrub_load",
+ .data = &sysctl_scrub_load,
+ .maxlen = sizeof(sysctl_scrub_load),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
{ .ctl_name = 0 }
};
Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h 2005-01-10 13:54:50.000000000 -0800
@@ -169,6 +169,9 @@
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_SCRUB_START=30, /* percentage * 10 at which to start scrubd */
+ VM_SCRUB_STOP=31, /* percentage * 10 at which to stop scrubd */
+ VM_SCRUB_LOAD=32, /* Load factor at which not to scrub anymore */
};
Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h 2005-01-10 13:54:50.000000000 -0800
@@ -132,4 +132,5 @@
void page_alloc_init(void);
+void prep_zero_page(struct page *, unsigned int order);
#endif /* __LINUX_GFP_H */
^ permalink raw reply [flat|nested] 89+ messages in thread
* Prezeroing V4 [3/4]: Altix SN2 BTE zero driver
2005-01-10 23:53 ` Prezeroing V4 [0/4]: Overview Christoph Lameter
2005-01-10 23:54 ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
2005-01-10 23:55 ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
@ 2005-01-10 23:55 ` Christoph Lameter
2005-01-10 23:56 ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
3 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
linux-mm, Linux Kernel Development
o Zeroing driver implemented with the Block Transfer Engine in the Altix
SN2 SHub.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/arch/ia64/sn/kernel/bte.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/bte.c 2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/bte.c 2005-01-10 13:54:52.000000000 -0800
@@ -4,6 +4,8 @@
* for more details.
*
* Copyright (c) 2000-2003 Silicon Graphics, Inc. All Rights Reserved.
+ *
+ * Support for zeroing pages, Christoph Lameter, SGI, December 2004.
*/
#include <linux/config.h>
@@ -20,6 +22,8 @@
#include <linux/bootmem.h>
#include <linux/string.h>
#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/scrub.h>
#include <asm/sn/bte.h>
@@ -30,7 +34,7 @@
/* two interfaces on two btes */
#define MAX_INTERFACES_TO_TRY 4
-static struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
+static inline struct bteinfo_s *bte_if_on_node(nasid_t nasid, int interface)
{
nodepda_t *tmp_nodepda;
@@ -132,7 +136,6 @@
if (bte == NULL) {
continue;
}
-
if (spin_trylock(&bte->spinlock)) {
if (!(*bte->most_rcnt_na & BTE_WORD_AVAILABLE) ||
(BTE_LNSTAT_LOAD(bte) & BTE_ACTIVE)) {
@@ -157,7 +160,7 @@
}
} while (1);
- if (notification == NULL) {
+ if (notification == NULL || (mode & BTE_NOTIFY_AND_GET_POINTER)) {
/* User does not want to be notified. */
bte->most_rcnt_na = &bte->notify;
} else {
@@ -192,6 +195,8 @@
itc_end = ia64_get_itc() + (40000000 * local_cpu_data->cyc_per_usec);
+ if (mode & BTE_NOTIFY_AND_GET_POINTER)
+ *(u64 volatile **)(notification) = &bte->notify;
spin_unlock_irqrestore(&bte->spinlock, irq_flags);
if (notification != NULL) {
@@ -449,5 +454,47 @@
mynodepda->bte_if[i].cleanup_active = 0;
mynodepda->bte_if[i].bh_error = 0;
}
+}
+
+u64 *bte_zero_notify[MAX_COMPACT_NODES];
+
+#define ZERO_RATE_PER_SEC 500000000
+
+static int bte_start_bzero(void *p, unsigned long len)
+{
+ int rc;
+ int ticks;
+ int node = get_nasid();
+
+ /* Check limitations.
+ 1. System must be running (weird things happen during bootup)
+ 2. Size >64KB. Smaller requests cause too much bte traffic
+ */
+ if (len >= BTE_MAX_XFER || len < 60000 || system_state != SYSTEM_RUNNING)
+ return EINVAL;
+
+ rc = bte_zero(ia64_tpa(p), len, BTE_NOTIFY_AND_GET_POINTER, bte_zero_notify+node);
+ if (rc)
+ return rc;
+
+ ticks = (len*HZ)/ZERO_RATE_PER_SEC;
+ if (ticks) {
+ /* Wait the minimum time of the transfer */
+ current->state = TASK_INTERRUPTIBLE;
+ schedule_timeout(ticks);
+ }
+ while (*(bte_zero_notify[node]) != BTE_WORD_BUSY) {
+ /* Then keep on checking until transfer is complete */
+ cpu_relax();
+ schedule();
+ }
+ return 0;
+}
+
+static struct zero_driver bte_bzero = {
+ .start = bte_start_bzero,
+};
+void sn_bte_bzero_init(void) {
+ register_zero_driver(&bte_bzero);
}
Index: linux-2.6.10/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.10.orig/arch/ia64/sn/kernel/setup.c 2005-01-10 13:48:08.000000000 -0800
+++ linux-2.6.10/arch/ia64/sn/kernel/setup.c 2005-01-10 13:54:52.000000000 -0800
@@ -244,6 +244,7 @@
int pxm;
int major = sn_sal_rev_major(), minor = sn_sal_rev_minor();
extern void sn_cpu_init(void);
+ extern void sn_bte_bzero_init(void);
/*
* If the generic code has enabled vga console support - lets
@@ -334,6 +335,7 @@
screen_info = sn_screen_info;
sn_timer_init();
+ sn_bte_bzero_init();
}
/**
Index: linux-2.6.10/include/asm-ia64/sn/bte.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/sn/bte.h 2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/sn/bte.h 2005-01-10 13:54:52.000000000 -0800
@@ -48,6 +48,8 @@
#define BTE_ZERO_FILL (BTE_NOTIFY | IBCT_ZFIL_MODE)
/* Use a reserved bit to let the caller specify a wait for any BTE */
#define BTE_WACQUIRE (0x4000)
+/* Return the pointer to the notification cacheline to the user */
+#define BTE_NOTIFY_AND_GET_POINTER (0x8000)
/* Use the BTE on the node with the destination memory */
#define BTE_USE_DEST (BTE_WACQUIRE << 1)
/* Use any available BTE interface on any node for the transfer */
* Prezeroing V4 [4/4]: Extend clear_page to take an order parameter
2005-01-10 23:53 ` Prezeroing V4 [0/4]: Overview Christoph Lameter
` (2 preceding siblings ...)
2005-01-10 23:55 ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
@ 2005-01-10 23:56 ` Christoph Lameter
3 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-10 23:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, David S. Miller, linux-ia64,
linux-mm, Linux Kernel Development
- Extend clear_page to take an order parameter.
Architecture support:
---------------------
Known to work:
ia64
i386
x86_64
sparc64
m68k
Trivial modifications expected to simply work:
arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um
Modifications made, but feedback from the arch maintainers would be welcome:
s390
alpha
sh
mips
m32r
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h 2005-01-10 14:23:21.000000000 -0800
@@ -56,7 +56,7 @@
# ifdef __KERNEL__
# define STRICT_MM_TYPECHECKS
-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
extern void copy_page (void *to, void *from);
/*
@@ -65,7 +65,7 @@
*/
#define clear_user_page(addr, vaddr, page) \
do { \
- clear_page(addr); \
+ clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -18,7 +18,7 @@
#include <asm/mmx.h>
-#define clear_page(page) mmx_clear_page((void *)(page))
+#define clear_page(page, order) mmx_clear_page((void *)(page),order)
#define copy_page(to,from) mmx_copy_page(to,from)
#else
@@ -28,12 +28,12 @@
* Maybe the K6-III ?
*/
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#endif
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -32,10 +32,10 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-void clear_page(void *);
+void clear_page(void *, int);
void copy_page(void *, void *);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -28,10 +28,10 @@
#ifndef __ASSEMBLY__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
sparc_flush_page_to_ram(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -22,12 +22,12 @@
#ifndef __s390x__
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
register_pair rp;
rp.subreg.even = (unsigned long) page;
- rp.subreg.odd = (unsigned long) 4096;
+ rp.subreg.odd = (unsigned long) 4096 << order;
asm volatile (" slr 1,1\n"
" mvcl %0,0"
: "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@
#else /* __s390x__ */
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
- asm volatile (" lgr 2,%0\n"
+ int nr = 1 << order;
+
+ while (nr-- >0) {
+ asm volatile (" lgr 2,%0\n"
" lghi 3,4096\n"
" slgr 1,1\n"
" mvcl 2,0"
: : "a" ((void *) (page))
: "memory", "cc", "1", "2", "3" );
+ page += PAGE_SIZE;
+ }
}
static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@
#endif /* __s390x__ */
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c 2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c 2005-01-10 14:23:22.000000000 -0800
@@ -128,7 +128,7 @@
* other MMX using processors do not.
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -138,7 +138,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/64;i++)
+ for(i=0;i<((4096/64) << order);i++)
{
__asm__ __volatile__ (
" movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@
* Generic MMX implementation without K7 specific streaming
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -267,7 +267,7 @@
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/128;i++)
+ for(i=0;i<((4096/128) << order);i++)
{
__asm__ __volatile__ (
" movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@
* Favour MMX for page clear and copy.
*/
-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
{
int d0, d1;
__asm__ __volatile__( \
"cld\n\t" \
"rep ; stosl" \
: "=&c" (d0), "=&D" (d1)
- :"a" (0),"1" (page),"0" (1024)
+ :"a" (0),"1" (page),"0" (1024 << order)
:"memory");
}
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
{
if(unlikely(in_interrupt()))
- slow_zero_page(page);
+ slow_clear_page(page, order);
else
- fast_clear_page(page);
+ fast_clear_page(page, order);
}
static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h 2005-01-10 14:23:22.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S 2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S 2005-01-10 14:23:22.000000000 -0800
@@ -7,6 +7,7 @@
* 1/06/01 davidm Tuned for Itanium.
* 2/12/02 kchen Tuned for both Itanium and McKinley
* 3/08/02 davidm Some more tweaking
+ * 12/10/04 clameter Make it work on higher-order pages
*/
#include <linux/config.h>
@@ -29,27 +30,33 @@
#define dst4 r11
#define dst_last r31
+#define totsize r14
GLOBAL_ENTRY(clear_page)
.prologue
- .regstk 1,0,0,0
- mov r16 = PAGE_SIZE/L3_LINE_SIZE-1 // main loop count, -1=repeat/until
+ .regstk 2,0,0,0
+ mov r16 = PAGE_SIZE/L3_LINE_SIZE // main loop count
+ mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+ ;;
.body
+ adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
- adds dst1 = 16, in0
adds dst2 = 32, in0
+ shl r16 = r16, in1
+ shl totsize = totsize, in1
;;
.fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless
br.cloop.sptk.few .fetch
+ add r16 = -1,r16
+ add dst_last = totsize, dst_fetch
+ adds dst4 = 64, in0
;;
- addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
mov ar.lc = r16 // one L3 line per iteration
- adds dst4 = 64, in0
+ adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
;;
#ifdef CONFIG_ITANIUM
// Optimized for Itanium
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S 2005-01-10 14:23:22.000000000 -0800
@@ -1,12 +1,16 @@
/*
* Zero a page.
* rdi page
+ * rsi order
*/
.globl clear_page
.p2align 4
clear_page:
+ movl $4096/64,%eax
+ movl %esi, %ecx
+ shll %cl, %eax
+ movl %eax, %ecx
xorl %eax,%eax
- movl $4096/64,%ecx
.p2align 4
.Lloop:
decl %ecx
@@ -41,7 +45,10 @@
.section .altinstr_replacement,"ax"
clear_page_c:
- movl $4096/8,%ecx
+ movl $4096/8,%eax
+ movl %esi, %ecx
+ shll %cl, %eax
+ movl %eax, %ecx
xorl %eax,%eax
rep
stosq
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -36,12 +36,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
extern void (*copy_page)(void *to, void *from);
extern void clear_page_slow(void *to);
extern void copy_page_slow(void *to, void *from);
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- >0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
#if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
struct page;
extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@
extern void __clear_user_page(void *to, void *orig_to);
extern void __copy_user_page(void *to, void *from, void *orig_to);
#elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#elif defined(CONFIG_CPU_SH4)
struct page;
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h 2005-01-10 14:23:22.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S 2005-01-10 14:23:22.000000000 -0800
@@ -6,11 +6,10 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
-
lda $0,128
nop
unop
@@ -36,4 +35,4 @@
unop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -50,12 +50,20 @@
extern void sh64_page_clear(void *page);
extern void sh64_page_copy(void *from, void *to);
-#define clear_page(page) sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ sh64_page_clear(page);
+ page += PAGE_SIZE;
+ }
+}
+
#define copy_page(to,from) sh64_page_copy(from, to)
#if defined(CONFIG_DCACHE_DISABLED)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -128,7 +128,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
extern void copy_page(void *to, const void *from);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -102,12 +102,12 @@
#define REGION_MASK (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
#define REGION_STRIDE (1UL << REGION_SHIFT)
-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, unsigned int order)
{
unsigned long lines, line_size;
line_size = ppc64_caches.dline_size;
- lines = ppc64_caches.dlines_per_page;
+ lines = ppc64_caches.dlines_per_page << order;
__asm__ __volatile__(
"mtctr %1 # clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -11,10 +11,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+
extern void copy_page(void *to, void *from);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -15,8 +15,20 @@
#define STRICT_MM_TYPECHECKS
-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr--)
+ {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c 2005-01-10 14:23:22.000000000 -0800
@@ -42,7 +42,7 @@
#ifdef CONFIG_SIBYTE_DMA_PAGEOPS
static inline void clear_page_cpu(void *page)
#else
-void clear_page(void *page)
+void _clear_page(void *page)
#endif
{
unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@
IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
}
-void clear_page(void *page)
+void _clear_page(void *page)
{
int cpu = smp_processor_id();
/* if the page is above Kseg0, use old way */
if (KSEGX(page) != CAC_BASE)
return clear_page_cpu(page);
-
page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@
#endif
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h 2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -50,7 +50,7 @@
);
}
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
unsigned long tmp;
unsigned long *sp = page;
@@ -69,16 +69,16 @@
"dbra %1,1b\n\t"
: "=a" (sp), "=d" (tmp)
: "a" (page), "0" (sp),
- "1" ((PAGE_SIZE - 16) / 16 - 1));
+ "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
}
#else
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
#endif
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -39,7 +39,18 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- >0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
extern void copy_page(void * to, void * from);
extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@
{
extern void (*flush_data_cache_page)(unsigned long addr);
- clear_page(addr);
+ clear_page(addr, 0);
if (pages_do_alias((unsigned long) addr, vaddr))
flush_data_cache_page((unsigned long)addr);
}
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -15,10 +15,10 @@
#ifdef __KERNEL__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -37,11 +37,11 @@
#define STRICT_MM_TYPECHECKS
-#define clear_page(page) memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset ((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to, from) memcpy ((void *)(to), (void *)from, PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h 2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -13,7 +13,7 @@
#include <asm/types.h>
#include <asm/cache.h>
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) copy_user_page_asm((void *)(to), (void *)(from))
struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c 2005-01-10 14:23:22.000000000 -0800
@@ -47,7 +47,7 @@
*/
void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
{
- clear_page(kaddr);
+ _clear_page(kaddr);
}
/*
@@ -116,7 +116,7 @@
set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
flush_tlb_kernel_page(to);
- clear_page((void *)to);
+ _clear_page((void *)to);
spin_unlock(&v6_lock);
}
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S 2005-01-10 14:23:22.000000000 -0800
@@ -51,7 +51,7 @@
jmp r14
.text
- .global clear_page
+ .global _clear_page
/*
* clear_page (to)
*
@@ -60,7 +60,7 @@
* 16 * 256
*/
.align 4
-clear_page:
+_clear_page:
ldi r2, #255
ldi r4, #0
ld r3, @r0 /* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -85,7 +85,7 @@
struct page;
extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define clear_page clear_pages
extern void copy_page(void *to, void *from);
extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c 2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c 2005-01-10 14:23:22.000000000 -0800
@@ -88,7 +88,7 @@
EXPORT_SYMBOL(__memsetw);
EXPORT_SYMBOL(__constant_c_memset);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(__direct_map_base);
EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S 2005-01-10 14:23:22.000000000 -0800
@@ -6,9 +6,9 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
lda $0,128
@@ -51,4 +51,4 @@
nop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c 2005-01-10 14:23:22.000000000 -0800
@@ -57,7 +57,7 @@
#endif
void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);
void show_mem(void)
{
@@ -255,7 +255,7 @@
* later in the boot process if a better method is available.
*/
copy_page = copy_page_slow;
- clear_page = clear_page_slow;
+ _clear_page = clear_page_slow;
/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c 2005-01-10 14:23:22.000000000 -0800
@@ -78,7 +78,7 @@
return ret;
copy_page = copy_page_dma;
- clear_page = clear_page_dma;
+ _clear_page = clear_page_dma;
return ret;
}
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c 2005-01-10 14:23:22.000000000 -0800
@@ -27,7 +27,7 @@
static int __init pg_nommu_init(void)
{
copy_page = copy_page_nommu;
- clear_page = clear_page_nommu;
+ _clear_page = clear_page_nommu;
return 0;
}
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c 2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c 2005-01-10 14:23:22.000000000 -0800
@@ -39,9 +39,9 @@
static unsigned int clear_page_array[0x130 / 4];
-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
/*
* Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c 2005-01-10 14:23:22.000000000 -0800
@@ -102,7 +102,7 @@
EXPORT_SYMBOL(memcmp);
EXPORT_SYMBOL(memscan);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(strcat);
EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h 2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -25,7 +25,7 @@
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
#define copy_page(to, from) __copy_user_page(to, from, 0);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h 2005-01-10 14:23:22.000000000 -0800
@@ -14,8 +14,8 @@
#ifndef __ASSEMBLY__
-extern void _clear_page(void *page);
-#define clear_page(X) _clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y) _clear_page((void *)(X),(Y))
struct page;
extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
#define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S 2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S 2005-01-10 14:23:22.000000000 -0800
@@ -28,9 +28,12 @@
.text
.globl _clear_page
-_clear_page: /* %o0=dest */
+_clear_page: /* %o0=dest, %o1=order */
+ sethi %hi(PAGE_SIZE/64), %o2
+ clr %o4
+ or %o2, %lo(PAGE_SIZE/64), %o2
ba,pt %xcc, clear_page_common
- clr %o4
+ sllx %o2, %o1, %o1
/* This thing is pretty important, it shows up
* on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@
flush %g6
wrpr %o4, 0x0, %pstate
+ sethi %hi(PAGE_SIZE/64), %o1
mov 1, %o4
+ or %o1, %lo(PAGE_SIZE/64), %o1
clear_page_common:
VISEntryHalf
membar #StoreLoad | #StoreStore | #LoadStore
fzero %f0
- sethi %hi(PAGE_SIZE/64), %o1
mov %o0, %g1 ! remember vaddr for tlbflush
fzero %f2
- or %o1, %lo(PAGE_SIZE/64), %o1
faddd %f0, %f2, %f4
fmuld %f0, %f2, %f6
faddd %f0, %f2, %f8
Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c 2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c 2005-01-10 14:23:22.000000000 -0800
@@ -657,7 +657,7 @@
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
} else {
- clear_page(lp->fd_buf);
+ clear_page(lp->fd_buf, 0);
#ifdef __mips__
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h 2005-01-10 13:53:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h 2005-01-10 14:23:22.000000000 -0800
@@ -56,7 +56,7 @@
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
- clear_page(kaddr);
+ clear_page(kaddr, 0);
kunmap_atomic(kaddr, KM_USER0);
}
Index: linux-2.6.10/fs/afs/file.c
===================================================================
--- linux-2.6.10.orig/fs/afs/file.c 2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c 2005-01-10 14:23:22.000000000 -0800
@@ -172,7 +172,7 @@
(size_t) PAGE_SIZE);
desc.buffer = kmap(page);
- clear_page(desc.buffer);
+ clear_page(desc.buffer, 0);
/* read the contents of the file from the server into the
* page */
Index: linux-2.6.10/fs/ntfs/compress.c
===================================================================
--- linux-2.6.10.orig/fs/ntfs/compress.c 2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/fs/ntfs/compress.c 2005-01-10 14:23:22.000000000 -0800
@@ -107,7 +107,7 @@
* FIXME: Using clear_page() will become wrong when we get
* PAGE_CACHE_SIZE != PAGE_SIZE but for now there is no problem.
*/
- clear_page(kp);
+ clear_page(kp, 0);
return;
}
kp_ofs = ni->initialized_size & ~PAGE_CACHE_MASK;
@@ -742,7 +742,7 @@
* for now there is no problem.
*/
if (likely(!cur_ofs))
- clear_page(page_address(page));
+ clear_page(page_address(page), 0);
else
memset(page_address(page) + cur_ofs, 0,
PAGE_CACHE_SIZE -
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-10 14:21:06.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-10 14:23:22.000000000 -0800
@@ -639,6 +639,10 @@
{
int i;
+ if (!PageHighMem(page)) {
+ clear_page(page_address(page), order);
+ return;
+ }
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
}
Index: linux-2.6.10/mm/hugetlb.c
===================================================================
--- linux-2.6.10.orig/mm/hugetlb.c 2005-01-10 13:48:11.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c 2005-01-10 14:23:22.000000000 -0800
@@ -89,8 +89,7 @@
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
- clear_highpage(&page[i]);
+ clear_page(page_address(page), HUGETLB_PAGE_ORDER);
return page;
}
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
2005-01-10 23:54 ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
@ 2005-01-11 0:41 ` Chris Wright
2005-01-11 0:46 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Chris Wright @ 2005-01-11 0:41 UTC (permalink / raw)
To: Christoph Lameter
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David S. Miller,
linux-ia64, linux-mm, Linux Kernel Development
* Christoph Lameter (clameter@sgi.com) wrote:
> @@ -1795,7 +1786,7 @@
>
> if (unlikely(anon_vma_prepare(vma)))
> goto no_mem;
> - page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
> + page = alloc_zeroed_user_highpage(vma, addr);
Oops, HIGHZERO is gone already in Linus' tree.
thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
2005-01-11 0:41 ` Chris Wright
@ 2005-01-11 0:46 ` Christoph Lameter
2005-01-11 0:49 ` Chris Wright
0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-11 0:46 UTC (permalink / raw)
To: Chris Wright
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David S. Miller,
linux-ia64, linux-mm, Linux Kernel Development
On Mon, 10 Jan 2005, Chris Wright wrote:
> * Christoph Lameter (clameter@sgi.com) wrote:
> > @@ -1795,7 +1786,7 @@
> >
> > if (unlikely(anon_vma_prepare(vma)))
> > goto no_mem;
> > - page = alloc_page_vma(GFP_HIGHZERO, vma, addr);
> > + page = alloc_zeroed_user_highpage(vma, addr);
>
> Oops, HIGHZERO is gone already in Linus' tree.
Use bk13 as I indicated.
* Re: Prezeroing V4 [1/4]: Arch specific page zeroing during page fault
2005-01-11 0:46 ` Christoph Lameter
@ 2005-01-11 0:49 ` Chris Wright
0 siblings, 0 replies; 89+ messages in thread
From: Chris Wright @ 2005-01-11 0:49 UTC (permalink / raw)
To: Christoph Lameter
Cc: Chris Wright, Linus Torvalds, Hugh Dickins, Andrew Morton,
David S. Miller, linux-ia64, linux-mm, Linux Kernel Development
* Christoph Lameter (clameter@sgi.com) wrote:
> Use bk13 as I indicated.
Ah, so you did, thanks ;-)
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net
* alloc_zeroed_user_highpage to fix the clear_user_highpage issue
2005-01-08 21:56 ` David S. Miller
@ 2005-01-21 20:09 ` Christoph Lameter
2005-02-09 9:58 ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
2005-01-21 20:12 ` Extend clear_page by an order parameter Christoph Lameter
2005-01-21 20:15 ` A scrub daemon (prezeroing) Christoph Lameter
2 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:09 UTC (permalink / raw)
To: David S. Miller
Cc: Hugh Dickins, akpm, linux-ia64, torvalds, linux-mm, linux-kernel
This patch adds a new function, alloc_zeroed_user_highpage, which is then used in the
anonymous page fault handler and in the COW code to allocate zeroed pages. An arch
can set up special processing for user pages by providing its own version and defining
__HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE. Arches that do not need to do anything special
for user pages define alloc_zeroed_user_highpage to simply do
alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
Patch against 2.6.11-rc1-bk9
This patch needs to touch a number of archs. Wish there were a better way
to do this.
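The two paths can be sketched as a userspace model (this is illustrative code, not the kernel implementation; names such as MODEL_PAGE_SIZE, fallback_alloc_zeroed and arch_alloc_zeroed are made up for the sketch):

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Model of the generic fallback: allocate a page, then clear it
 * explicitly (mirrors alloc_page_vma() + clear_user_highpage()). */
static void *fallback_alloc_zeroed(void)
{
	void *page = malloc(MODEL_PAGE_SIZE);

	if (!page)
		return NULL;
	memset(page, 0, MODEL_PAGE_SIZE);
	return page;
}

/* Model of an arch override: the allocator itself hands back an
 * already-zeroed page (mirrors GFP_HIGHUSER | __GFP_ZERO). */
static void *arch_alloc_zeroed(void)
{
	return calloc(1, MODEL_PAGE_SIZE);
}
```

Either way the caller gets a fully zeroed page; the arch hook only changes where the zeroing happens.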
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h 2005-01-21 10:44:27.000000000 -0800
@@ -42,6 +42,17 @@ static inline void clear_user_highpage(s
smp_wmb();
}
+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+static inline struct page* alloc_zeroed_user_highpage(struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+
+ clear_user_highpage(page, vaddr);
+ return page;
+}
+#endif
+
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
Index: linux-2.6.10/mm/memory.c
===================================================================
--- linux-2.6.10.orig/mm/memory.c 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/memory.c 2005-01-21 11:10:42.000000000 -0800
@@ -84,20 +84,6 @@ EXPORT_SYMBOL(high_memory);
EXPORT_SYMBOL(vmalloc_earlyreserve);
/*
- * We special-case the C-O-W ZERO_PAGE, because it's such
- * a common occurrence (no need to read the page to know
- * that it's zero - better for the cache and memory subsystem).
- */
-static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)
-{
- if (from == ZERO_PAGE(address)) {
- clear_user_highpage(to, address);
- return;
- }
- copy_user_highpage(to, from, address);
-}
-
-/*
* Note: this doesn't free the actual pages themselves. That
* has been handled earlier when unmapping all the memory regions.
*/
@@ -1329,11 +1315,16 @@ static int do_wp_page(struct mm_struct *
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
- if (!new_page)
- goto no_new_page;
- copy_cow_page(old_page,new_page,address);
-
+ if (old_page == ZERO_PAGE(address)) {
+ new_page = alloc_zeroed_user_highpage(vma, address);
+ if (!new_page)
+ goto no_new_page;
+ } else {
+ new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ if (!new_page)
+ goto no_new_page;
+ copy_user_highpage(new_page, old_page, address);
+ }
/*
* Re-check the pte - we dropped the lock
*/
@@ -1795,10 +1786,9 @@ do_anonymous_page(struct mm_struct *mm,
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
- page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ page = alloc_zeroed_user_highpage(vma, addr);
if (!page)
goto no_mem;
- clear_user_highpage(page, addr);
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -75,6 +75,16 @@ do { \
flush_dcache_page(page); \
} while (0)
+
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+({ \
+ struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+ flush_dcache_page(page); \
+ page; \
+})
+
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
#ifdef CONFIG_VIRTUAL_MEM_MAP
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h 2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -36,6 +36,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -38,6 +38,8 @@ void copy_page(void *, void *);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -17,6 +17,9 @@ extern void copy_page(void *to, void *fr
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -18,6 +18,9 @@
extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h 2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -30,6 +30,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -21,6 +21,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -106,6 +106,9 @@ static inline void copy_page(void *to, v
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/* Pure 2^n version of get_order */
extern __inline__ int get_order(unsigned long size)
{
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h 2005-01-21 10:44:27.000000000 -0800
@@ -30,6 +30,9 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+
/*
* These are used to make use of C type-checking..
*/
* Extend clear_page by an order parameter
2005-01-08 21:56 ` David S. Miller
2005-01-21 20:09 ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
@ 2005-01-21 20:12 ` Christoph Lameter
2005-01-21 22:29 ` Paul Mackerras
2005-01-23 7:45 ` Andrew Morton
2005-01-21 20:15 ` A scrub daemon (prezeroing) Christoph Lameter
2 siblings, 2 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:12 UTC (permalink / raw)
To: David S. Miller
Cc: Hugh Dickins, akpm, linux-ia64, torvalds, linux-mm, linux-kernel
The zeroing of a page of arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
clear_page that is capable of zeroing multiple pages at once (as would scrubd,
but that is now an independent patch). The following patch extends
clear_page with a second parameter specifying the order of the page to be zeroed,
allowing higher-order pages to be zeroed efficiently. Hope I caught everything...
Patch against 2.6.11-rc1-bk9
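The new calling convention amounts to this (a userspace sketch, not the arch implementations below; MODEL_PAGE_SIZE and model_clear_page are stand-in names): clear_page(addr, order) zeroes 2^order contiguous pages in one call.

```c
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Model of the extended interface: a single call zeroes 2^order
 * contiguous pages, so a higher-order page needs no per-page loop. */
static void model_clear_page(void *addr, int order)
{
	memset(addr, 0, (size_t)MODEL_PAGE_SIZE << order);
}
```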
Architecture support:
---------------------
Known to work:
ia64
i386
x86_64
sparc64
m68k
Trivial modification expected to simply work:
arm
cris
h8300
m68knommu
ppc
ppc64
sh64
v850
parisc
sparc
um
Modifications made, but it would be good to have some feedback from the arch maintainers:
s390
alpha
sh
mips
m32r
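For arches whose existing primitive clears exactly one page (s390, sh, m32r, alpha below), the conversion renames it to _clear_page and adds an inline wrapper that loops over the 2^order pages. A userspace model of that pattern (names here are illustrative):

```c
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Stand-in for an arch's single-page assembler routine (_clear_page). */
static void model__clear_page(void *page)
{
	memset(page, 0, MODEL_PAGE_SIZE);
}

/* Inline wrapper pattern: clear the 2^order pages one at a time. */
static void model_clear_page(void *page, int order)
{
	int nr = 1 << order;
	char *p = page;

	while (nr-- > 0) {
		model__clear_page(p);
		p += MODEL_PAGE_SIZE;
	}
}
```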
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-21 11:51:39.000000000 -0800
@@ -591,11 +591,16 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
}
-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
{
int i;
BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+ if (!PageHighMem(page)) {
+ clear_page(page_address(page), order);
+ return;
+ }
+
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
}
Index: linux-2.6.10/mm/hugetlb.c
===================================================================
--- linux-2.6.10.orig/mm/hugetlb.c 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/hugetlb.c 2005-01-21 11:51:39.000000000 -0800
@@ -78,7 +78,6 @@ void free_huge_page(struct page *page)
struct page *alloc_huge_page(void)
{
struct page *page;
- int i;
spin_lock(&hugetlb_lock);
page = dequeue_huge_page();
@@ -89,8 +88,7 @@ struct page *alloc_huge_page(void)
spin_unlock(&hugetlb_lock);
set_page_count(page, 1);
page[1].mapping = (void *)free_huge_page;
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
- clear_highpage(&page[i]);
+ prep_zero_page(page, HUGETLB_PAGE_ORDER, GFP_HIGHUSER);
return page;
}
Index: linux-2.6.10/include/linux/highmem.h
===================================================================
--- linux-2.6.10.orig/include/linux/highmem.h 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/highmem.h 2005-01-21 11:51:39.000000000 -0800
@@ -45,7 +45,7 @@ static inline void clear_user_highpage(s
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
- clear_page(kaddr);
+ clear_page(kaddr, 0);
kunmap_atomic(kaddr, KM_USER0);
}
Index: linux-2.6.10/drivers/net/tc35815.c
===================================================================
--- linux-2.6.10.orig/drivers/net/tc35815.c 2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10/drivers/net/tc35815.c 2005-01-21 11:51:39.000000000 -0800
@@ -657,7 +657,7 @@ tc35815_init_queues(struct net_device *d
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
} else {
- clear_page(lp->fd_buf);
+ clear_page(lp->fd_buf, 0);
#ifdef __mips__
dma_cache_wback_inv((unsigned long)lp->fd_buf, PAGE_SIZE * FD_PAGE_NUM);
#endif
Index: linux-2.6.10/fs/afs/file.c
===================================================================
--- linux-2.6.10.orig/fs/afs/file.c 2004-12-24 13:35:59.000000000 -0800
+++ linux-2.6.10/fs/afs/file.c 2005-01-21 11:51:39.000000000 -0800
@@ -172,7 +172,7 @@ static int afs_file_readpage(struct file
(size_t) PAGE_SIZE);
desc.buffer = kmap(page);
- clear_page(desc.buffer);
+ clear_page(desc.buffer, 0);
/* read the contents of the file from the server into the
* page */
Index: linux-2.6.10/fs/ntfs/compress.c
===================================================================
--- linux-2.6.10.orig/fs/ntfs/compress.c 2004-12-24 13:34:45.000000000 -0800
+++ linux-2.6.10/fs/ntfs/compress.c 2005-01-21 11:51:39.000000000 -0800
@@ -107,7 +107,7 @@ static void zero_partial_compressed_page
* FIXME: Using clear_page() will become wrong when we get
* PAGE_CACHE_SIZE != PAGE_SIZE but for now there is no problem.
*/
- clear_page(kp);
+ clear_page(kp, 0);
return;
}
kp_ofs = ni->initialized_size & ~PAGE_CACHE_MASK;
@@ -742,7 +742,7 @@ lock_retry_remap:
* for now there is no problem.
*/
if (likely(!cur_ofs))
- clear_page(page_address(page));
+ clear_page(page_address(page), 0);
else
memset(page_address(page) + cur_ofs, 0,
PAGE_CACHE_SIZE -
Index: linux-2.6.10/include/asm-ia64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ia64/page.h 2004-12-24 13:34:00.000000000 -0800
+++ linux-2.6.10/include/asm-ia64/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -56,7 +56,7 @@
# ifdef __KERNEL__
# define STRICT_MM_TYPECHECKS
-extern void clear_page (void *page);
+extern void clear_page (void *page, int order);
extern void copy_page (void *to, void *from);
/*
@@ -65,7 +65,7 @@ extern void copy_page (void *to, void *f
*/
#define clear_user_page(addr, vaddr, page) \
do { \
- clear_page(addr); \
+ clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
Index: linux-2.6.10/arch/ia64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/ia64/lib/clear_page.S 2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10/arch/ia64/lib/clear_page.S 2005-01-21 11:51:39.000000000 -0800
@@ -7,6 +7,7 @@
* 1/06/01 davidm Tuned for Itanium.
* 2/12/02 kchen Tuned for both Itanium and McKinley
* 3/08/02 davidm Some more tweaking
+ * 12/10/04 clameter Make it work on pages of order size
*/
#include <linux/config.h>
@@ -29,27 +30,33 @@
#define dst4 r11
#define dst_last r31
+#define totsize r14
GLOBAL_ENTRY(clear_page)
.prologue
- .regstk 1,0,0,0
- mov r16 = PAGE_SIZE/L3_LINE_SIZE-1 // main loop count, -1=repeat/until
+ .regstk 2,0,0,0
+ mov r16 = PAGE_SIZE/L3_LINE_SIZE // main loop count
+ mov totsize = PAGE_SIZE
.save ar.lc, saved_lc
mov saved_lc = ar.lc
-
+ ;;
.body
+ adds dst1 = 16, in0
mov ar.lc = (PREFETCH_LINES - 1)
mov dst_fetch = in0
- adds dst1 = 16, in0
adds dst2 = 32, in0
+ shl r16 = r16, in1
+ shl totsize = totsize, in1
;;
.fetch: stf.spill.nta [dst_fetch] = f0, L3_LINE_SIZE
adds dst3 = 48, in0 // executing this multiple times is harmless
br.cloop.sptk.few .fetch
+ add r16 = -1,r16
+ add dst_last = totsize, dst_fetch
+ adds dst4 = 64, in0
;;
- addl dst_last = (PAGE_SIZE - PREFETCH_LINES*L3_LINE_SIZE), dst_fetch
mov ar.lc = r16 // one L3 line per iteration
- adds dst4 = 64, in0
+ adds dst_last = -PREFETCH_LINES*L3_LINE_SIZE, dst_last
;;
#ifdef CONFIG_ITANIUM
// Optimized for Itanium
Index: linux-2.6.10/include/asm-i386/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/page.h 2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-i386/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -18,7 +18,7 @@
#include <asm/mmx.h>
-#define clear_page(page) mmx_clear_page((void *)(page))
+#define clear_page(page, order) mmx_clear_page((void *)(page),order)
#define copy_page(to,from) mmx_copy_page(to,from)
#else
@@ -28,12 +28,12 @@
* Maybe the K6-III ?
*/
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#endif
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-i386/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-i386/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-i386/mmx.h 2005-01-21 11:51:39.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/i386/lib/mmx.c
===================================================================
--- linux-2.6.10.orig/arch/i386/lib/mmx.c 2004-12-24 13:34:48.000000000 -0800
+++ linux-2.6.10/arch/i386/lib/mmx.c 2005-01-21 11:51:39.000000000 -0800
@@ -128,7 +128,7 @@ void *_mmx_memcpy(void *to, const void *
* other MMX using processors do not.
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -138,7 +138,7 @@ static void fast_clear_page(void *page)
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/64;i++)
+ for(i=0;i<((4096/64) << order);i++)
{
__asm__ __volatile__ (
" movntq %%mm0, (%0)\n"
@@ -257,7 +257,7 @@ static void fast_copy_page(void *to, voi
* Generic MMX implementation without K7 specific streaming
*/
-static void fast_clear_page(void *page)
+static void fast_clear_page(void *page, int order)
{
int i;
@@ -267,7 +267,7 @@ static void fast_clear_page(void *page)
" pxor %%mm0, %%mm0\n" : :
);
- for(i=0;i<4096/128;i++)
+ for(i=0;i<((4096/128) << order);i++)
{
__asm__ __volatile__ (
" movq %%mm0, (%0)\n"
@@ -359,23 +359,23 @@ static void fast_copy_page(void *to, voi
* Favour MMX for page clear and copy.
*/
-static void slow_zero_page(void * page)
+static void slow_clear_page(void * page, int order)
{
int d0, d1;
__asm__ __volatile__( \
"cld\n\t" \
"rep ; stosl" \
: "=&c" (d0), "=&D" (d1)
- :"a" (0),"1" (page),"0" (1024)
+ :"a" (0),"1" (page),"0" (1024 << order)
:"memory");
}
-
-void mmx_clear_page(void * page)
+
+void mmx_clear_page(void * page, int order)
{
if(unlikely(in_interrupt()))
- slow_zero_page(page);
+ slow_clear_page(page, order);
else
- fast_clear_page(page);
+ fast_clear_page(page, order);
}
static void slow_copy_page(void *to, void *from)
Index: linux-2.6.10/include/asm-x86_64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/page.h 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -32,10 +32,10 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-void clear_page(void *);
+void clear_page(void *, int);
void copy_page(void *, void *);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-x86_64/mmx.h
===================================================================
--- linux-2.6.10.orig/include/asm-x86_64/mmx.h 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/include/asm-x86_64/mmx.h 2005-01-21 11:51:39.000000000 -0800
@@ -8,7 +8,7 @@
#include <linux/types.h>
extern void *_mmx_memcpy(void *to, const void *from, size_t size);
-extern void mmx_clear_page(void *page);
+extern void mmx_clear_page(void *page, int order);
extern void mmx_copy_page(void *to, void *from);
#endif
Index: linux-2.6.10/arch/x86_64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/x86_64/lib/clear_page.S 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/arch/x86_64/lib/clear_page.S 2005-01-21 11:51:39.000000000 -0800
@@ -1,12 +1,16 @@
/*
* Zero a page.
* rdi page
+ * rsi order
*/
.globl clear_page
.p2align 4
clear_page:
+ movl $4096/64,%eax
+ movl %esi, %ecx
+ shll %cl, %eax
+ movl %eax, %ecx
xorl %eax,%eax
- movl $4096/64,%ecx
.p2align 4
.Lloop:
decl %ecx
@@ -41,7 +45,10 @@ clear_page_end:
.section .altinstr_replacement,"ax"
clear_page_c:
- movl $4096/8,%ecx
+ movl $4096/8,%eax
+ movl %esi, %ecx
+ shll %cl, %eax
+ movl %eax, %ecx
xorl %eax,%eax
rep
stosq
Index: linux-2.6.10/include/asm-sparc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-sparc/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -28,10 +28,10 @@
#ifndef __ASSEMBLY__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
sparc_flush_page_to_ram(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-s390/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-s390/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-s390/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -22,12 +22,12 @@
#ifndef __s390x__
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
register_pair rp;
rp.subreg.even = (unsigned long) page;
- rp.subreg.odd = (unsigned long) 4096;
+ rp.subreg.odd = (unsigned long) 4096 << order;
asm volatile (" slr 1,1\n"
" mvcl %0,0"
: "+&a" (rp) : : "memory", "cc", "1" );
@@ -63,14 +63,19 @@ static inline void copy_page(void *to, v
#else /* __s390x__ */
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
- asm volatile (" lgr 2,%0\n"
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ asm volatile (" lgr 2,%0\n"
" lghi 3,4096\n"
" slgr 1,1\n"
" mvcl 2,0"
: : "a" ((void *) (page))
: "memory", "cc", "1", "2", "3" );
+ page += PAGE_SIZE;
+ }
}
static inline void copy_page(void *to, void *from)
@@ -103,7 +108,7 @@ static inline void copy_page(void *to, v
#endif /* __s390x__ */
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/* Pure 2^n version of get_order */
Index: linux-2.6.10/include/asm-sh/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh/page.h 2004-12-24 13:35:28.000000000 -0800
+++ linux-2.6.10/include/asm-sh/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -36,12 +36,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void (*clear_page)(void *to);
+extern void (*_clear_page)(void *to);
extern void (*copy_page)(void *to, void *from);
extern void clear_page_slow(void *to);
extern void copy_page_slow(void *to, void *from);
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
#if defined(CONFIG_SH7705_CACHE_32KB) && defined(CONFIG_MMU)
struct page;
extern void clear_user_page(void *to, unsigned long address, struct page *pg);
@@ -49,7 +59,7 @@ extern void copy_user_page(void *to, voi
extern void __clear_user_page(void *to, void *orig_to);
extern void __copy_user_page(void *to, void *from, void *orig_to);
#elif defined(CONFIG_CPU_SH2) || defined(CONFIG_CPU_SH3) || !defined(CONFIG_MMU)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#elif defined(CONFIG_CPU_SH4)
struct page;
Index: linux-2.6.10/arch/alpha/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/clear_page.S 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/clear_page.S 2005-01-21 11:51:39.000000000 -0800
@@ -6,11 +6,10 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
-
lda $0,128
nop
unop
@@ -36,4 +35,4 @@ clear_page:
unop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/include/asm-sh64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sh64/page.h 2004-12-24 13:34:33.000000000 -0800
+++ linux-2.6.10/include/asm-sh64/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -50,12 +50,20 @@ extern struct page *mem_map;
extern void sh64_page_clear(void *page);
extern void sh64_page_copy(void *from, void *to);
-#define clear_page(page) sh64_page_clear(page)
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr-- > 0) {
+ sh64_page_clear(page);
+ page += PAGE_SIZE;
+ }
+}
+
#define copy_page(to,from) sh64_page_copy(from, to)
#if defined(CONFIG_DCACHE_DISABLED)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#else
Index: linux-2.6.10/include/asm-h8300/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-h8300/page.h 2004-12-24 13:35:25.000000000 -0800
+++ linux-2.6.10/include/asm-h8300/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-arm/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm/page.h 2004-12-24 13:34:01.000000000 -0800
+++ linux-2.6.10/include/asm-arm/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -128,7 +128,7 @@ extern void __cpu_copy_user_page(void *t
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
extern void copy_page(void *to, const void *from);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-ppc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc64/page.h 2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-ppc64/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -102,12 +102,12 @@
#define REGION_MASK (((1UL<<REGION_SIZE)-1UL)<<REGION_SHIFT)
#define REGION_STRIDE (1UL << REGION_SHIFT)
-static __inline__ void clear_page(void *addr)
+static __inline__ void clear_page(void *addr, unsigned int order)
{
unsigned long lines, line_size;
line_size = ppc64_caches.dline_size;
- lines = ppc64_caches.dlines_per_page;
+ lines = ppc64_caches.dlines_per_page << order;
__asm__ __volatile__(
"mtctr %1 # clear_page\n\
Index: linux-2.6.10/include/asm-m32r/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m32r/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-m32r/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -11,10 +11,22 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void *to);
+extern void _clear_page(void *to);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+
extern void copy_page(void *to, void *from);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-alpha/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-alpha/page.h 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/include/asm-alpha/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -15,8 +15,20 @@
#define STRICT_MM_TYPECHECKS
-extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+extern void _clear_page(void *page);
+
+static inline void clear_page(void *page, int order)
+{
+ int nr = 1 << order;
+
+ while (nr--)
+ {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
Index: linux-2.6.10/arch/mips/mm/pg-sb1.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-sb1.c 2004-12-24 13:35:50.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-sb1.c 2005-01-21 11:51:39.000000000 -0800
@@ -42,7 +42,7 @@
#ifdef CONFIG_SIBYTE_DMA_PAGEOPS
static inline void clear_page_cpu(void *page)
#else
-void clear_page(void *page)
+void _clear_page(void *page)
#endif
{
unsigned char *addr = (unsigned char *) page;
@@ -172,14 +172,13 @@ void sb1_dma_init(void)
IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_BASE)));
}
-void clear_page(void *page)
+void _clear_page(void *page)
{
int cpu = smp_processor_id();
/* if the page is above Kseg0, use old way */
if (KSEGX(page) != CAC_BASE)
return clear_page_cpu(page);
-
page_descr[cpu].dscr_a = PHYSADDR(page) | M_DM_DSCRA_ZERO_MEM | M_DM_DSCRA_L2C_DEST | M_DM_DSCRA_INTERRUPT;
page_descr[cpu].dscr_b = V_DM_DSCRB_SRC_LENGTH(PAGE_SIZE);
__raw_writeq(1, IOADDR(A_DM_REGISTER(cpu, R_DM_DSCR_COUNT)));
@@ -218,5 +217,5 @@ void copy_page(void *to, void *from)
#endif
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(copy_page);
Index: linux-2.6.10/include/asm-m68k/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68k/page.h 2004-12-24 13:35:49.000000000 -0800
+++ linux-2.6.10/include/asm-m68k/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -50,7 +50,7 @@ static inline void copy_page(void *to, v
);
}
-static inline void clear_page(void *page)
+static inline void clear_page(void *page, int order)
{
unsigned long tmp;
unsigned long *sp = page;
@@ -69,16 +69,16 @@ static inline void clear_page(void *page
"dbra %1,1b\n\t"
: "=a" (sp), "=d" (tmp)
: "a" (page), "0" (sp),
- "1" ((PAGE_SIZE - 16) / 16 - 1));
+ "1" (((PAGE_SIZE<<(order)) - 16) / 16 - 1));
}
#else
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
#endif
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-mips/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-mips/page.h 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/include/asm-mips/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -39,7 +39,18 @@
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
-extern void clear_page(void * page);
+extern void _clear_page(void * page);
+
+static inline void clear_page(void *page, int order)
+{
+ unsigned int nr = 1 << order;
+
+ while (nr-- > 0) {
+ _clear_page(page);
+ page += PAGE_SIZE;
+ }
+}
+
extern void copy_page(void * to, void * from);
extern unsigned long shm_align_mask;
@@ -57,7 +68,7 @@ static inline void clear_user_page(void
{
extern void (*flush_data_cache_page)(unsigned long addr);
- clear_page(addr);
+ clear_page(addr, 0);
if (pages_do_alias((unsigned long) addr, vaddr))
flush_data_cache_page((unsigned long)addr);
}
Index: linux-2.6.10/include/asm-m68knommu/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-m68knommu/page.h 2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/include/asm-m68knommu/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -24,10 +24,10 @@
#define get_user_page(vaddr) __get_free_page(GFP_KERNEL)
#define free_user_page(page, addr) free_page(addr)
-#define clear_page(page) memset((page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((to), (from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-cris/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-cris/page.h 2004-12-24 13:34:30.000000000 -0800
+++ linux-2.6.10/include/asm-cris/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -15,10 +15,10 @@
#ifdef __KERNEL__
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) memcpy((void *)(to), (void *)(from), PAGE_SIZE)
-#define clear_user_page(page, vaddr, pg) clear_page(page)
+#define clear_user_page(page, vaddr, pg) clear_page(page, 0)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
/*
Index: linux-2.6.10/include/asm-v850/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-v850/page.h 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/include/asm-v850/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -37,11 +37,11 @@
#define STRICT_MM_TYPECHECKS
-#define clear_page(page) memset ((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset ((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to, from) memcpy ((void *)(to), (void *)from, PAGE_SIZE)
#define clear_user_page(addr, vaddr, page) \
- do { clear_page(addr); \
+ do { clear_page(addr, 0); \
flush_dcache_page(page); \
} while (0)
#define copy_user_page(to, from, vaddr, page) \
Index: linux-2.6.10/include/asm-parisc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-parisc/page.h 2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10/include/asm-parisc/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -13,7 +13,7 @@
#include <asm/types.h>
#include <asm/cache.h>
-#define clear_page(page) memset((void *)(page), 0, PAGE_SIZE)
+#define clear_page(page, order) memset((void *)(page), 0, PAGE_SIZE << (order))
#define copy_page(to,from) copy_user_page_asm((void *)(to), (void *)(from))
struct page;
Index: linux-2.6.10/arch/arm/mm/copypage-v6.c
===================================================================
--- linux-2.6.10.orig/arch/arm/mm/copypage-v6.c 2004-12-24 13:34:31.000000000 -0800
+++ linux-2.6.10/arch/arm/mm/copypage-v6.c 2005-01-21 11:51:39.000000000 -0800
@@ -47,7 +47,7 @@ void v6_copy_user_page_nonaliasing(void
*/
void v6_clear_user_page_nonaliasing(void *kaddr, unsigned long vaddr)
{
- clear_page(kaddr);
+ _clear_page(kaddr);
}
/*
@@ -116,7 +116,7 @@ void v6_clear_user_page_aliasing(void *k
set_pte(to_pte + offset, pfn_pte(__pa(kaddr) >> PAGE_SHIFT, to_pgprot));
flush_tlb_kernel_page(to);
- clear_page((void *)to);
+ _clear_page((void *)to);
spin_unlock(&v6_lock);
}
Index: linux-2.6.10/arch/m32r/mm/page.S
===================================================================
--- linux-2.6.10.orig/arch/m32r/mm/page.S 2004-12-24 13:34:57.000000000 -0800
+++ linux-2.6.10/arch/m32r/mm/page.S 2005-01-21 11:51:39.000000000 -0800
@@ -51,7 +51,7 @@ copy_page:
jmp r14
.text
- .global clear_page
+ .global _clear_page
/*
* clear_page (to)
*
@@ -60,7 +60,7 @@ copy_page:
* 16 * 256
*/
.align 4
-clear_page:
+_clear_page:
ldi r2, #255
ldi r4, #0
ld r3, @r0 /* cache line allocate */
Index: linux-2.6.10/include/asm-ppc/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-ppc/page.h 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/include/asm-ppc/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -85,7 +85,7 @@ typedef unsigned long pgprot_t;
struct page;
extern void clear_pages(void *page, int order);
-static inline void clear_page(void *page) { clear_pages(page, 0); }
+#define clear_page clear_pages
extern void copy_page(void *to, void *from);
extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
extern void copy_user_page(void *to, void *from, unsigned long vaddr,
Index: linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/alpha/kernel/alpha_ksyms.c 2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10/arch/alpha/kernel/alpha_ksyms.c 2005-01-21 11:51:39.000000000 -0800
@@ -88,7 +88,7 @@ EXPORT_SYMBOL(__memset);
EXPORT_SYMBOL(__memsetw);
EXPORT_SYMBOL(__constant_c_memset);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(__direct_map_base);
EXPORT_SYMBOL(__direct_map_size);
Index: linux-2.6.10/arch/alpha/lib/ev6-clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/alpha/lib/ev6-clear_page.S 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/alpha/lib/ev6-clear_page.S 2005-01-21 11:51:39.000000000 -0800
@@ -6,9 +6,9 @@
.text
.align 4
- .global clear_page
- .ent clear_page
-clear_page:
+ .global _clear_page
+ .ent _clear_page
+_clear_page:
.prologue 0
lda $0,128
@@ -51,4 +51,4 @@ clear_page:
nop
nop
- .end clear_page
+ .end _clear_page
Index: linux-2.6.10/arch/sh/mm/init.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/init.c 2004-12-24 13:35:24.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/init.c 2005-01-21 11:51:39.000000000 -0800
@@ -57,7 +57,7 @@ bootmem_data_t discontig_node_bdata[MAX_
#endif
void (*copy_page)(void *from, void *to);
-void (*clear_page)(void *to);
+void (*_clear_page)(void *to);
void show_mem(void)
{
@@ -255,7 +255,7 @@ void __init mem_init(void)
* later in the boot process if a better method is available.
*/
copy_page = copy_page_slow;
- clear_page = clear_page_slow;
+ _clear_page = clear_page_slow;
/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem_node(NODE_DATA(0));
Index: linux-2.6.10/arch/sh/mm/pg-dma.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-dma.c 2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-dma.c 2005-01-21 11:51:39.000000000 -0800
@@ -78,7 +78,7 @@ static int __init pg_dma_init(void)
return ret;
copy_page = copy_page_dma;
- clear_page = clear_page_dma;
+ _clear_page = clear_page_dma;
return ret;
}
Index: linux-2.6.10/arch/sh/mm/pg-nommu.c
===================================================================
--- linux-2.6.10.orig/arch/sh/mm/pg-nommu.c 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/arch/sh/mm/pg-nommu.c 2005-01-21 11:51:39.000000000 -0800
@@ -27,7 +27,7 @@ static void clear_page_nommu(void *to)
static int __init pg_nommu_init(void)
{
copy_page = copy_page_nommu;
- clear_page = clear_page_nommu;
+ _clear_page = clear_page_nommu;
return 0;
}
Index: linux-2.6.10/arch/mips/mm/pg-r4k.c
===================================================================
--- linux-2.6.10.orig/arch/mips/mm/pg-r4k.c 2004-12-24 13:34:49.000000000 -0800
+++ linux-2.6.10/arch/mips/mm/pg-r4k.c 2005-01-21 11:51:39.000000000 -0800
@@ -39,9 +39,9 @@
static unsigned int clear_page_array[0x130 / 4];
-void clear_page(void * page) __attribute__((alias("clear_page_array")));
+void _clear_page(void * page) __attribute__((alias("clear_page_array")));
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
/*
* Maximum sizes:
Index: linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c
===================================================================
--- linux-2.6.10.orig/arch/m32r/kernel/m32r_ksyms.c 2004-12-24 13:34:29.000000000 -0800
+++ linux-2.6.10/arch/m32r/kernel/m32r_ksyms.c 2005-01-21 11:51:39.000000000 -0800
@@ -102,7 +102,7 @@ EXPORT_SYMBOL(memmove);
EXPORT_SYMBOL(memcmp);
EXPORT_SYMBOL(memscan);
EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(_clear_page);
EXPORT_SYMBOL(strcat);
EXPORT_SYMBOL(strchr);
Index: linux-2.6.10/include/asm-arm26/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-arm26/page.h 2004-12-24 13:35:22.000000000 -0800
+++ linux-2.6.10/include/asm-arm26/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -25,7 +25,7 @@ extern void copy_page(void *to, const vo
preempt_enable(); \
} while (0)
-#define clear_page(page) memzero((void *)(page), PAGE_SIZE)
+#define clear_page(page, order) memzero((void *)(page), PAGE_SIZE << (order))
#define copy_page(to, from) __copy_user_page(to, from, 0);
#undef STRICT_MM_TYPECHECKS
Index: linux-2.6.10/include/asm-sparc64/page.h
===================================================================
--- linux-2.6.10.orig/include/asm-sparc64/page.h 2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10/include/asm-sparc64/page.h 2005-01-21 11:51:39.000000000 -0800
@@ -14,8 +14,8 @@
#ifndef __ASSEMBLY__
-extern void _clear_page(void *page);
-#define clear_page(X) _clear_page((void *)(X))
+extern void _clear_page(void *page, unsigned long order);
+#define clear_page(X,Y) _clear_page((void *)(X),(Y))
struct page;
extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
#define copy_page(X,Y) memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
Index: linux-2.6.10/arch/sparc64/lib/clear_page.S
===================================================================
--- linux-2.6.10.orig/arch/sparc64/lib/clear_page.S 2004-12-24 13:35:23.000000000 -0800
+++ linux-2.6.10/arch/sparc64/lib/clear_page.S 2005-01-21 11:51:39.000000000 -0800
@@ -28,9 +28,12 @@
.text
.globl _clear_page
-_clear_page: /* %o0=dest */
+_clear_page: /* %o0=dest, %o1=order */
+ sethi %hi(PAGE_SIZE/64), %o2
+ clr %o4
+ or %o2, %lo(PAGE_SIZE/64), %o2
ba,pt %xcc, clear_page_common
- clr %o4
+ sllx %o2, %o1, %o1
/* This thing is pretty important, it shows up
* on the profiles via do_anonymous_page().
@@ -69,16 +72,16 @@ clear_user_page: /* %o0=dest, %o1=vaddr
flush %g6
wrpr %o4, 0x0, %pstate
+ sethi %hi(PAGE_SIZE/64), %o1
mov 1, %o4
+ or %o1, %lo(PAGE_SIZE/64), %o1
clear_page_common:
VISEntryHalf
membar #StoreLoad | #StoreStore | #LoadStore
fzero %f0
- sethi %hi(PAGE_SIZE/64), %o1
mov %o0, %g1 ! remember vaddr for tlbflush
fzero %f2
- or %o1, %lo(PAGE_SIZE/64), %o1
faddd %f0, %f2, %f4
fmuld %f0, %f2, %f6
faddd %f0, %f2, %f8
* A scrub daemon (prezeroing)
2005-01-08 21:56 ` David S. Miller
2005-01-21 20:09 ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
2005-01-21 20:12 ` Extend clear_page by an order parameter Christoph Lameter
@ 2005-01-21 20:15 ` Christoph Lameter
2 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-21 20:15 UTC (permalink / raw)
Cc: linux-mm, linux-kernel
Adds management of ZEROED and NOT_ZEROED pages and a background daemon
called scrubd. scrubd is disabled by default but can be enabled
by writing an order number to /proc/sys/vm/scrub_start. If a page
of that order or higher is coalesced, then the scrub daemon will
start zeroing until all pages of order /proc/sys/vm/scrub_stop and
higher are zeroed, and then go back to sleep.
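With the patch applied, the two sysctls described above would be driven roughly
like this (order values chosen purely for illustration):

```shell
# Start zeroing whenever a block of order 4 or higher coalesces ...
echo 4 > /proc/sys/vm/scrub_start
# ... and keep zeroing until everything of order 2 and above is zeroed.
echo 2 > /proc/sys/vm/scrub_stop
```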
In an SMP environment the scrub daemon typically runs
on the most idle cpu. Thus a single threaded application running
on one cpu may have another cpu zeroing pages for it. The scrub
daemon is hardly noticeable and usually finishes zeroing quickly since
most processors are optimized for linear memory filling.
Note that this patch does not depend on any of the other patches, but the
other patches improve what scrubd does: the extension of clear_page by an
order parameter increases the speed of zeroing, and the patch
introducing alloc_zeroed_user_highpage is necessary for user
pages to be allocated from the pool of zeroed pages.
Patch against 2.6.11-rc1-bk9
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.10/mm/page_alloc.c
===================================================================
--- linux-2.6.10.orig/mm/page_alloc.c 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/page_alloc.c 2005-01-21 12:01:44.000000000 -0800
@@ -12,6 +12,8 @@
* Zone balancing, Kanoj Sarcar, SGI, Jan 2000
* Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002
* (lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ * Page zeroing by Christoph Lameter, SGI, Dec 2004 based on
+ * initial code for __GFP_ZERO support by Andrea Arcangeli, Oct 2004.
*/
#include <linux/config.h>
@@ -33,6 +35,7 @@
#include <linux/cpu.h>
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
+#include <linux/scrub.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -167,16 +170,16 @@ static void destroy_compound_page(struct
* zone->lock is already acquired when we use these.
* So, we don't need atomic page->flags operations here.
*/
-static inline unsigned long page_order(struct page *page) {
+static inline unsigned long page_zorder(struct page *page) {
return page->private;
}
-static inline void set_page_order(struct page *page, int order) {
- page->private = order;
+static inline void set_page_zorder(struct page *page, int order, int zero) {
+ page->private = order + (zero << 10);
__SetPagePrivate(page);
}
-static inline void rmv_page_order(struct page *page)
+static inline void rmv_page_zorder(struct page *page)
{
__ClearPagePrivate(page);
page->private = 0;
@@ -187,14 +190,15 @@ static inline void rmv_page_order(struct
* we can do coalesce a page and its buddy if
* (a) the buddy is free &&
* (b) the buddy is on the buddy system &&
- * (c) a page and its buddy have the same order.
+ * (c) a page and its buddy have the same order and the same
+ * zeroing status.
* for recording page's order, we use page->private and PG_private.
*
*/
-static inline int page_is_buddy(struct page *page, int order)
+static inline int page_is_buddy(struct page *page, int order, int zero)
{
if (PagePrivate(page) &&
- (page_order(page) == order) &&
+ (page_zorder(page) == order + (zero << 10)) &&
!PageReserved(page) &&
page_count(page) == 0)
return 1;
@@ -225,22 +229,20 @@ static inline int page_is_buddy(struct p
* -- wli
*/
-static inline void __free_pages_bulk (struct page *page, struct page *base,
- struct zone *zone, unsigned int order)
+static inline int __free_pages_bulk (struct page *page, struct page *base,
+ struct zone *zone, unsigned int order, int zero)
{
unsigned long page_idx;
struct page *coalesced;
- int order_size = 1 << order;
if (unlikely(order))
destroy_compound_page(page, order);
page_idx = page - base;
- BUG_ON(page_idx & (order_size - 1));
+ BUG_ON(page_idx & ((1 << order) - 1));
BUG_ON(bad_range(zone, page));
- zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
struct free_area *area;
struct page *buddy;
@@ -250,20 +252,21 @@ static inline void __free_pages_bulk (st
buddy = base + buddy_idx;
if (bad_range(zone, buddy))
break;
- if (!page_is_buddy(buddy, order))
+ if (!page_is_buddy(buddy, order, zero))
break;
/* Move the buddy up one level. */
list_del(&buddy->lru);
- area = zone->free_area + order;
+ area = zone->free_area[zero] + order;
area->nr_free--;
- rmv_page_order(buddy);
+ rmv_page_zorder(buddy);
page_idx &= buddy_idx;
order++;
}
coalesced = base + page_idx;
- set_page_order(coalesced, order);
- list_add(&coalesced->lru, &zone->free_area[order].free_list);
- zone->free_area[order].nr_free++;
+ set_page_zorder(coalesced, order, zero);
+ list_add(&coalesced->lru, &zone->free_area[zero][order].free_list);
+ zone->free_area[zero][order].nr_free++;
+ return order;
}
static inline void free_pages_check(const char *function, struct page *page)
@@ -312,8 +315,11 @@ free_pages_bulk(struct zone *zone, int c
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
- __free_pages_bulk(page, base, zone, order);
+ if (__free_pages_bulk(page, base, zone, order, NOT_ZEROED)
+ >= sysctl_scrub_start)
+ wakeup_kscrubd(zone);
ret++;
+ zone->free_pages += 1UL << order;
}
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
@@ -341,6 +347,18 @@ void __free_pages_ok(struct page *page,
free_pages_bulk(page_zone(page), 1, &list, order);
}
+void end_zero_page(struct page *page, unsigned int order)
+{
+ unsigned long flags;
+ struct zone * zone = page_zone(page);
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ __free_pages_bulk(page, zone->zone_mem_map, zone, order, ZEROED);
+ zone->zero_pages += 1UL << order;
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
/*
* The order of subdivision here is critical for the IO subsystem.
@@ -358,7 +376,7 @@ void __free_pages_ok(struct page *page,
*/
static inline struct page *
expand(struct zone *zone, struct page *page,
- int low, int high, struct free_area *area)
+ int low, int high, struct free_area *area, int zero)
{
unsigned long size = 1 << high;
@@ -369,7 +387,7 @@ expand(struct zone *zone, struct page *p
BUG_ON(bad_range(zone, &page[size]));
list_add(&page[size].lru, &area->free_list);
area->nr_free++;
- set_page_order(&page[size], high);
+ set_page_zorder(&page[size], high, zero);
}
return page;
}
@@ -420,23 +438,44 @@ static void prep_new_page(struct page *p
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static inline void rmpage(struct page *page, struct free_area *area)
+{
+ list_del(&page->lru);
+ rmv_page_zorder(page);
+ area->nr_free--;
+}
+
+struct page *scrubd_rmpage(struct zone *zone, struct free_area *area)
+{
+ unsigned long flags;
+ struct page *page = NULL;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ if (!list_empty(&area->free_list)) {
+ page = list_entry(area->free_list.next, struct page, lru);
+ rmpage(page, area);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return page;
+}
+
+static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero)
{
- struct free_area * area;
+ struct free_area *area;
unsigned int current_order;
struct page *page;
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
+ area = zone->free_area[zero] + current_order;
if (list_empty(&area->free_list))
continue;
page = list_entry(area->free_list.next, struct page, lru);
- list_del(&page->lru);
- rmv_page_order(page);
- area->nr_free--;
+ rmpage(page, zone->free_area[zero] + current_order);
zone->free_pages -= 1UL << order;
- return expand(zone, page, order, current_order, area);
+ if (zero)
+ zone->zero_pages -= 1UL << order;
+ return expand(zone, page, order, current_order, area, zero);
}
return NULL;
@@ -448,7 +487,7 @@ static struct page *__rmqueue(struct zon
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list, int zero)
{
unsigned long flags;
int i;
@@ -457,7 +496,7 @@ static int rmqueue_bulk(struct zone *zon
spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, zero);
if (page == NULL)
break;
allocated++;
@@ -504,7 +543,7 @@ void mark_free_pages(struct zone *zone)
ClearPageNosaveFree(pfn_to_page(zone_pfn + zone->zone_start_pfn));
for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
+ list_for_each(curr, &zone->free_area[NOT_ZEROED][order].free_list) {
unsigned long start_pfn, i;
start_pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -591,7 +630,7 @@ void fastcall free_cold_page(struct page
free_hot_cold_page(page, 1);
}
-static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
+void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
{
int i;
@@ -610,7 +649,9 @@ buffered_rmqueue(struct zone *zone, int
{
unsigned long flags;
struct page *page = NULL;
- int cold = !!(gfp_flags & __GFP_COLD);
+ int nr_pages = 1 << order;
+ int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages);
+ int cold = !!(gfp_flags & __GFP_COLD) + 2*zero;
if (order == 0) {
struct per_cpu_pages *pcp;
@@ -619,7 +660,7 @@ buffered_rmqueue(struct zone *zone, int
local_irq_save(flags);
if (pcp->count <= pcp->low)
pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list, zero);
if (pcp->count) {
page = list_entry(pcp->list.next, struct page, lru);
list_del(&page->lru);
@@ -631,16 +672,25 @@ buffered_rmqueue(struct zone *zone, int
if (page == NULL) {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, zero);
+ /*
+ * If we failed to obtain a zero and/or unzeroed page
+ * then we may still be able to obtain the other
+ * type of page.
+ */
+ if (!page) {
+ page = __rmqueue(zone, order, !zero);
+ zero = 0;
+ }
spin_unlock_irqrestore(&zone->lock, flags);
}
if (page != NULL) {
BUG_ON(bad_range(zone, page));
- mod_page_state_zone(zone, pgalloc, 1 << order);
+ mod_page_state_zone(zone, pgalloc, nr_pages);
prep_new_page(page, order);
- if (gfp_flags & __GFP_ZERO)
+ if ((gfp_flags & __GFP_ZERO) && !zero)
prep_zero_page(page, order, gfp_flags);
if (order && (gfp_flags & __GFP_COMP))
@@ -669,7 +719,7 @@ int zone_watermark_ok(struct zone *z, in
return 0;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
- free_pages -= z->free_area[o].nr_free << o;
+ free_pages -= (z->free_area[NOT_ZEROED][o].nr_free + z->free_area[ZEROED][o].nr_free) << o;
/* Require fewer higher order pages to be free */
min >>= 1;
@@ -1046,7 +1096,7 @@ unsigned long __read_page_state(unsigned
}
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat)
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat)
{
struct zone *zones = pgdat->node_zones;
int i;
@@ -1054,27 +1104,31 @@ void __get_zone_counts(unsigned long *ac
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
*active += zones[i].nr_active;
*inactive += zones[i].nr_inactive;
*free += zones[i].free_pages;
+ *zero += zones[i].zero_pages;
}
}
void get_zone_counts(unsigned long *active,
- unsigned long *inactive, unsigned long *free)
+ unsigned long *inactive, unsigned long *free, unsigned long *zero)
{
struct pglist_data *pgdat;
*active = 0;
*inactive = 0;
*free = 0;
+ *zero = 0;
for_each_pgdat(pgdat) {
- unsigned long l, m, n;
- __get_zone_counts(&l, &m, &n, pgdat);
+ unsigned long l, m, n, o;
+ __get_zone_counts(&l, &m, &n, &o, pgdat);
*active += l;
*inactive += m;
*free += n;
+ *zero += o;
}
}
@@ -1111,6 +1165,7 @@ void si_meminfo_node(struct sysinfo *val
#define K(x) ((x) << (PAGE_SHIFT-10))
+const char *temp[3] = { "hot", "cold", "zero" };
/*
* Show free area list (used inside shift_scroll-lock stuff)
* We also calculate the percentage fragmentation. We do this by counting the
@@ -1123,6 +1178,7 @@ void show_free_areas(void)
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
struct zone *zone;
for_each_zone(zone) {
@@ -1143,10 +1199,10 @@ void show_free_areas(void)
pageset = zone->pageset + cpu;
- for (temperature = 0; temperature < 2; temperature++)
+ for (temperature = 0; temperature < 3; temperature++)
printk("cpu %d %s: low %d, high %d, batch %d\n",
cpu,
- temperature ? "cold" : "hot",
+ temp[temperature],
pageset->pcp[temperature].low,
pageset->pcp[temperature].high,
pageset->pcp[temperature].batch);
@@ -1154,20 +1210,21 @@ void show_free_areas(void)
}
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
printk("\nFree pages: %11ukB (%ukB HighMem)\n",
K(nr_free_pages()),
K(nr_free_highpages()));
printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu "
- "unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n",
+ "unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n",
active,
inactive,
ps.nr_dirty,
ps.nr_writeback,
ps.nr_unstable,
nr_free_pages(),
+ zero,
ps.nr_slab,
ps.nr_mapped,
ps.nr_page_table_pages);
@@ -1216,7 +1273,7 @@ void show_free_areas(void)
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < MAX_ORDER; order++) {
- nr = zone->free_area[order].nr_free;
+ nr = zone->free_area[NOT_ZEROED][order].nr_free + zone->free_area[ZEROED][order].nr_free;
total += nr << order;
printk("%lu*%lukB ", nr, K(1UL) << order);
}
@@ -1516,8 +1573,10 @@ void zone_init_free_lists(struct pglist_
{
int order;
for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
- zone->free_area[order].nr_free = 0;
+ INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list);
+ INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list);
+ zone->free_area[NOT_ZEROED][order].nr_free = 0;
+ zone->free_area[ZEROED][order].nr_free = 0;
}
}
@@ -1542,6 +1601,7 @@ static void __init free_area_init_core(s
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->kscrubd_wait);
pgdat->kswapd_max_order = 0;
for (j = 0; j < MAX_NR_ZONES; j++) {
@@ -1565,6 +1625,7 @@ static void __init free_area_init_core(s
spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
+ zone->zero_pages = 0;
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
@@ -1598,6 +1659,13 @@ static void __init free_area_init_core(s
pcp->high = 2 * batch;
pcp->batch = 1 * batch;
INIT_LIST_HEAD(&pcp->list);
+
+ pcp = &zone->pageset[cpu].pcp[2]; /* zero pages */
+ pcp->count = 0;
+ pcp->low = 0;
+ pcp->high = 2 * batch;
+ pcp->batch = 1 * batch;
+ INIT_LIST_HEAD(&pcp->list);
}
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
zone_names[j], realsize, batch);
@@ -1723,7 +1791,7 @@ static int frag_show(struct seq_file *m,
spin_lock_irqsave(&zone->lock, flags);
seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "%6lu ", zone->free_area[NOT_ZEROED][order].nr_free);
spin_unlock_irqrestore(&zone->lock, flags);
seq_putc(m, '\n');
}
Index: linux-2.6.10/include/linux/mmzone.h
===================================================================
--- linux-2.6.10.orig/include/linux/mmzone.h 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/mmzone.h 2005-01-21 11:56:07.000000000 -0800
@@ -51,7 +51,7 @@ struct per_cpu_pages {
};
struct per_cpu_pageset {
- struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */
+ struct per_cpu_pages pcp[3]; /* 0: hot. 1: cold. 2: cold zeroed pages */
#ifdef CONFIG_NUMA
unsigned long numa_hit; /* allocated in intended node */
unsigned long numa_miss; /* allocated in non intended node */
@@ -107,10 +107,14 @@ struct per_cpu_pageset {
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
+#define NOT_ZEROED 0
+#define ZEROED 1
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
+ unsigned long zero_pages;
/*
* protection[] is a pre-calculated number of extra pages that must be
* available in a zone in order for __alloc_pages() to allocate memory
@@ -131,7 +135,7 @@ struct zone {
* free areas of different sizes
*/
spinlock_t lock;
- struct free_area free_area[MAX_ORDER];
+ struct free_area free_area[2][MAX_ORDER];
ZONE_PADDING(_pad1_)
@@ -266,6 +270,9 @@ typedef struct pglist_data {
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
+
+ wait_queue_head_t kscrubd_wait;
+ struct task_struct *kscrubd;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -274,9 +281,9 @@ typedef struct pglist_data {
extern struct pglist_data *pgdat_list;
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free, struct pglist_data *pgdat);
+ unsigned long *free, unsigned long *zero, struct pglist_data *pgdat);
void get_zone_counts(unsigned long *active, unsigned long *inactive,
- unsigned long *free);
+ unsigned long *free, unsigned long *zero);
void build_all_zonelists(void);
void wakeup_kswapd(struct zone *zone, int order);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
Index: linux-2.6.10/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.10.orig/fs/proc/proc_misc.c 2005-01-21 10:43:58.000000000 -0800
+++ linux-2.6.10/fs/proc/proc_misc.c 2005-01-21 11:56:07.000000000 -0800
@@ -123,12 +123,13 @@ static int meminfo_read_proc(char *page,
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
unsigned long committed;
unsigned long allowed;
struct vmalloc_info vmi;
get_page_state(&ps);
- get_zone_counts(&active, &inactive, &free);
+ get_zone_counts(&active, &inactive, &free, &zero);
/*
* display in kilobytes.
@@ -148,6 +149,7 @@ static int meminfo_read_proc(char *page,
len = sprintf(page,
"MemTotal: %8lu kB\n"
"MemFree: %8lu kB\n"
+ "MemZero: %8lu kB\n"
"Buffers: %8lu kB\n"
"Cached: %8lu kB\n"
"SwapCached: %8lu kB\n"
@@ -171,6 +173,7 @@ static int meminfo_read_proc(char *page,
"VmallocChunk: %8lu kB\n",
K(i.totalram),
K(i.freeram),
+ K(zero),
K(i.bufferram),
K(get_page_cache_size()-total_swapcache_pages-i.bufferram),
K(total_swapcache_pages),
Index: linux-2.6.10/mm/readahead.c
===================================================================
--- linux-2.6.10.orig/mm/readahead.c 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/readahead.c 2005-01-21 11:56:07.000000000 -0800
@@ -573,7 +573,8 @@ unsigned long max_sane_readahead(unsigne
unsigned long active;
unsigned long inactive;
unsigned long free;
+ unsigned long zero;
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
Index: linux-2.6.10/drivers/base/node.c
===================================================================
--- linux-2.6.10.orig/drivers/base/node.c 2005-01-21 10:43:56.000000000 -0800
+++ linux-2.6.10/drivers/base/node.c 2005-01-21 11:56:07.000000000 -0800
@@ -42,13 +42,15 @@ static ssize_t node_read_meminfo(struct
unsigned long inactive;
unsigned long active;
unsigned long free;
+ unsigned long zero;
si_meminfo_node(&i, nid);
- __get_zone_counts(&active, &inactive, &free, NODE_DATA(nid));
+ __get_zone_counts(&active, &inactive, &free, &zero, NODE_DATA(nid));
n = sprintf(buf, "\n"
"Node %d MemTotal: %8lu kB\n"
"Node %d MemFree: %8lu kB\n"
+ "Node %d MemZero: %8lu kB\n"
"Node %d MemUsed: %8lu kB\n"
"Node %d Active: %8lu kB\n"
"Node %d Inactive: %8lu kB\n"
@@ -58,6 +60,7 @@ static ssize_t node_read_meminfo(struct
"Node %d LowFree: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
+ nid, K(zero),
nid, K(i.totalram - i.freeram),
nid, K(active),
nid, K(inactive),
Index: linux-2.6.10/include/linux/sched.h
===================================================================
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-21 10:44:03.000000000 -0800
+++ linux-2.6.10/include/linux/sched.h 2005-01-21 11:56:07.000000000 -0800
@@ -736,6 +736,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
+#define PF_KSCRUBD 0x00800000 /* I am kscrubd */
#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
Index: linux-2.6.10/mm/Makefile
===================================================================
--- linux-2.6.10.orig/mm/Makefile 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/mm/Makefile 2005-01-21 11:56:07.000000000 -0800
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o
+ vmalloc.o scrubd.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
Index: linux-2.6.10/mm/scrubd.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/mm/scrubd.c 2005-01-21 11:56:07.000000000 -0800
@@ -0,0 +1,134 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = 5; /* if a page of this order is coalesced then run kscrubd */
+unsigned int sysctl_scrub_stop = 2; /* Minimum order of page to zero */
+unsigned int sysctl_scrub_load = 999; /* Do not run kscrubd if the load average exceeds this */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec(table, write, file, buffer, length, ppos);
+ if (sysctl_scrub_start < MAX_ORDER) {
+ struct zone *zone;
+
+ for_each_zone(zone)
+ wakeup_kscrubd(zone);
+ }
+ return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+ int order;
+
+ for(order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+ struct free_area *area = z->free_area[NOT_ZEROED] + order;
+ if (!list_empty(&area->free_list)) {
+ struct page *page = scrubd_rmpage(z, area);
+ struct list_head *l;
+ int size = PAGE_SIZE << order;
+
+ if (!page)
+ continue;
+
+ list_for_each(l, &zero_drivers) {
+ struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+
+ if (driver->start(page_address(page), size) == 0)
+ goto done;
+ }
+
+ /* Unable to find a zeroing device that would
+ * deal with this page so just do it on our own.
+ * This will likely thrash the cpu caches.
+ */
+ cond_resched();
+ prep_zero_page(page, order, 0);
+done:
+ end_zero_page(page, order);
+ cond_resched();
+ return 1 << order;
+ }
+ }
+ return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+ int i;
+ unsigned long pages_zeroed;
+
+ if (system_state != SYSTEM_RUNNING)
+ return;
+
+ do {
+ pages_zeroed = 0;
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ pages_zeroed += zero_highest_order_page(zone);
+ }
+ } while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t*)p;
+ struct task_struct *tsk = current;
+ DEFINE_WAIT(wait);
+ cpumask_t cpumask;
+
+ daemonize("kscrubd%d", pgdat->node_id);
+ cpumask = node_to_cpumask(pgdat->node_id);
+ if (!cpus_empty(cpumask))
+ set_cpus_allowed(tsk, cpumask);
+
+ tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+ for ( ; ; ) {
+ if (current->flags & PF_FREEZE)
+ refrigerator(PF_FREEZE);
+ prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+ schedule();
+ finish_wait(&pgdat->kscrubd_wait, &wait);
+
+ scrub_pgdat(pgdat);
+ }
+ return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+ pg_data_t *pgdat;
+ for_each_pgdat(pgdat)
+ pgdat->kscrubd
+ = find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+ return 0;
+}
+
+module_init(kscrubd_init)
Index: linux-2.6.10/include/linux/scrub.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.10/include/linux/scrub.h 2005-01-21 11:56:07.000000000 -0800
@@ -0,0 +1,49 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may that allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+ int (*start)(void *, unsigned long); /* Start bzero transfer */
+ struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+extern unsigned int sysctl_scrub_load;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+ list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+ list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+ if (avenrun[0] >= ((unsigned long)sysctl_scrub_load << FSHIFT))
+ return;
+ if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+ return;
+ wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page, unsigned int order);
+#endif
Index: linux-2.6.10/kernel/sysctl.c
===================================================================
--- linux-2.6.10.orig/kernel/sysctl.c 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/kernel/sysctl.c 2005-01-21 11:56:07.000000000 -0800
@@ -40,6 +40,7 @@
#include <linux/times.h>
#include <linux/limits.h>
#include <linux/dcache.h>
+#include <linux/scrub.h>
#include <linux/syscalls.h>
#include <asm/uaccess.h>
@@ -827,6 +828,33 @@ static ctl_table vm_table[] = {
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = VM_SCRUB_START,
+ .procname = "scrub_start",
+ .data = &sysctl_scrub_start,
+ .maxlen = sizeof(sysctl_scrub_start),
+ .mode = 0644,
+ .proc_handler = &scrub_start_handler,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_STOP,
+ .procname = "scrub_stop",
+ .data = &sysctl_scrub_stop,
+ .maxlen = sizeof(sysctl_scrub_stop),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_SCRUB_LOAD,
+ .procname = "scrub_load",
+ .data = &sysctl_scrub_load,
+ .maxlen = sizeof(sysctl_scrub_load),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ },
{ .ctl_name = 0 }
};
Index: linux-2.6.10/include/linux/sysctl.h
===================================================================
--- linux-2.6.10.orig/include/linux/sysctl.h 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/sysctl.h 2005-01-21 11:56:07.000000000 -0800
@@ -169,6 +169,9 @@ enum
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_SCRUB_START=30, /* page order whose coalescing wakes kscrubd */
+ VM_SCRUB_STOP=31, /* minimum page order that kscrubd will zero */
+ VM_SCRUB_LOAD=32, /* load average above which kscrubd does not run */
};
Index: linux-2.6.10/include/linux/gfp.h
===================================================================
--- linux-2.6.10.orig/include/linux/gfp.h 2005-01-21 10:43:59.000000000 -0800
+++ linux-2.6.10/include/linux/gfp.h 2005-01-21 11:56:07.000000000 -0800
@@ -131,4 +131,5 @@ extern void FASTCALL(free_cold_page(stru
void page_alloc_init(void);
+void prep_zero_page(struct page *, unsigned int order, unsigned int gfp_flags);
#endif /* __LINUX_GFP_H */
^ permalink raw reply [flat|nested] 89+ messages in thread
* Re: Extend clear_page by an order parameter
2005-01-21 20:12 ` Extend clear_page by an order parameter Christoph Lameter
@ 2005-01-21 22:29 ` Paul Mackerras
2005-01-21 23:48 ` Christoph Lameter
2005-01-23 7:45 ` Andrew Morton
1 sibling, 1 reply; 89+ messages in thread
From: Paul Mackerras @ 2005-01-21 22:29 UTC (permalink / raw)
To: Christoph Lameter
Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
linux-mm, linux-kernel
Christoph Lameter writes:
> The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> clear_page that is capable of zeroing multiple pages at once (and scrubd
> too but that is now an independent patch). The following patch extends
> clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> efficient zeroing of pages. Hope I caught everything....
Wouldn't it be nicer to call the version that takes the order
parameter "clear_pages" and then define clear_page(p) as
clear_pages(p, 0) ?
Paul.
* Re: Extend clear_page by an order parameter
2005-01-21 22:29 ` Paul Mackerras
@ 2005-01-21 23:48 ` Christoph Lameter
2005-01-22 0:35 ` Paul Mackerras
0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-21 23:48 UTC (permalink / raw)
To: Paul Mackerras
Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
linux-mm, linux-kernel
On Sat, 22 Jan 2005, Paul Mackerras wrote:
> Christoph Lameter writes:
>
> > The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> > clear_page that is capable of zeroing multiple pages at once (and scrubd
> > too but that is now an independent patch). The following patch extends
> > clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> > efficient zeroing of pages. Hope I caught everything....
>
> Wouldn't it be nicer to call the version that takes the order
> parameter "clear_pages" and then define clear_page(p) as
> clear_pages(p, 0) ?
clear_page clears one page of the specified order. clear_page cannot clear
multiple pages. Calling the function clear_pages would give a wrong
impression of what the function does and may lead to attempts to specify
the number of zero order pages as a parameter instead of the order.
* Re: Extend clear_page by an order parameter
2005-01-21 23:48 ` Christoph Lameter
@ 2005-01-22 0:35 ` Paul Mackerras
2005-01-22 0:43 ` Andrew Morton
0 siblings, 1 reply; 89+ messages in thread
From: Paul Mackerras @ 2005-01-22 0:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: David S. Miller, Hugh Dickins, akpm, linux-ia64, torvalds,
linux-mm, linux-kernel
Christoph Lameter writes:
> clear_page clears one page of the specified order.
Now you're really being confusing. A cluster of 2^n contiguous pages
isn't one page by any normal definition. Call it "clear_page_cluster"
or "clear_page_order" or something, but not "clear_page".
Paul.
* Re: Extend clear_page by an order parameter
2005-01-22 0:35 ` Paul Mackerras
@ 2005-01-22 0:43 ` Andrew Morton
2005-01-22 1:08 ` Paul Mackerras
` (2 more replies)
0 siblings, 3 replies; 89+ messages in thread
From: Andrew Morton @ 2005-01-22 0:43 UTC (permalink / raw)
To: Paul Mackerras
Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
Paul Mackerras <paulus@samba.org> wrote:
>
> A cluster of 2^n contiguous pages
> isn't one page by any normal definition.
It is, actually, from the POV of the page allocator. It's a "higher order
page" and is controlled by a struct page*, just like a zero-order page...
* Re: Extend clear_page by an order parameter
2005-01-22 0:43 ` Andrew Morton
@ 2005-01-22 1:08 ` Paul Mackerras
2005-01-22 1:20 ` Roman Zippel
2005-01-22 1:25 ` Paul Mackerras
2 siblings, 0 replies; 89+ messages in thread
From: Paul Mackerras @ 2005-01-22 1:08 UTC (permalink / raw)
To: Andrew Morton
Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
Andrew Morton writes:
> It is, actually, from the POV of the page allocator. It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...
OK. I still reckon it's confusing terminology for the rest of us who
don't have our heads deep in the page allocator code.
Paul.
* Re: Extend clear_page by an order parameter
2005-01-22 0:43 ` Andrew Morton
2005-01-22 1:08 ` Paul Mackerras
@ 2005-01-22 1:20 ` Roman Zippel
2005-01-22 1:25 ` Paul Mackerras
2 siblings, 0 replies; 89+ messages in thread
From: Roman Zippel @ 2005-01-22 1:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Paul Mackerras, clameter, davem, hugh, linux-ia64, torvalds,
linux-mm, linux-kernel
Hi,
On Fri, 21 Jan 2005, Andrew Morton wrote:
> Paul Mackerras <paulus@samba.org> wrote:
> >
> > A cluster of 2^n contiguous pages
> > isn't one page by any normal definition.
>
> It is, actually, from the POV of the page allocator. It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...
OTOH we also have alloc_page/alloc_pages.
bye, Roman
* Re: Extend clear_page by an order parameter
2005-01-22 0:43 ` Andrew Morton
2005-01-22 1:08 ` Paul Mackerras
2005-01-22 1:20 ` Roman Zippel
@ 2005-01-22 1:25 ` Paul Mackerras
2005-01-22 1:54 ` Christoph Lameter
2 siblings, 1 reply; 89+ messages in thread
From: Paul Mackerras @ 2005-01-22 1:25 UTC (permalink / raw)
To: Andrew Morton
Cc: clameter, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
Andrew Morton writes:
> It is, actually, from the POV of the page allocator. It's a "higher order
> page" and is controlled by a struct page*, just like a zero-order page...
So why is the function that gets me one of these "higher order pages"
called "get_free_pages" with an "s"? :)
Christoph's patch is bigger than it needs to be because he has to
change all the occurrences of clear_page(x) to clear_page(x, 0), and
then he has to change a lot of architectures' clear_page functions to
be called _clear_page instead. If he picked a different name for the
"clear a higher order page" function it would end up being less
invasive as well as less confusing.
The argument that clear_page is called that because it clears a higher
order page won't wash; all the clear_page implementations in his patch
are perfectly capable of clearing any contiguous set of 2^order pages
(oops, I mean "zero-order pages"), not just a "higher order page".
Paul.
* Re: Extend clear_page by an order parameter
2005-01-22 1:25 ` Paul Mackerras
@ 2005-01-22 1:54 ` Christoph Lameter
2005-01-22 2:53 ` Paul Mackerras
0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-22 1:54 UTC (permalink / raw)
To: Paul Mackerras
Cc: Andrew Morton, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
On Sat, 22 Jan 2005, Paul Mackerras wrote:
> Christoph's patch is bigger than it needs to be because he has to
> change all the occurrences of clear_page(x) to clear_page(x, 0), and
> then he has to change a lot of architectures' clear_page functions to
> be called _clear_page instead. If he picked a different name for the
> "clear a higher order page" function it would end up being less
> invasive as well as less confusing.
I had the name "zero_page" in V1 and V2 of the patch where it was
separate. Then someone complained about code duplication.
> The argument that clear_page is called that because it clears a higher
> order page won't wash; all the clear_page implementations in his patch
> are perfectly capable of clearing any contiguous set of 2^order pages
> (oops, I mean "zero-order pages"), not just a "higher order page".
clear_page is called clear_page because it clears one page of *any* order
not just higher orders. zero-order pages are not segregated nor are they
intrinsically better just because they contain more memory ;-).
* Re: Extend clear_page by an order parameter
2005-01-22 1:54 ` Christoph Lameter
@ 2005-01-22 2:53 ` Paul Mackerras
0 siblings, 0 replies; 89+ messages in thread
From: Paul Mackerras @ 2005-01-22 2:53 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
Christoph Lameter writes:
> I had the name "zero_page" in V1 and V2 of the patch where it was
> separate. Then someone complained about code duplication.
Well, if you duplicated each arch's clear_page implementation in
zero_page, then yes, that would be unnecessary code duplication. I
would suggest that for architectures where the clear_page
implementation can easily be extended, rename it to clear_page_order
(or something) and #define clear_page(x) to be clear_page_order(x, 0).
For architectures where it can't, leave clear_page as clear_page and
define clear_page_order as an inline function that calls clear_page in
a loop.
> clear_page is called clear_page because it clears one page of *any* order
> not just higher orders. zero-order pages are not segregated nor are they
> intrisincally better just because they contain more memory ;-).
You have missed my point, which was about address constraints, not a
distinction between zero-order pages and higher-order pages.
Anyway, I remain of the opinion that your naming is inconsistent with
the naming of other functions that deal with zero-order and
higher-order pages, such as get_free_pages, alloc_pages, free_pages,
etc., and that your patch is unnecessarily intrusive. I guess it's up
to Andrew to decide which way we go.
Paul.
* Re: Extend clear_page by an order parameter
2005-01-21 20:12 ` Extend clear_page by an order parameter Christoph Lameter
2005-01-21 22:29 ` Paul Mackerras
@ 2005-01-23 7:45 ` Andrew Morton
2005-01-24 16:37 ` Christoph Lameter
1 sibling, 1 reply; 89+ messages in thread
From: Andrew Morton @ 2005-01-23 7:45 UTC (permalink / raw)
To: Christoph Lameter
Cc: davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
Christoph Lameter <clameter@sgi.com> wrote:
>
> The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> clear_page that is capable of zeroing multiple pages at once (and scrubd
> too but that is now an independent patch). The following patch extends
> clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> efficient zeroing of pages. Hope I caught everything....
>
Sorry, I take it back. As Paul says:
: Wouldn't it be nicer to call the version that takes the order
: parameter "clear_pages" and then define clear_page(p) as
: clear_pages(p, 0) ?
It would make the patch considerably smaller, and our naming is all over
the place anyway...
> -static inline void prep_zero_page(struct page *page, int order, int gfp_flags)
> +void prep_zero_page(struct page *page, unsigned int order, unsigned int gfp_flags)
> {
> int i;
>
> BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
> + if (!PageHighMem(page)) {
> + clear_page(page_address(page), order);
> + return;
> + }
> +
> for(i = 0; i < (1 << order); i++)
> clear_highpage(page + i);
> }
I'd have thought that we'd want to make the new clear_pages() handle
highmem pages too, if only from a regularity POV. x86 hugetlbpages could
use it then, if someone thinks up a fast page-clearer.
* Re: Extend clear_page by an order parameter
2005-01-23 7:45 ` Andrew Morton
@ 2005-01-24 16:37 ` Christoph Lameter
2005-01-24 20:23 ` David S. Miller
0 siblings, 1 reply; 89+ messages in thread
From: Christoph Lameter @ 2005-01-24 16:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: davem, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
On Sat, 22 Jan 2005, Andrew Morton wrote:
> Christoph Lameter <clameter@sgi.com> wrote:
> >
> > The zeroing of a page of an arbitrary order in page_alloc.c and in hugetlb.c may benefit from a
> > clear_page that is capable of zeroing multiple pages at once (and scrubd
> > too but that is now an independent patch). The following patch extends
> > clear_page with a second parameter specifying the order of the page to be zeroed to allow an
> > efficient zeroing of pages. Hope I caught everything....
> >
>
> Sorry, I take it back. As Paul says:
>
> : Wouldn't it be nicer to call the version that takes the order
> : parameter "clear_pages" and then define clear_page(p) as
> : clear_pages(p, 0) ?
> It would make the patch considerably smaller, and our naming is all over
> the place anyway...
Sounds good. Note though that this just means renaming clear_page to
clear_pages for all arches which would increase the patch size for the
arch specific section.
> I'd have thought that we'd want to make the new clear_pages() handle
> highmem pages too, if only from a regularity POV. x86 hugetlbpages could
> use it then, if someone thinks up a fast page-clearer.
That would get us back to code duplication. We would have a clear_page (no
highmem support) and a clear_pages (supporting highmem). Then it may
also be better to pass the page struct to clear_pages instead of a memory address.
* Re: Extend clear_page by an order parameter
2005-01-24 16:37 ` Christoph Lameter
@ 2005-01-24 20:23 ` David S. Miller
2005-01-24 20:33 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: David S. Miller @ 2005-01-24 20:23 UTC (permalink / raw)
To: Christoph Lameter
Cc: akpm, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
On Mon, 24 Jan 2005 08:37:15 -0800 (PST)
Christoph Lameter <clameter@sgi.com> wrote:
> Then it may also be better to pass the page struct to clear_pages
> instead of a memory address.
What is more generally available at the call sites at this time?
Consider both HIGHMEM and non-HIGHMEM setups in your estimation
please :-)
* Re: Extend clear_page by an order parameter
2005-01-24 20:23 ` David S. Miller
@ 2005-01-24 20:33 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-01-24 20:33 UTC (permalink / raw)
To: David S. Miller; +Cc: akpm, hugh, linux-ia64, torvalds, linux-mm, linux-kernel
On Mon, 24 Jan 2005, David S. Miller wrote:
> On Mon, 24 Jan 2005 08:37:15 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > Then it may also be better to pass the page struct to clear_pages
> > instead of a memory address.
>
> What is more generally available at the call sites at this time?
> Consider both HIGHMEM and non-HIGHMEM setups in your estimation
> please :-)
The only call site is prep_zero_page which has a GFP flag, the order and
the pointer to struct page.
The patch makes the huge page code call prep_zero_page and scrubd will
also call prep_zero_page.
* [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL
2005-01-21 20:09 ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
@ 2005-02-09 9:58 ` Michael Ellerman
2005-02-10 0:38 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Michael Ellerman @ 2005-02-09 9:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: davem, hugh, akpm, linux-ia64, torvalds, linux-mm, linux-kernel
Hi All,
The generic and IA-64 versions of alloc_zeroed_user_highpage() don't check the return value from alloc_page_vma(). This can lead to an oops if we're OOM.
This fixes my oops on PPC64, but I haven't got an IA-64 machine/compiler handy.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
diff -rN -p -u oombreakage-old/include/asm-ia64/page.h oombreakage-new/include/asm-ia64/page.h
--- oombreakage-old/include/asm-ia64/page.h 2005-02-04 04:10:37.000000000 +1100
+++ oombreakage-new/include/asm-ia64/page.h 2005-02-09 20:53:37.000000000 +1100
@@ -79,7 +79,8 @@ do { \
#define alloc_zeroed_user_highpage(vma, vaddr) \
({ \
struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
- flush_dcache_page(page); \
+ if (page) \
+ flush_dcache_page(page); \
page; \
})
diff -rN -p -u oombreakage-old/include/linux/highmem.h oombreakage-new/include/linux/highmem.h
--- oombreakage-old/include/linux/highmem.h 2005-02-09 20:22:41.000000000 +1100
+++ oombreakage-new/include/linux/highmem.h 2005-02-09 20:47:01.000000000 +1100
@@ -48,7 +48,9 @@ alloc_zeroed_user_highpage(struct vm_are
{
struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
- clear_user_highpage(page, vaddr);
+ if (page)
+ clear_user_highpage(page, vaddr);
+
return page;
}
#endif
* Re: [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL
2005-02-09 9:58 ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
@ 2005-02-10 0:38 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2005-02-10 0:38 UTC (permalink / raw)
To: Michael Ellerman; +Cc: akpm, linux-kernel
On Wed, 9 Feb 2005, Michael Ellerman wrote:
> The generic and IA-64 versions of alloc_zeroed_user_highpage() don't
> check the return value from alloc_page_vma(). This can lead to an oops
> if we're OOM. This fixes my oops on PPC64, but I haven't got an IA-64
> machine/compiler handy.
Patch looks okay to me. These are the only occurrences as far as I can tell
after reviewing the alloc_zeroed_user_highpage implementations in
include/asm-*/page.h.
* Re: Prezeroing V2 [0/3]: Why and When it works
2004-12-23 20:27 ` Prezeroing V2 [0/3]: Why and When it works Andi Kleen
@ 2004-12-23 21:02 ` Christoph Lameter
0 siblings, 0 replies; 89+ messages in thread
From: Christoph Lameter @ 2004-12-23 21:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel
On Thu, 23 Dec 2004, Andi Kleen wrote:
> > 1. Aggregating zeroing operations to only apply to pages of higher order,
> > which results in many pages that will later become order 0 to be
> > zeroed in one go. For that purpose the existing clear_page function is
> > extended and made to take an additional argument specifying the order of
> > the page to be cleared.
>
> But if you do that you should really use a separate function that
> can use cache bypassing stores.
>
> Normal clear_page cannot use that because it would be a loss
> when the data is soon used.
Clear_page is used both in the cache hot and no cache wanted case now.
> So the two changes don't really make sense.
Which two changes?
If an arch can do zeroing without touching the cpu caches then that can
be done with a zero driver.
> Also I must say I'm still suspicious regarding your heuristic
> to trigger gang faulting - with bad luck it could lead to a lot
> more memory usage to specific applications that do very sparse
> usage of memory.
Gang faulting is not part of this patch. Please keep the issues separate.
> There should be at least an madvise flag to turn it off and a sysctl
> and it would be better to trigger only on a longer sequence of
> consecutive faulted pages.
Again this is not related to this patchset. Look at the V13 of the page
fault scalability patch and you will find a /proc/sys/vm setting to
manipulate things. This is V2 of the prezeroing patch.
> How about some numbers on i386?
Umm. Yeah. I only have smallish i386 machines here. Maybe next year ;-)
* Re: Prezeroing V2 [0/3]: Why and When it works
[not found] ` <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
@ 2004-12-23 20:27 ` Andi Kleen
2004-12-23 21:02 ` Christoph Lameter
0 siblings, 1 reply; 89+ messages in thread
From: Andi Kleen @ 2004-12-23 20:27 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
> and why other approaches have not worked.
> o Instead of zero_page(p,order) extend clear_page to take second argument
> o Update all architectures to accept second argument for clear_pages
Sorry if there was a miscommunication, but ...
> 1. Aggregating zeroing operations to only apply to pages of higher order,
> which results in many pages that will later become order 0 to be
> zeroed in one go. For that purpose the existing clear_page function is
> extended and made to take an additional argument specifying the order of
> the page to be cleared.
But if you do that you should really use a separate function that
can use cache bypassing stores.
Normal clear_page cannot use that because it would be a loss
when the data is soon used.
So the two changes don't really make sense.
Also I must say I'm still suspicious of your heuristic for triggering
gang faulting - with bad luck it could lead to a lot more memory usage
for applications that make very sparse use of memory.
There should at least be an madvise flag to turn it off and a sysctl,
and it would be better to trigger only on a longer sequence of
consecutive faulted pages.
> 2. Hardware support for offloading zeroing from the cpu, which avoids
> invalidating the cpu caches through extensive zeroing operations.
>
> The result is a significant increase in page fault performance even for
> single-threaded applications:
[...]
How about some numbers on i386?
-Andi
Thread overview: 89+ messages (newest: 2005-02-10 0:39 UTC)
[not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>
[not found] ` <41C20E3E.3070209@yahoo.com.au>
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-21 19:56 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Christoph Lameter
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [2/3]: zeroing and scrubd Christoph Lameter
2005-01-01 2:22 ` Nick Piggin
2005-01-01 2:55 ` pmarques
2004-12-21 19:57 ` Increase page fault rate by prezeroing V1 [3/3]: Altix SN2 BTE Zeroing Christoph Lameter
2004-12-22 12:46 ` Robin Holt
2004-12-22 19:56 ` Christoph Lameter
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
2004-12-23 19:33 ` Prezeroing V2 [2/4]: add second parameter to clear_page() for all arches Christoph Lameter
2004-12-24 8:33 ` Pavel Machek
2004-12-24 16:18 ` Christoph Lameter
2004-12-24 16:27 ` Pavel Machek
2004-12-24 17:02 ` David S. Miller
2004-12-24 17:05 ` David S. Miller
2004-12-27 22:48 ` David S. Miller
2005-01-03 17:52 ` Christoph Lameter
2005-01-01 10:24 ` Geert Uytterhoeven
2005-01-04 23:12 ` Prezeroing V3 [0/4]: Discussion and i386 performance tests Christoph Lameter
2005-01-04 23:13 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-04 23:45 ` Dave Hansen
2005-01-05 1:16 ` Christoph Lameter
2005-01-05 1:26 ` Linus Torvalds
2005-01-05 23:11 ` Christoph Lameter
2005-01-05 0:34 ` Linus Torvalds
2005-01-05 0:47 ` Andrew Morton
2005-01-05 1:15 ` Christoph Lameter
2005-01-08 21:12 ` Hugh Dickins
2005-01-08 21:56 ` David S. Miller
2005-01-21 20:09 ` alloc_zeroed_user_highpage to fix the clear_user_highpage issue Christoph Lameter
2005-02-09 9:58 ` [Patch] Fix oops in alloc_zeroed_user_highpage() when page is NULL Michael Ellerman
2005-02-10 0:38 ` Christoph Lameter
2005-01-21 20:12 ` Extend clear_page by an order parameter Christoph Lameter
2005-01-21 22:29 ` Paul Mackerras
2005-01-21 23:48 ` Christoph Lameter
2005-01-22 0:35 ` Paul Mackerras
2005-01-22 0:43 ` Andrew Morton
2005-01-22 1:08 ` Paul Mackerras
2005-01-22 1:20 ` Roman Zippel
2005-01-22 1:25 ` Paul Mackerras
2005-01-22 1:54 ` Christoph Lameter
2005-01-22 2:53 ` Paul Mackerras
2005-01-23 7:45 ` Andrew Morton
2005-01-24 16:37 ` Christoph Lameter
2005-01-24 20:23 ` David S. Miller
2005-01-24 20:33 ` Christoph Lameter
2005-01-21 20:15 ` A scrub daemon (prezeroing) Christoph Lameter
2005-01-10 17:16 ` Prezeroing V3 [1/4]: Allow request for zeroed memory Christoph Lameter
2005-01-10 18:13 ` Linus Torvalds
2005-01-10 20:17 ` Christoph Lameter
2005-01-10 23:53 ` Prezeroing V4 [0/4]: Overview Christoph Lameter
2005-01-10 23:54 ` Prezeroing V4 [1/4]: Arch specific page zeroing during page fault Christoph Lameter
2005-01-11 0:41 ` Chris Wright
2005-01-11 0:46 ` Christoph Lameter
2005-01-11 0:49 ` Chris Wright
2005-01-10 23:55 ` Prezeroing V4 [2/4]: Zeroing implementation Christoph Lameter
2005-01-10 23:55 ` Prezeroing V4 [3/4]: Altix SN2 BTE zero driver Christoph Lameter
2005-01-10 23:56 ` Prezeroing V4 [4/4]: Extend clear_page to take an order parameter Christoph Lameter
2005-01-04 23:14 ` Prezeroing V3 [2/4]: Extension of " Christoph Lameter
2005-01-05 23:25 ` Christoph Lameter
2005-01-06 13:52 ` Andi Kleen
2005-01-06 17:47 ` Christoph Lameter
2005-01-04 23:15 ` Prezeroing V3 [3/4]: Page zeroing through kscrubd Christoph Lameter
2005-01-04 23:16 ` Prezeroing V3 [4/4]: Driver for hardware zeroing on Altix Christoph Lameter
2005-01-05 2:16 ` Andi Kleen
2005-01-05 16:24 ` Christoph Lameter
2004-12-23 19:34 ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter
2004-12-23 19:35 ` Prezeroing V2 [4/4]: Hardware Zeroing through SGI BTE Christoph Lameter
2004-12-23 20:08 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Brian Gerst
2004-12-24 16:24 ` Christoph Lameter
2004-12-23 19:49 ` Prezeroing V2 [0/3]: Why and When it works Arjan van de Ven
2004-12-23 20:57 ` Matt Mackall
2004-12-23 21:01 ` Paul Mackerras
2004-12-23 21:11 ` Paul Mackerras
2004-12-23 21:37 ` Andrew Morton
2004-12-23 23:00 ` Paul Mackerras
2004-12-23 21:48 ` Linus Torvalds
2004-12-23 22:34 ` Zwane Mwaikambo
2004-12-24 9:14 ` Arjan van de Ven
2004-12-24 18:21 ` Linus Torvalds
2004-12-24 18:57 ` Arjan van de Ven
2004-12-27 22:50 ` David S. Miller
2004-12-28 11:53 ` Marcelo Tosatti
2004-12-24 16:17 ` Christoph Lameter
2004-12-24 18:31 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Andrea Arcangeli
2005-01-03 17:54 ` Christoph Lameter
[not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com.suse.lists.linux.kernel>
[not found] ` <41C20E3E.3070209@yahoo.com.au.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.58.0412211154100.1313@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
[not found] ` <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
2004-12-23 20:27 ` Prezeroing V2 [0/3]: Why and When it works Andi Kleen
2004-12-23 21:02 ` Christoph Lameter