* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
       [not found] ` <fa.n04s9ar.17sg3f@ifi.uio.no>
@ 2004-12-24 21:10 ` Bodo Eggert
  2004-12-26 23:02   ` Florian Weimer
  0 siblings, 1 reply; 9+ messages in thread
From: Bodo Eggert @ 2004-12-24 21:10 UTC (permalink / raw)
  To: Christoph Lameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Christoph Lameter wrote:

> o Add scrub daemon

Please use names a simple user may understand.

What about memcleand or zeropaged instead?

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-24 21:10 ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Bodo Eggert
@ 2004-12-26 23:02   ` Florian Weimer
  2004-12-26 23:12     ` Linus Torvalds
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2004-12-26 23:02 UTC (permalink / raw)
  To: 7eggert
  Cc: Christoph Lameter, akpm, linux-ia64, torvalds, linux-mm, linux-kernel

* Bodo Eggert:

> Christoph Lameter wrote:
>
>> o Add scrub daemon
>
> Please use names a simple user may understand.
>
> What about memcleand or zeropaged instead?

But overwriting with zeros is commonly called "scrubbing", as in
"password scrubbing".

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-26 23:02 ` Florian Weimer
@ 2004-12-26 23:12   ` Linus Torvalds
  2004-12-26 23:24     ` Florian Weimer
                        ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Linus Torvalds @ 2004-12-26 23:12 UTC (permalink / raw)
  To: Florian Weimer
  Cc: 7eggert, Christoph Lameter, akpm, linux-ia64, linux-mm, linux-kernel

On Mon, 27 Dec 2004, Florian Weimer wrote:
>
> But overwriting with zeros is commonly called "scrubbing", as in
> "password scrubbing".

On the other hand, "memory scrubbing" in an OS sense is most often used
for reading and re-writing the same thing to fix correctable ECC failures.

Anyway, at this point I think the most interesting question is whether it
actually improves any macro-benchmark behaviour, rather than just a page
fault latency tester microbenchmark..

		Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-26 23:12 ` Linus Torvalds
@ 2004-12-26 23:24   ` Florian Weimer
  2004-12-27  1:37     ` Ingo Oeser
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2004-12-26 23:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: 7eggert, Christoph Lameter, akpm, linux-ia64, linux-mm, linux-kernel

* Linus Torvalds:

> Anyway, at this point I think the most interesting question is whether it
> actually improves any macro-benchmark behaviour, rather than just a page
> fault latency tester microbenchmark..

By the way, some crazy idea that occurred to me: What about
incrementally scrubbing a page which has been assigned previously to
this CPU, while spinning inside spinlocks (or busy-waiting somewhere
else)?

^ permalink raw reply	[flat|nested] 9+ messages in thread
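A minimal sketch of what Florian is suggesting, assuming a hypothetical
per-cpu helper scrub_one_cacheline() that zeroes the next cacheline of a
page previously assigned to this CPU (neither the helper nor this wrapper
exists in the posted patches):

	#include <linux/spinlock.h>

	/*
	 * Illustration only: instead of pure busy-waiting, zero one
	 * cacheline of a pending free page per failed acquisition
	 * attempt. scrub_one_cacheline() is an assumed helper.
	 */
	static inline void spin_lock_scrubbing(spinlock_t *lock)
	{
		while (!spin_trylock(lock))
			scrub_one_cacheline();
	}

As the follow-ups below point out, this trades spin time against cache
pollution.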
* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-26 23:24 ` Florian Weimer
@ 2004-12-27  1:37   ` Ingo Oeser
  2004-12-27  4:33     ` Zwane Mwaikambo
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Oeser @ 2004-12-27  1:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Florian Weimer, Linus Torvalds, 7eggert, Christoph Lameter, akpm,
	linux-ia64, linux-mm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Monday 27 December 2004 00:24, Florian Weimer wrote:
> By the way, some crazy idea that occurred to me: What about
> incrementally scrubbing a page which has been assigned previously to
> this CPU, while spinning inside spinlocks (or busy-waiting somewhere
> else)?

Crazy idea, indeed. Spinlocks are like safety belts: you should actually
not need them in the normal case, but they will save your butt and
you'll be glad you have them when they actually trigger.

So if you are making serious progress here, you have just uncovered
a spinlock contention problem in the kernel ;-)

Regards

Ingo Oeser

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFBz2dvU56oYWuOrkARAvc+AJ0RpaIg6JzC28B8SOXE3irCBtaTVgCg1eas
5zACIzV2CtvlNvg6Bit+/G8=
=rdE7
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-27  1:37 ` Ingo Oeser
@ 2004-12-27  4:33   ` Zwane Mwaikambo
  0 siblings, 0 replies; 9+ messages in thread
From: Zwane Mwaikambo @ 2004-12-27  4:33 UTC (permalink / raw)
  To: Ingo Oeser
  Cc: linux-kernel, Florian Weimer, Linus Torvalds, 7eggert,
	Christoph Lameter, akpm, linux-ia64, linux-mm

On Mon, 27 Dec 2004, Ingo Oeser wrote:

> On Monday 27 December 2004 00:24, Florian Weimer wrote:
> > By the way, some crazy idea that occurred to me: What about
> > incrementally scrubbing a page which has been assigned previously to
> > this CPU, while spinning inside spinlocks (or busy-waiting somewhere
> > else)?
>
> Crazy idea, indeed. Spinlocks are like safety belts: you should actually
> not need them in the normal case, but they will save your butt and
> you'll be glad you have them when they actually trigger.
>
> So if you are making serious progress here, you have just uncovered
> a spinlock contention problem in the kernel ;-)

You'd also be evicting the cache contents, thus making the lock
contention case even worse.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-26 23:12 ` Linus Torvalds
  2004-12-26 23:24   ` Florian Weimer
@ 2004-12-27  0:01   ` Chris Wedgwood
  2005-01-03 20:30   ` Christoph Lameter
  2 siblings, 0 replies; 9+ messages in thread
From: Chris Wedgwood @ 2004-12-27  0:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Florian Weimer, 7eggert, Christoph Lameter, akpm, linux-ia64,
	linux-mm, linux-kernel

On Sun, Dec 26, 2004 at 03:12:45PM -0800, Linus Torvalds wrote:

> Anyway, at this point I think the most interesting question is
> whether it actually improves any macro-benchmark behaviour, rather
> than just a page fault latency tester microbenchmark..

I can't see how it won't make things *worse* in many cases. With
hardware scrubbing it seems you will be evicting (potentially) useful
cache-lines from the CPU in many cases, and when using the CPU, if the
tuning isn't right, you're just trashing the caches anyhow.

I'd really like to see how it affects something like make -j<n> sort of
things (since gcc performance is something I personally care about more
than how well some contrived benchmark does).

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-26 23:12 ` Linus Torvalds
  2004-12-26 23:24   ` Florian Weimer
  2004-12-27  0:01   ` Chris Wedgwood
@ 2005-01-03 20:30   ` Christoph Lameter
  2 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2005-01-03 20:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Florian Weimer, 7eggert, akpm, linux-ia64, linux-mm, linux-kernel

On Sun, 26 Dec 2004, Linus Torvalds wrote:

> Anyway, at this point I think the most interesting question is whether it
> actually improves any macro-benchmark behaviour, rather than just a page
> fault latency tester microbenchmark..

Any suggestion as to what macro-benchmark would allow that kind of
testing? I tried lmbench, but it immediately writes to the complete page
that was allocated.

I tried to vary the number of cache lines touched after an allocation of
a prezeroed page. Unsurprisingly, it degenerates to regular behavior if
all cache lines are touched. So we would need a benchmark that allows
sparse memory use testing and preferably also allows SMP tests.

I will test with some of the typical apps running on Altix machines, but
those are extremely heavy in terms of memory use and will likely be as
positive as my microbenches.

BTW my bench does simulate the typical behavior of such an app using a
sparse array and allows configuring the number of cache lines per page to
touch.

^ permalink raw reply	[flat|nested] 9+ messages in thread
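The kind of sparse-access test described here can be approximated with a
few lines of user space code (a sketch under stated assumptions, not
Christoph's actual test program; the page and cacheline sizes are the
ia64 values quoted in this thread):

	#include <sys/mman.h>

	#define PAGE_SZ   16384L	/* ia64 page size */
	#define CACHELINE 128L		/* ia64 cacheline size */

	int main(void)
	{
		long pages = 1L << 16, lines = 1, p, l;	/* 1 GB total */
		char *mem = mmap(NULL, pages * PAGE_SZ,
				 PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (mem == MAP_FAILED)
			return 1;
		/* Each first write faults in a fresh page but touches
		 * only `lines` of its 128 cachelines. */
		for (p = 0; p < pages; p++)
			for (l = 0; l < lines; l++)
				mem[p * PAGE_SZ + l * CACHELINE] = 1;
		return 0;
	}

Timing the fault loop for different values of `lines` yields tables like
the CLine table in the V2 posting below.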
[parent not found: <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>]
[parent not found: <41C20E3E.3070209@yahoo.com.au>]
* Increase page fault rate by prezeroing V1 [0/3]: Overview
       [not found] ` <41C20E3E.3070209@yahoo.com.au>
@ 2004-12-21 19:55 ` Christoph Lameter
  2004-12-23 19:29   ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running on
SMP systems with a high number of cpus. Single thread performance shows
only minor increases; only the performance of multi-threaded applications
increases significantly.

The most expensive operation in the page fault handler is (apart from the
SMP locking overhead) the zeroing of the page. Others have seen this too
and have tried to provide zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

The problem so far has been that simply zeroing pages merely shifts the
time spent somewhere else. Plus one would not want to zero hot pages.
This patch addresses those issues and makes zeroing pages more effective
by:

1. Aggregating zeroing operations to mainly apply to larger order pages,
   which results in many later order-0 pages being zeroed in one go. For
   that purpose a new architecture-specific function zero_page(page, order)
   is introduced.

2. Hardware support for offloading zeroing from the cpu. This avoids the
   invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance even
for single threaded applications:

w/o patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 4   3    1     0.146s   11.155s   11.030s  69584.896  69566.852

w/patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 1   1    1     0.014s    0.110s    0.012s 524292.194 517665.538

This is a performance increase by a factor of 8!

The performance can only be upheld if enough zeroed pages are available.
In a heavily memory-intensive benchmark the system will run out of these
very fast, but the efficient algorithm for page zeroing still makes this
a winner (8 way system with 6 GB RAM, no hardware zeroing support):

w/o patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 4   3    1     0.146s   11.155s   11.030s  69584.896  69566.852
 4   3    2     0.170s   14.909s    7.097s  52150.369  98643.687
 4   3    4     0.181s   16.597s    5.079s  46869.167 135642.420
 4   3    8     0.166s   23.239s    4.037s  33599.215 179791.120

w/patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 4   3    1     0.183s    2.750s    2.093s 268077.996 267952.890
 4   3    2     0.185s    4.876s    2.097s 155344.562 263967.292
 4   3    4     0.150s    6.617s    2.097s 116205.793 264774.080
 4   3    8     0.186s   13.693s    3.054s  56659.819 221701.073

The patch is composed of 3 parts:

[1/3] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO
	flag and return zeroed memory on request. Modifies locations
	throughout the linux sources that retrieve a page and then zero
	it to request a zeroed page instead. Adds new low level
	zero_page functions for i386, ia64 and x86_64 (x86_64 untested).

[2/3] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background
	daemon called scrubd. scrubd is disabled by default but can be
	enabled by writing an order number to /proc/sys/vm/scrub_start.
	If a page of that order is coalesced then the scrub daemon will
	start zeroing until all pages of order /proc/sys/vm/scrub_stop
	and higher are zeroed.

[3/3] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into
	hardware. With hardware support there will be minimal impact of
	zeroing on the performance of the system.

^ permalink raw reply	[flat|nested] 9+ messages in thread
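For callers, the __GFP_ZERO conversion in [1/3] boils down to the
following pattern (condensed from the pgalloc hunks posted with the
patch; the wrapper names here are illustrative only):

	#include <linux/mm.h>

	/* Before: allocate, then clear by hand. */
	static pte_t *pte_alloc_old(void)
	{
		pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);

		if (pte)
			clear_page(pte);
		return pte;
	}

	/* After: the allocator zeroes the page, or hands out one that
	 * was already zeroed in the background. */
	static pte_t *pte_alloc_new(void)
	{
		return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
	}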
* Prezeroing V2 [0/3]: Why and When it works
  2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
@ 2004-12-23 19:29   ` Christoph Lameter
  2004-12-23 19:33     ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:29 UTC (permalink / raw)
  Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Changes from V1 to V2:
o Add explanation--and some bench results--as to why and when this
  optimization works and why other approaches have not worked
o Instead of zero_page(p, order), extend clear_page to take a second
  argument
o Update all architectures to accept the second argument for clear_page
o Extensive removal of page alloc/clear_page combinations from all archs
o Blank / typo fixups
o SGI BTE zero driver update: use node specific variables instead of cpu
  specific ones since a cpu may be responsible for multiple nodes

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running on
SMP systems with a high number of cpus. Single thread performance shows
only minor increases; only the performance of multi-threaded applications
increases significantly.

The most expensive operation in the page fault handler is (apart from the
SMP locking overhead) the zeroing of the page. This zeroing means that all
cachelines of the faulted page (on Altix that means all 128 cachelines of
128 bytes each) must be loaded and later written back. This patch makes it
possible to avoid loading all those cachelines if only a part of the
cachelines of the page is needed immediately after the fault. The patch is
thus only effective for sparsely accessed memory, which is typical for
anonymous memory and pte maps. Prezeroed pages will be used for those
purposes. Unzeroed pages will be used as usual for the other purposes.

Others have also thought that prezeroing could be a benefit and have
tried to provide zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

However, these attempts tried to zero pages that were soon to be accessed
(and which may already have been accessed recently). Elements of these
pages are thus already in the cache. Approaches like that only shift
processing around a bit and yield no performance benefit. Prezeroing only
makes sense for pages that are not currently needed and that are not in
the cpu caches. Pages that have recently been touched and that will soon
be touched again are better zeroed hot, since the zeroing is then largely
done to cachelines already in the cpu caches.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher order,
   which results in many pages that will later become order 0 being zeroed
   in one go. For that purpose the existing clear_page function is extended
   and made to take an additional argument specifying the order of the page
   to be cleared.

2. Hardware support for offloading zeroing from the cpu. This avoids the
   invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance even
for single threaded applications:

w/o patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 4   3    1     0.146s   11.155s   11.030s  69584.896  69566.852

w/patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 1   1    1     0.014s    0.110s    0.012s 524292.194 517665.538

The performance can only be upheld if enough zeroed pages are available.
In heavily memory-intensive benchmarks the system could potentially run
out of zeroed pages, but the efficient algorithm for page zeroing still
shows this to be a winner (8 way system with 6 GB RAM, no hardware
zeroing support):

w/o patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 4   3    1     0.146s   11.155s   11.030s  69584.896  69566.852
 4   3    2     0.170s   14.909s    7.097s  52150.369  98643.687
 4   3    4     0.181s   16.597s    5.079s  46869.167 135642.420
 4   3    8     0.166s   23.239s    4.037s  33599.215 179791.120

w/patch:
Gb Rep Threads   User      System    Wall   flt/cpu/s fault/wsec
 4   3    1     0.183s    2.750s    2.093s 268077.996 267952.890
 4   3    2     0.185s    4.876s    2.097s 155344.562 263967.292
 4   3    4     0.150s    6.617s    2.097s 116205.793 264774.080
 4   3    8     0.186s   13.693s    3.054s  56659.819 221701.073

Note that zeroing pages makes no sense if the application touches all
cache lines of an allocated page (prezeroing has no influence on
benchmarks like lmbench for that reason): the extensive caching of modern
cpus means that the zeroes written to a hot zeroed page are overwritten by
the application while still in the cpu cache, so the zeros never make it
to memory! The test program used above only touches one 128 byte cache
line of a 16k page (ia64). Here is another test to gauge the influence of
the number of cache lines touched on the performance of the prezero
enhancements:

Gb Rep Thr CLine  User     System    Wall   flt/cpu/s fault/wsec
 1   1   1    1   0.01s    0.12s    0.01s  500813.853 497925.891
 1   1   1    2   0.01s    0.11s    0.01s  493453.103 472877.725
 1   1   1    4   0.02s    0.10s    0.01s  479351.658 471507.415
 1   1   1    8   0.01s    0.13s    0.01s  424742.054 416725.013
 1   1   1   16   0.05s    0.12s    0.01s  347715.359 336983.834
 1   1   1   32   0.12s    0.13s    0.02s  258112.286 256246.731
 1   1   1   64   0.24s    0.14s    0.03s  169896.381 168189.283
 1   1   1  128   0.49s    0.14s    0.06s  102300.257 101674.435

The benefits of prezeroing become smaller the more cache lines of a page
are touched. Prezeroing can only be effective if memory is not touched
immediately after the anonymous page fault.

The patch is composed of 4 parts:

[1/4] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO
	flag and return zeroed memory on request. Modifies locations
	throughout the linux sources that retrieve a page and then zero
	it to request a zeroed page instead.

[2/4] Architecture specific clear_page updates
	Adds a second (order) argument to clear_page and updates all
	arches.

Note: the first two patches may be used on their own if no zeroing
engine is wanted.

[3/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a background
	daemon called scrubd. scrubd is disabled by default but can be
	enabled by writing an order number to /proc/sys/vm/scrub_start.
	If a page of that order or higher is coalesced then the scrub
	daemon will start zeroing until all pages of order
	/proc/sys/vm/scrub_stop and higher are zeroed, and then go back
	to sleep. In an SMP environment the scrub daemon typically runs
	on the most idle cpu. Thus a single threaded application running
	on one cpu may have another cpu zeroing pages for it. The scrub
	daemon is hardly noticeable and usually finishes zeroing quickly,
	since most processors are optimized for linear memory filling.

[4/4] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into
	hardware. With hardware support there will be minimal impact of
	zeroing on the performance of the system.

^ permalink raw reply	[flat|nested] 9+ messages in thread
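The scrub_start/scrub_stop policy described under [3/4] condenses to
roughly the following loop (a simplification; the real code is
zero_highest_order_page() in the patch further down, and
highest_unzeroed_order()/zero_one_page() are assumed helpers, not actual
kernel functions):

	#include <linux/mmzone.h>

	/* Runs after a coalesced page of order >= scrub_start wakes the
	 * daemon; stops once no page of order >= scrub_stop remains
	 * unzeroed in the zone. highest_unzeroed_order() is assumed to
	 * return -1 when nothing is left to do. */
	static void scrub_zone(struct zone *zone)
	{
		int order;

		while ((order = highest_unzeroed_order(zone)) >=
		       (int)sysctl_scrub_stop)
			zero_one_page(zone, order);
	}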
* Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal 2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter @ 2004-12-23 19:33 ` Christoph Lameter 2004-12-23 19:34 ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter 0 siblings, 1 reply; 9+ messages in thread From: Christoph Lameter @ 2004-12-23 19:33 UTC (permalink / raw) To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel This patch introduces __GFP_ZERO as an additional gfp_mask element to allow to request zeroed pages from the page allocator. o Modifies the page allocator so that it zeroes memory if __GFP_ZERO is set o Replace all page zeroing after allocating pages by request for zeroed pages. o requires arch updates to clear_page in order to function properly. Signed-off-by: Christoph Lameter <clameter@sgi.com> Index: linux-2.6.9/mm/page_alloc.c =================================================================== --- linux-2.6.9.orig/mm/page_alloc.c 2004-12-22 16:48:20.000000000 -0800 +++ linux-2.6.9/mm/page_alloc.c 2004-12-22 17:23:43.000000000 -0800 @@ -575,6 +575,18 @@ BUG_ON(bad_range(zone, page)); mod_page_state_zone(zone, pgalloc, 1 << order); prep_new_page(page, order); + + if (gfp_flags & __GFP_ZERO) { +#ifdef CONFIG_HIGHMEM + if (PageHighMem(page)) { + int n = 1 << order; + + while (n-- >0) + clear_highpage(page + n); + } else +#endif + clear_page(page_address(page), order); + } if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); } @@ -767,12 +779,9 @@ */ BUG_ON(gfp_mask & __GFP_HIGHMEM); - page = alloc_pages(gfp_mask, 0); - if (page) { - void *address = page_address(page); - clear_page(address); - return (unsigned long) address; - } + page = alloc_pages(gfp_mask | __GFP_ZERO, 0); + if (page) + return (unsigned long) page_address(page); return 0; } Index: linux-2.6.9/include/linux/gfp.h =================================================================== --- linux-2.6.9.orig/include/linux/gfp.h 2004-10-18 14:53:44.000000000 -0700 +++ linux-2.6.9/include/linux/gfp.h 2004-12-22 17:23:43.000000000 -0800 @@ -37,6 +37,7 @@ #define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */ #define __GFP_NO_GROW 0x2000 /* Slab internal usage */ #define __GFP_COMP 0x4000 /* Add compound page metadata */ +#define __GFP_ZERO 0x8000 /* Return zeroed page on success */ #define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */ #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1) @@ -52,6 +53,7 @@ #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS) #define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS) #define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM) +#define GFP_HIGHZERO (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM | __GFP_ZERO) /* Flag - indicates that the buffer will be suitable for DMA. 
Ignored on some platforms, used as appropriate on others */ Index: linux-2.6.9/mm/memory.c =================================================================== --- linux-2.6.9.orig/mm/memory.c 2004-12-22 16:48:20.000000000 -0800 +++ linux-2.6.9/mm/memory.c 2004-12-22 17:23:43.000000000 -0800 @@ -1445,10 +1445,9 @@ if (unlikely(anon_vma_prepare(vma))) goto no_mem; - page = alloc_page_vma(GFP_HIGHUSER, vma, addr); + page = alloc_page_vma(GFP_HIGHZERO, vma, addr); if (!page) goto no_mem; - clear_user_highpage(page, addr); spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); Index: linux-2.6.9/kernel/profile.c =================================================================== --- linux-2.6.9.orig/kernel/profile.c 2004-12-22 16:48:20.000000000 -0800 +++ linux-2.6.9/kernel/profile.c 2004-12-22 17:23:43.000000000 -0800 @@ -326,17 +326,15 @@ node = cpu_to_node(cpu); per_cpu(cpu_profile_flip, cpu) = 0; if (!per_cpu(cpu_profile_hits, cpu)[1]) { - page = alloc_pages_node(node, GFP_KERNEL, 0); + page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0); if (!page) return NOTIFY_BAD; - clear_highpage(page); per_cpu(cpu_profile_hits, cpu)[1] = page_address(page); } if (!per_cpu(cpu_profile_hits, cpu)[0]) { - page = alloc_pages_node(node, GFP_KERNEL, 0); + page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0); if (!page) goto out_free; - clear_highpage(page); per_cpu(cpu_profile_hits, cpu)[0] = page_address(page); } break; @@ -510,16 +508,14 @@ int node = cpu_to_node(cpu); struct page *page; - page = alloc_pages_node(node, GFP_KERNEL, 0); + page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0); if (!page) goto out_cleanup; - clear_highpage(page); per_cpu(cpu_profile_hits, cpu)[1] = (struct profile_hit *)page_address(page); - page = alloc_pages_node(node, GFP_KERNEL, 0); + page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0); if (!page) goto out_cleanup; - clear_highpage(page); per_cpu(cpu_profile_hits, cpu)[0] = (struct profile_hit *)page_address(page); } Index: linux-2.6.9/mm/shmem.c =================================================================== --- linux-2.6.9.orig/mm/shmem.c 2004-12-22 16:48:20.000000000 -0800 +++ linux-2.6.9/mm/shmem.c 2004-12-22 17:23:43.000000000 -0800 @@ -369,9 +369,8 @@ } spin_unlock(&info->lock); - page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping)); + page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO); if (page) { - clear_highpage(page); page->nr_swapped = 0; } spin_lock(&info->lock); @@ -910,7 +909,7 @@ pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx); pvma.vm_pgoff = idx; pvma.vm_end = PAGE_SIZE; - page = alloc_page_vma(gfp, &pvma, 0); + page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0); mpol_free(pvma.vm_policy); return page; } @@ -926,7 +925,7 @@ shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info, unsigned long idx) { - return alloc_page(gfp); + return alloc_page(gfp | __GFP_ZERO); } #endif @@ -1135,7 +1134,6 @@ info->alloced++; spin_unlock(&info->lock); - clear_highpage(filepage); flush_dcache_page(filepage); SetPageUptodate(filepage); } Index: linux-2.6.9/mm/hugetlb.c =================================================================== --- linux-2.6.9.orig/mm/hugetlb.c 2004-10-18 14:54:37.000000000 -0700 +++ linux-2.6.9/mm/hugetlb.c 2004-12-22 17:23:43.000000000 -0800 @@ -77,7 +77,6 @@ struct page *alloc_huge_page(void) { struct page *page; - int i; spin_lock(&hugetlb_lock); page = dequeue_huge_page(); @@ -88,8 +87,7 @@ spin_unlock(&hugetlb_lock); set_page_count(page, 1); 
page[1].mapping = (void *)free_huge_page; - for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i) - clear_highpage(&page[i]); + clear_page(page_address(page), HUGETLB_PAGE_ORDER); return page; } Index: linux-2.6.9/include/asm-ia64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-ia64/pgalloc.h 2004-10-18 14:53:06.000000000 -0700 +++ linux-2.6.9/include/asm-ia64/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -61,9 +61,7 @@ pgd_t *pgd = pgd_alloc_one_fast(mm); if (unlikely(pgd == NULL)) { - pgd = (pgd_t *)__get_free_page(GFP_KERNEL); - if (likely(pgd != NULL)) - clear_page(pgd); + pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO); } return pgd; } @@ -107,10 +105,8 @@ static inline pmd_t* pmd_alloc_one (struct mm_struct *mm, unsigned long addr) { - pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); + pmd_t *pmd = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); - if (likely(pmd != NULL)) - clear_page(pmd); return pmd; } @@ -141,20 +137,16 @@ static inline struct page * pte_alloc_one (struct mm_struct *mm, unsigned long addr) { - struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); + struct page *pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); - if (likely(pte != NULL)) - clear_page(page_address(pte)); return pte; } static inline pte_t * pte_alloc_one_kernel (struct mm_struct *mm, unsigned long addr) { - pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); + pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); - if (likely(pte != NULL)) - clear_page(pte); return pte; } Index: linux-2.6.9/arch/i386/mm/pgtable.c =================================================================== --- linux-2.6.9.orig/arch/i386/mm/pgtable.c 2004-12-22 16:48:14.000000000 -0800 +++ linux-2.6.9/arch/i386/mm/pgtable.c 2004-12-22 17:23:43.000000000 -0800 @@ -132,10 +132,7 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (pte) - clear_page(pte); - return pte; + return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) @@ -143,12 +140,10 @@ struct page *pte; #ifdef CONFIG_HIGHPTE - pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT, 0); + pte = alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0); #else - pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); + pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); #endif - if (pte) - clear_highpage(pte); return pte; } Index: linux-2.6.9/drivers/block/pktcdvd.c =================================================================== --- linux-2.6.9.orig/drivers/block/pktcdvd.c 2004-12-22 16:48:15.000000000 -0800 +++ linux-2.6.9/drivers/block/pktcdvd.c 2004-12-22 17:23:43.000000000 -0800 @@ -125,22 +125,19 @@ int i; struct packet_data *pkt; - pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL); + pkt = kmalloc(sizeof(struct packet_data), GFP_KERNEL|__GFP_ZERO); if (!pkt) goto no_pkt; - memset(pkt, 0, sizeof(struct packet_data)); pkt->w_bio = pkt_bio_alloc(PACKET_MAX_SIZE); if (!pkt->w_bio) goto no_bio; for (i = 0; i < PAGES_PER_PACKET; i++) { - pkt->pages[i] = alloc_page(GFP_KERNEL); + pkt->pages[i] = alloc_page(GFP_KERNEL|__GFP_ZERO); if (!pkt->pages[i]) goto no_page; } - for (i = 0; i < PAGES_PER_PACKET; i++) - clear_page(page_address(pkt->pages[i])); spin_lock_init(&pkt->lock); Index: linux-2.6.9/arch/m68k/mm/motorola.c 
=================================================================== --- linux-2.6.9.orig/arch/m68k/mm/motorola.c 2004-12-22 16:48:14.000000000 -0800 +++ linux-2.6.9/arch/m68k/mm/motorola.c 2004-12-22 17:23:43.000000000 -0800 @@ -1,4 +1,4 @@ -/* +* * linux/arch/m68k/motorola.c * * Routines specific to the Motorola MMU, originally from: @@ -50,7 +50,7 @@ ptablep = (pte_t *)alloc_bootmem_low_pages(PAGE_SIZE); - clear_page(ptablep); + clear_page(ptablep, 0); __flush_page_to_ram(ptablep); flush_tlb_kernel_page(ptablep); nocache_page(ptablep); @@ -90,7 +90,7 @@ if (((unsigned long)last_pgtable & ~PAGE_MASK) == 0) { last_pgtable = (pmd_t *)alloc_bootmem_low_pages(PAGE_SIZE); - clear_page(last_pgtable); + clear_page(last_pgtable, 0); __flush_page_to_ram(last_pgtable); flush_tlb_kernel_page(last_pgtable); nocache_page(last_pgtable); Index: linux-2.6.9/include/asm-mips/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-mips/pgalloc.h 2004-10-18 14:54:30.000000000 -0700 +++ linux-2.6.9/include/asm-mips/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -56,9 +56,7 @@ { pte_t *pte; - pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PTE_ORDER); - if (pte) - clear_page(pte); + pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, PTE_ORDER); return pte; } Index: linux-2.6.9/arch/alpha/mm/init.c =================================================================== --- linux-2.6.9.orig/arch/alpha/mm/init.c 2004-10-18 14:55:07.000000000 -0700 +++ linux-2.6.9/arch/alpha/mm/init.c 2004-12-22 17:23:43.000000000 -0800 @@ -42,10 +42,9 @@ { pgd_t *ret, *init; - ret = (pgd_t *)__get_free_page(GFP_KERNEL); + ret = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); init = pgd_offset(&init_mm, 0UL); if (ret) { - clear_page(ret); #ifdef CONFIG_ALPHA_LARGE_VMALLOC memcpy (ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD, (PTRS_PER_PGD - USER_PTRS_PER_PGD - 1)*sizeof(pgd_t)); @@ -63,9 +62,7 @@ pte_t * pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (pte) - clear_page(pte); + pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); return pte; } Index: linux-2.6.9/include/asm-parisc/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-parisc/pgalloc.h 2004-10-18 14:55:28.000000000 -0700 +++ linux-2.6.9/include/asm-parisc/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -120,18 +120,14 @@ static inline struct page * pte_alloc_one(struct mm_struct *mm, unsigned long address) { - struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT); - if (likely(page != NULL)) - clear_page(page_address(page)); + struct page *page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); return page; } static inline pte_t * pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr) { - pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (likely(pte != NULL)) - clear_page(pte); + pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); return pte; } Index: linux-2.6.9/arch/sh/mm/pg-sh4.c =================================================================== --- linux-2.6.9.orig/arch/sh/mm/pg-sh4.c 2004-10-18 14:53:46.000000000 -0700 +++ linux-2.6.9/arch/sh/mm/pg-sh4.c 2004-12-22 17:23:43.000000000 -0800 @@ -34,7 +34,7 @@ { __set_bit(PG_mapped, &page->flags); if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0) - clear_page(to); + clear_page(to, 0); else { pgprot_t 
pgprot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_CACHABLE | Index: linux-2.6.9/include/asm-sparc64/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-sparc64/pgalloc.h 2004-10-18 14:55:28.000000000 -0700 +++ linux-2.6.9/include/asm-sparc64/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -73,10 +73,9 @@ struct page *page; preempt_enable(); - page = alloc_page(GFP_KERNEL|__GFP_REPEAT); + page = alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); if (page) { ret = (struct page *)page_address(page); - clear_page(ret); page->lru.prev = (void *) 2UL; preempt_disable(); Index: linux-2.6.9/include/asm-sh/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-sh/pgalloc.h 2004-10-18 14:54:08.000000000 -0700 +++ linux-2.6.9/include/asm-sh/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -44,9 +44,7 @@ { pte_t *pte; - pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT); - if (pte) - clear_page(pte); + pte = (pte_t *) __get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO); return pte; } @@ -56,9 +54,7 @@ { struct page *pte; - pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); - if (pte) - clear_page(page_address(pte)); + pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); return pte; } Index: linux-2.6.9/include/asm-m32r/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-m32r/pgalloc.h 2004-10-18 14:55:07.000000000 -0700 +++ linux-2.6.9/include/asm-m32r/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -23,10 +23,7 @@ */ static __inline__ pgd_t *pgd_alloc(struct mm_struct *mm) { - pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL); - - if (pgd) - clear_page(pgd); + pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO); return pgd; } @@ -39,10 +36,7 @@ static __inline__ pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL); - - if (pte) - clear_page(pte); + pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO); return pte; } @@ -50,10 +44,8 @@ static __inline__ struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) { - struct page *pte = alloc_page(GFP_KERNEL); + struct page *pte = alloc_page(GFP_KERNEL|__GFP_ZERO); - if (pte) - clear_page(page_address(pte)); return pte; } Index: linux-2.6.9/arch/um/kernel/mem.c =================================================================== --- linux-2.6.9.orig/arch/um/kernel/mem.c 2004-10-18 14:53:51.000000000 -0700 +++ linux-2.6.9/arch/um/kernel/mem.c 2004-12-22 17:23:43.000000000 -0800 @@ -307,9 +307,7 @@ { pte_t *pte; - pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (pte) - clear_page(pte); + pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); return pte; } @@ -317,9 +315,7 @@ { struct page *pte; - pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); - if (pte) - clear_highpage(pte); + pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); return pte; } Index: linux-2.6.9/arch/ppc64/mm/init.c =================================================================== --- linux-2.6.9.orig/arch/ppc64/mm/init.c 2004-12-22 16:48:14.000000000 -0800 +++ linux-2.6.9/arch/ppc64/mm/init.c 2004-12-22 17:23:43.000000000 -0800 @@ -761,7 +761,7 @@ void clear_user_page(void *page, unsigned long vaddr, struct page *pg) { - clear_page(page); + clear_page(page, 0); if (cur_cpu_spec->cpu_features & CPU_FTR_COHERENT_ICACHE) return; Index: linux-2.6.9/include/asm-sh64/pgalloc.h 
=================================================================== --- linux-2.6.9.orig/include/asm-sh64/pgalloc.h 2004-10-18 14:53:21.000000000 -0700 +++ linux-2.6.9/include/asm-sh64/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -112,9 +112,7 @@ { pte_t *pte; - pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT); - if (pte) - clear_page(pte); + pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT|__GFP_ZERO); return pte; } @@ -123,9 +121,7 @@ { struct page *pte; - pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); - if (pte) - clear_page(page_address(pte)); + pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); return pte; } @@ -150,9 +146,7 @@ static __inline__ pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) { pmd_t *pmd; - pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (pmd) - clear_page(pmd); + pmd = (pmd_t *) __get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); return pmd; } Index: linux-2.6.9/include/asm-cris/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-cris/pgalloc.h 2004-10-18 14:55:06.000000000 -0700 +++ linux-2.6.9/include/asm-cris/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -24,18 +24,14 @@ extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (pte) - clear_page(pte); + pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); return pte; } extern inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) { struct page *pte; - pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); - if (pte) - clear_page(page_address(pte)); + pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); return pte; } Index: linux-2.6.9/arch/ppc/mm/pgtable.c =================================================================== --- linux-2.6.9.orig/arch/ppc/mm/pgtable.c 2004-12-22 16:48:14.000000000 -0800 +++ linux-2.6.9/arch/ppc/mm/pgtable.c 2004-12-22 17:23:43.000000000 -0800 @@ -85,8 +85,7 @@ { pgd_t *ret; - if ((ret = (pgd_t *)__get_free_pages(GFP_KERNEL, PGDIR_ORDER)) != NULL) - clear_pages(ret, PGDIR_ORDER); + ret = (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, PGDIR_ORDER); return ret; } @@ -102,7 +101,7 @@ extern void *early_get_page(void); if (mem_init_done) { - pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); + pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); if (pte) { struct page *ptepage = virt_to_page(pte); ptepage->mapping = (void *) mm; @@ -110,8 +109,6 @@ } } else pte = (pte_t *)early_get_page(); - if (pte) - clear_page(pte); return pte; } Index: linux-2.6.9/arch/ppc/mm/init.c =================================================================== --- linux-2.6.9.orig/arch/ppc/mm/init.c 2004-10-18 14:53:43.000000000 -0700 +++ linux-2.6.9/arch/ppc/mm/init.c 2004-12-22 17:23:43.000000000 -0800 @@ -595,7 +595,7 @@ } void clear_user_page(void *page, unsigned long vaddr, struct page *pg) { - clear_page(page); + clear_page(page, 0); clear_bit(PG_arch_1, &pg->flags); } Index: linux-2.6.9/fs/afs/file.c =================================================================== --- linux-2.6.9.orig/fs/afs/file.c 2004-10-18 14:55:36.000000000 -0700 +++ linux-2.6.9/fs/afs/file.c 2004-12-22 17:23:43.000000000 -0800 @@ -172,7 +172,7 @@ (size_t) PAGE_SIZE); desc.buffer = kmap(page); - clear_page(desc.buffer); + clear_page(desc.buffer, 0); /* read the contents of the file from the server into the * page */ Index: 
linux-2.6.9/include/asm-alpha/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-alpha/pgalloc.h 2004-10-18 14:53:06.000000000 -0700 +++ linux-2.6.9/include/asm-alpha/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -40,9 +40,7 @@ static inline pmd_t * pmd_alloc_one(struct mm_struct *mm, unsigned long address) { - pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); - if (ret) - clear_page(ret); + pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); return ret; } Index: linux-2.6.9/include/linux/highmem.h =================================================================== --- linux-2.6.9.orig/include/linux/highmem.h 2004-10-18 14:54:54.000000000 -0700 +++ linux-2.6.9/include/linux/highmem.h 2004-12-22 17:23:43.000000000 -0800 @@ -47,7 +47,7 @@ static inline void clear_highpage(struct page *page) { void *kaddr = kmap_atomic(page, KM_USER0); - clear_page(kaddr); + clear_page(kaddr, 0); kunmap_atomic(kaddr, KM_USER0); } Index: linux-2.6.9/arch/sh64/mm/ioremap.c =================================================================== --- linux-2.6.9.orig/arch/sh64/mm/ioremap.c 2004-10-18 14:54:32.000000000 -0700 +++ linux-2.6.9/arch/sh64/mm/ioremap.c 2004-12-22 17:23:43.000000000 -0800 @@ -399,7 +399,7 @@ if (pte_none(*ptep) || !pte_present(*ptep)) return; - clear_page((void *)ptep); + clear_page((void *)ptep, 0); pte_clear(ptep); } Index: linux-2.6.9/include/asm-m68k/motorola_pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-m68k/motorola_pgalloc.h 2004-10-18 14:55:36.000000000 -0700 +++ linux-2.6.9/include/asm-m68k/motorola_pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -12,9 +12,8 @@ { pte_t *pte; - pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); + pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); if (pte) { - clear_page(pte); __flush_page_to_ram(pte); flush_tlb_kernel_page(pte); nocache_page(pte); @@ -31,7 +30,7 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) { - struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); + struct page *page = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); pte_t *pte; if(!page) @@ -39,7 +38,6 @@ pte = kmap(page); if (pte) { - clear_page(pte); __flush_page_to_ram(pte); flush_tlb_kernel_page(pte); nocache_page(pte); Index: linux-2.6.9/arch/sh/mm/pg-sh7705.c =================================================================== --- linux-2.6.9.orig/arch/sh/mm/pg-sh7705.c 2004-12-22 16:48:15.000000000 -0800 +++ linux-2.6.9/arch/sh/mm/pg-sh7705.c 2004-12-22 17:23:43.000000000 -0800 @@ -78,13 +78,13 @@ __set_bit(PG_mapped, &page->flags); if (((address ^ (unsigned long)to) & CACHE_ALIAS) == 0) { - clear_page(to); + clear_page(to, 0); __flush_wback_region(to, PAGE_SIZE); } else { __flush_purge_virtual_region(to, (void *)(address & 0xfffff000), PAGE_SIZE); - clear_page(to); + clear_page(to, 0); __flush_wback_region(to, PAGE_SIZE); } } Index: linux-2.6.9/arch/sparc64/mm/init.c =================================================================== --- linux-2.6.9.orig/arch/sparc64/mm/init.c 2004-12-22 16:48:15.000000000 -0800 +++ linux-2.6.9/arch/sparc64/mm/init.c 2004-12-22 17:23:43.000000000 -0800 @@ -1687,13 +1687,12 @@ * Set up the zero page, mark it reserved, so that page count * is not manipulated when freeing the page from user ptes. 
*/ - mem_map_zero = alloc_pages(GFP_KERNEL, 0); + mem_map_zero = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0); if (mem_map_zero == NULL) { prom_printf("paging_init: Cannot alloc zero page.\n"); prom_halt(); } SetPageReserved(mem_map_zero); - clear_page(page_address(mem_map_zero)); codepages = (((unsigned long) _etext) - ((unsigned long) _start)); codepages = PAGE_ALIGN(codepages) >> PAGE_SHIFT; Index: linux-2.6.9/include/asm-arm/pgalloc.h =================================================================== --- linux-2.6.9.orig/include/asm-arm/pgalloc.h 2004-10-18 14:55:27.000000000 -0700 +++ linux-2.6.9/include/asm-arm/pgalloc.h 2004-12-22 17:23:43.000000000 -0800 @@ -50,9 +50,8 @@ { pte_t *pte; - pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT); + pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); if (pte) { - clear_page(pte); clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE); pte += PTRS_PER_PTE; } @@ -65,10 +64,9 @@ { struct page *pte; - pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT, 0); + pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0); if (pte) { void *page = page_address(pte); - clear_page(page); clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE); } ^ permalink raw reply [flat|nested] 9+ messages in thread
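This patch already relies on the two-argument clear_page() introduced by
[2/4] (the architecture updates are not reproduced in this excerpt). A
generic, unoptimized equivalent of what those updates provide would be
the following (shown under a different name to avoid confusion with the
real per-arch definitions):

	#include <linux/string.h>
	#include <asm/page.h>

	/* Sketch of the extended interface: clear 2^order contiguous
	 * pages starting at addr. Real architectures use tuned fill
	 * loops or cache-bypassing stores instead of a plain memset. */
	static inline void clear_pages_order(void *addr, unsigned int order)
	{
		memset(addr, 0, PAGE_SIZE << order);
	}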
* Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps
  2004-12-23 19:33 ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
@ 2004-12-23 19:34   ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:34 UTC (permalink / raw)
  To: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

o Add page zeroing
o Add scrub daemon
o Add ability to view the amount of zeroed memory in /proc/meminfo

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/mm/page_alloc.c =================================================================== --- linux-2.6.9.orig/mm/page_alloc.c 2004-12-22 13:31:02.000000000 -0800 +++ linux-2.6.9/mm/page_alloc.c 2004-12-22 14:24:56.000000000 -0800 @@ -12,6 +12,7 @@ * Zone balancing, Kanoj Sarcar, SGI, Jan 2000 * Per cpu hot/cold page lists, bulk allocation, Martin J. Bligh, Sept 2002 * (lots of bits borrowed from Ingo Molnar & Andrew Morton) + * Support for page zeroing, Christoph Lameter, SGI, Dec 2004 */ #include <linux/config.h> @@ -32,6 +33,7 @@ #include <linux/sysctl.h> #include <linux/cpu.h> #include <linux/nodemask.h> +#include <linux/scrub.h> #include <asm/tlbflush.h> @@ -179,7 +181,7 @@ * -- wli */ -static inline void __free_pages_bulk (struct page *page, struct page *base, +static inline int __free_pages_bulk (struct page *page, struct page *base, struct zone *zone, struct free_area *area, unsigned int order) { unsigned long page_idx, index, mask; @@ -192,11 +194,10 @@ BUG(); index = page_idx >> (1 + order); - zone->free_pages += 1 << order; while (order < MAX_ORDER-1) { struct page *buddy1, *buddy2; - BUG_ON(area >= zone->free_area + MAX_ORDER); + BUG_ON(area >= zone->free_area[ZEROED] + MAX_ORDER); if (!__test_and_change_bit(index, area->map)) /* * the buddy page is still allocated. @@ -216,6 +217,7 @@ page_idx &= mask; } list_add(&(base + page_idx)->lru, &area->free_list); + return order; } static inline void free_pages_check(const char *function, struct page *page) @@ -258,7 +260,7 @@ int ret = 0; base = zone->zone_mem_map; - area = zone->free_area + order; + area = zone->free_area[NOT_ZEROED] + order; spin_lock_irqsave(&zone->lock, flags); zone->all_unreclaimable = 0; zone->pages_scanned = 0; @@ -266,7 +268,10 @@ page = list_entry(list->prev, struct page, lru); /* have to delete it as __free_pages_bulk list manipulates */ list_del(&page->lru); - __free_pages_bulk(page, base, zone, area, order); + zone->free_pages += 1 << order; + if (__free_pages_bulk(page, base, zone, area, order) + >= sysctl_scrub_start) + wakeup_kscrubd(zone); ret++; } spin_unlock_irqrestore(&zone->lock, flags); @@ -288,6 +293,21 @@ free_pages_bulk(page_zone(page), 1, &list, order); } +void end_zero_page(struct page *page) +{ + unsigned long flags; + int order = page->index; + struct zone * zone = page_zone(page); + + spin_lock_irqsave(&zone->lock, flags); + + zone->zero_pages += 1 << order; + __free_pages_bulk(page, zone->zone_mem_map, zone, zone->free_area[ZEROED] + order, order); + + spin_unlock_irqrestore(&zone->lock, flags); +} + + #define MARK_USED(index, order, area) \ __change_bit((index) >> (1+(order)), (area)->map) @@ -366,25 +386,46 @@ * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held.
*/ -static struct page *__rmqueue(struct zone *zone, unsigned int order) +static void inline rmpage(struct page *page, struct zone *zone, struct free_area *area, int order) +{ + list_del(&page->lru); + if (order != MAX_ORDER-1) + MARK_USED(page - zone->zone_mem_map, order, area); +} + +struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order) +{ + unsigned long flags; + struct page *page = NULL; + + spin_lock_irqsave(&zone->lock, flags); + + if (!list_empty(&area->free_list)) { + page = list_entry(area->free_list.next, struct page, lru); + + rmpage(page, zone, area, order); + } + spin_unlock_irqrestore(&zone->lock, flags); + return page; +} + +static struct page *__rmqueue(struct zone *zone, unsigned int order, int zero) { struct free_area * area; unsigned int current_order; struct page *page; - unsigned int index; for (current_order = order; current_order < MAX_ORDER; ++current_order) { - area = zone->free_area + current_order; + area = zone->free_area[zero] + current_order; if (list_empty(&area->free_list)) continue; page = list_entry(area->free_list.next, struct page, lru); - list_del(&page->lru); - index = page - zone->zone_mem_map; - if (current_order != MAX_ORDER-1) - MARK_USED(index, current_order, area); + rmpage(page, zone, area, current_order); zone->free_pages -= 1UL << order; - return expand(zone, page, index, order, current_order, area); + if (zero) + zone->zero_pages -= 1UL << order; + return expand(zone, page, page - zone->zone_mem_map, order, current_order, area); } return NULL; @@ -396,7 +437,7 @@ * Returns the number of new pages which were placed at *list. */ static int rmqueue_bulk(struct zone *zone, unsigned int order, - unsigned long count, struct list_head *list) + unsigned long count, struct list_head *list, int zero) { unsigned long flags; int i; @@ -405,7 +446,7 @@ spin_lock_irqsave(&zone->lock, flags); for (i = 0; i < count; ++i) { - page = __rmqueue(zone, order); + page = __rmqueue(zone, order, zero); if (page == NULL) break; allocated++; @@ -546,7 +587,9 @@ { unsigned long flags; struct page *page = NULL; - int cold = !!(gfp_flags & __GFP_COLD); + int nr_pages = 1 << order; + int zero = !!((gfp_flags & __GFP_ZERO) && zone->zero_pages >= nr_pages); + int cold = !!(gfp_flags & __GFP_COLD) + 2*zero; if (order == 0) { struct per_cpu_pages *pcp; @@ -555,7 +598,7 @@ local_irq_save(flags); if (pcp->count <= pcp->low) pcp->count += rmqueue_bulk(zone, 0, - pcp->batch, &pcp->list); + pcp->batch, &pcp->list, zero); if (pcp->count) { page = list_entry(pcp->list.next, struct page, lru); list_del(&page->lru); @@ -567,19 +610,30 @@ if (page == NULL) { spin_lock_irqsave(&zone->lock, flags); - page = __rmqueue(zone, order); + + page = __rmqueue(zone, order, zero); + + /* + * If we failed to obtain a zero and/or unzeroed page + * then we may still be able to obtain the other + * type of page. 
+ */ + if (!page) { + page = __rmqueue(zone, order, !zero); + zero = 0; + } + spin_unlock_irqrestore(&zone->lock, flags); } if (page != NULL) { BUG_ON(bad_range(zone, page)); - mod_page_state_zone(zone, pgalloc, 1 << order); - prep_new_page(page, order); + mod_page_state_zone(zone, pgalloc, nr_pages); - if (gfp_flags & __GFP_ZERO) { + if ((gfp_flags & __GFP_ZERO) && !zero) { #ifdef CONFIG_HIGHMEM if (PageHighMem(page)) { - int n = 1 << order; + int n = nr_pages; while (n-- >0) clear_highpage(page + n); @@ -587,6 +641,7 @@ #endif clear_page(page_address(page), order); } + prep_new_page(page, order); if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); } @@ -974,7 +1029,7 @@ } void __get_zone_counts(unsigned long *active, unsigned long *inactive, - unsigned long *free, struct pglist_data *pgdat) + unsigned long *free, unsigned long *zero, struct pglist_data *pgdat) { struct zone *zones = pgdat->node_zones; int i; @@ -982,27 +1037,31 @@ *active = 0; *inactive = 0; *free = 0; + *zero = 0; for (i = 0; i < MAX_NR_ZONES; i++) { *active += zones[i].nr_active; *inactive += zones[i].nr_inactive; *free += zones[i].free_pages; + *zero += zones[i].zero_pages; } } void get_zone_counts(unsigned long *active, - unsigned long *inactive, unsigned long *free) + unsigned long *inactive, unsigned long *free, unsigned long *zero) { struct pglist_data *pgdat; *active = 0; *inactive = 0; *free = 0; + *zero = 0; for_each_pgdat(pgdat) { - unsigned long l, m, n; - __get_zone_counts(&l, &m, &n, pgdat); + unsigned long l, m, n,o; + __get_zone_counts(&l, &m, &n, &o, pgdat); *active += l; *inactive += m; *free += n; + *zero += o; } } @@ -1039,6 +1098,7 @@ #define K(x) ((x) << (PAGE_SHIFT-10)) +const char *temp[3] = { "hot", "cold", "zero" }; /* * Show free area list (used inside shift_scroll-lock stuff) * We also calculate the percentage fragmentation. We do this by counting the @@ -1051,6 +1111,7 @@ unsigned long active; unsigned long inactive; unsigned long free; + unsigned long zero; struct zone *zone; for_each_zone(zone) { @@ -1071,10 +1132,10 @@ pageset = zone->pageset + cpu; - for (temperature = 0; temperature < 2; temperature++) + for (temperature = 0; temperature < 3; temperature++) printk("cpu %d %s: low %d, high %d, batch %d\n", cpu, - temperature ? 
"cold" : "hot", + temp[temperature], pageset->pcp[temperature].low, pageset->pcp[temperature].high, pageset->pcp[temperature].batch); @@ -1082,20 +1143,21 @@ } get_page_state(&ps); - get_zone_counts(&active, &inactive, &free); + get_zone_counts(&active, &inactive, &free, &zero); printk("\nFree pages: %11ukB (%ukB HighMem)\n", K(nr_free_pages()), K(nr_free_highpages())); printk("Active:%lu inactive:%lu dirty:%lu writeback:%lu " - "unstable:%lu free:%u slab:%lu mapped:%lu pagetables:%lu\n", + "unstable:%lu free:%u zero:%lu slab:%lu mapped:%lu pagetables:%lu\n", active, inactive, ps.nr_dirty, ps.nr_writeback, ps.nr_unstable, nr_free_pages(), + zero, ps.nr_slab, ps.nr_mapped, ps.nr_page_table_pages); @@ -1146,7 +1208,7 @@ spin_lock_irqsave(&zone->lock, flags); for (order = 0; order < MAX_ORDER; order++) { nr = 0; - list_for_each(elem, &zone->free_area[order].free_list) + list_for_each(elem, &zone->free_area[NOT_ZEROED][order].free_list) ++nr; total += nr << order; printk("%lu*%lukB ", nr, K(1UL) << order); @@ -1470,14 +1532,18 @@ for (order = 0; ; order++) { unsigned long bitmap_size; - INIT_LIST_HEAD(&zone->free_area[order].free_list); + INIT_LIST_HEAD(&zone->free_area[NOT_ZEROED][order].free_list); + INIT_LIST_HEAD(&zone->free_area[ZEROED][order].free_list); if (order == MAX_ORDER-1) { - zone->free_area[order].map = NULL; + zone->free_area[NOT_ZEROED][order].map = NULL; + zone->free_area[ZEROED][order].map = NULL; break; } bitmap_size = pages_to_bitmap_size(order, size); - zone->free_area[order].map = + zone->free_area[NOT_ZEROED][order].map = + (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size); + zone->free_area[ZEROED][order].map = (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size); } } @@ -1503,6 +1569,7 @@ pgdat->nr_zones = 0; init_waitqueue_head(&pgdat->kswapd_wait); + init_waitqueue_head(&pgdat->kscrubd_wait); for (j = 0; j < MAX_NR_ZONES; j++) { struct zone *zone = pgdat->node_zones + j; @@ -1525,6 +1592,7 @@ spin_lock_init(&zone->lru_lock); zone->zone_pgdat = pgdat; zone->free_pages = 0; + zone->zero_pages = 0; zone->temp_priority = zone->prev_priority = DEF_PRIORITY; @@ -1558,6 +1626,13 @@ pcp->high = 2 * batch; pcp->batch = 1 * batch; INIT_LIST_HEAD(&pcp->list); + + pcp = &zone->pageset[cpu].pcp[2]; /* zero pages */ + pcp->count = 0; + pcp->low = 0; + pcp->high = 2 * batch; + pcp->batch = 1 * batch; + INIT_LIST_HEAD(&pcp->list); } printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n", zone_names[j], realsize, batch); @@ -1687,7 +1762,7 @@ unsigned long nr_bufs = 0; struct list_head *elem; - list_for_each(elem, &(zone->free_area[order].free_list)) + list_for_each(elem, &(zone->free_area[NOT_ZEROED][order].free_list)) ++nr_bufs; seq_printf(m, "%6lu ", nr_bufs); } Index: linux-2.6.9/include/linux/mmzone.h =================================================================== --- linux-2.6.9.orig/include/linux/mmzone.h 2004-12-17 14:40:16.000000000 -0800 +++ linux-2.6.9/include/linux/mmzone.h 2004-12-22 14:24:56.000000000 -0800 @@ -51,7 +51,7 @@ }; struct per_cpu_pageset { - struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */ + struct per_cpu_pages pcp[3]; /* 0: hot. 
1: cold 2: cold zeroed pages */ #ifdef CONFIG_NUMA unsigned long numa_hit; /* allocated in intended node */ unsigned long numa_miss; /* allocated in non intended node */ @@ -107,10 +107,14 @@ * ZONE_HIGHMEM > 896 MB only page cache and user processes */ +#define NOT_ZEROED 0 +#define ZEROED 1 + struct zone { /* Fields commonly accessed by the page allocator */ unsigned long free_pages; unsigned long pages_min, pages_low, pages_high; + unsigned long zero_pages; /* * protection[] is a pre-calculated number of extra pages that must be * available in a zone in order for __alloc_pages() to allocate memory @@ -131,7 +135,7 @@ * free areas of different sizes */ spinlock_t lock; - struct free_area free_area[MAX_ORDER]; + struct free_area free_area[2][MAX_ORDER]; ZONE_PADDING(_pad1_) @@ -265,6 +269,9 @@ struct pglist_data *pgdat_next; wait_queue_head_t kswapd_wait; struct task_struct *kswapd; + + wait_queue_head_t kscrubd_wait; + struct task_struct *kscrubd; } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) @@ -274,9 +281,9 @@ extern struct pglist_data *pgdat_list; void __get_zone_counts(unsigned long *active, unsigned long *inactive, - unsigned long *free, struct pglist_data *pgdat); + unsigned long *free, unsigned long *zero, struct pglist_data *pgdat); void get_zone_counts(unsigned long *active, unsigned long *inactive, - unsigned long *free); + unsigned long *free, unsigned long *zero); void build_all_zonelists(void); void wakeup_kswapd(struct zone *zone); Index: linux-2.6.9/fs/proc/proc_misc.c =================================================================== --- linux-2.6.9.orig/fs/proc/proc_misc.c 2004-12-17 14:40:15.000000000 -0800 +++ linux-2.6.9/fs/proc/proc_misc.c 2004-12-22 14:24:56.000000000 -0800 @@ -158,13 +158,14 @@ unsigned long inactive; unsigned long active; unsigned long free; + unsigned long zero; unsigned long vmtot; unsigned long committed; unsigned long allowed; struct vmalloc_info vmi; get_page_state(&ps); - get_zone_counts(&active, &inactive, &free); + get_zone_counts(&active, &inactive, &free, &zero); /* * display in kilobytes. 
Index: linux-2.6.9/include/linux/sched.h
===================================================================
--- linux-2.6.9.orig/include/linux/sched.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sched.h	2004-12-22 14:24:56.000000000 -0800
@@ -715,6 +715,7 @@
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
+#define PF_KSCRUBD	0x00800000	/* I am kscrubd */
 
 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);

Index: linux-2.6.9/mm/Makefile
===================================================================
--- linux-2.6.9.orig/mm/Makefile	2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/Makefile	2004-12-22 14:24:56.000000000 -0800
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o scrubd.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o prio_tree.o \

Index: linux-2.6.9/mm/scrubd.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/mm/scrubd.c	2004-12-22 14:26:35.000000000 -0800
@@ -0,0 +1,146 @@
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <linux/file.h>
+#include <linux/suspend.h>
+#include <linux/sysctl.h>
+#include <linux/scrub.h>
+
+unsigned int sysctl_scrub_start = MAX_ORDER;	/* Off */
+unsigned int sysctl_scrub_stop = 2;	/* Minimum order of page to zero */
+
+/*
+ * sysctl handler for /proc/sys/vm/scrub_start
+ */
+int scrub_start_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (sysctl_scrub_start < MAX_ORDER) {
+		struct zone *zone;
+
+		for_each_zone(zone)
+			wakeup_kscrubd(zone);
+	}
+	return 0;
+}
+
+LIST_HEAD(zero_drivers);
+
+/*
+ * zero_highest_order_page takes a page off the freelist
+ * and then hands it off to block zeroing agents.
+ * The cleared pages are added to the back of
+ * the freelist where the page allocator may pick them up.
+ */
+int zero_highest_order_page(struct zone *z)
+{
+	int order;
+
+	for (order = MAX_ORDER-1; order >= sysctl_scrub_stop; order--) {
+		struct free_area *area = z->free_area[NOT_ZEROED] + order;
+		if (!list_empty(&area->free_list)) {
+			struct page *page = scrubd_rmpage(z, area, order);
+			struct list_head *l;
+
+			if (!page)
+				continue;
+
+			page->index = order;
+
+			list_for_each(l, &zero_drivers) {
+				struct zero_driver *driver = list_entry(l, struct zero_driver, list);
+				unsigned long size = PAGE_SIZE << order;
+
+				if (driver->start(page_address(page), size) == 0) {
+
+					unsigned ticks = (size*HZ)/driver->rate;
+					if (ticks) {
+						/* Wait the minimum time of the transfer */
+						current->state = TASK_INTERRUPTIBLE;
+						schedule_timeout(ticks);
+					}
+					/* Then keep on checking until transfer is complete */
+					while (!driver->check())
+						schedule();
+					goto out;
+				}
+			}
+
+			/* Unable to find a zeroing device that would
+			 * deal with this page so just do it on our own.
+			 * This will likely thrash the cpu caches.
+			 */
+			cond_resched();
+			clear_page(page_address(page), order);
+out:
+			end_zero_page(page);
+			cond_resched();
+			return 1 << order;
+		}
+	}
+	return 0;
+}
+
+/*
+ * scrub_pgdat() will work across all this node's zones.
+ */
+static void scrub_pgdat(pg_data_t *pgdat)
+{
+	int i;
+	unsigned long pages_zeroed;
+
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	do {
+		pages_zeroed = 0;
+		for (i = 0; i < pgdat->nr_zones; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			pages_zeroed += zero_highest_order_page(zone);
+		}
+	} while (pages_zeroed);
+}
+
+/*
+ * The background scrub daemon, started as a kernel thread
+ * from the init process.
+ */
+static int kscrubd(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t*)p;
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	cpumask_t cpumask;
+
+	daemonize("kscrubd%d", pgdat->node_id);
+	cpumask = node_to_cpumask(pgdat->node_id);
+	if (!cpus_empty(cpumask))
+		set_cpus_allowed(tsk, cpumask);
+
+	tsk->flags |= PF_MEMALLOC | PF_KSCRUBD;
+
+	for ( ; ; ) {
+		if (current->flags & PF_FREEZE)
+			refrigerator(PF_FREEZE);
+		prepare_to_wait(&pgdat->kscrubd_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&pgdat->kscrubd_wait, &wait);
+
+		scrub_pgdat(pgdat);
+	}
+	return 0;
+}
+
+static int __init kscrubd_init(void)
+{
+	pg_data_t *pgdat;
+
+	for_each_pgdat(pgdat)
+		pgdat->kscrubd
+		= find_task_by_pid(kernel_thread(kscrubd, pgdat, CLONE_KERNEL));
+	return 0;
+}
+
+module_init(kscrubd_init)
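The sleep in zero_highest_order_page() above is sized from the driver's advertised rate: ticks = (size*HZ)/rate. A worked example with assumed numbers (HZ=1000, a driver claiming 1 GB/s, an order-10 block of 4 KiB pages, i.e. 4 MiB) comes out to 3 jiffies, roughly 3 ms of sleep before polling check(). Note the sketch below does the arithmetic in 64 bits; in the patch itself the intermediate size*HZ product could overflow a 32-bit unsigned long for large blocks and high HZ.

/* Worked example of the ticks computation in zero_highest_order_page().
 * HZ and the rate are assumptions for illustration only. */
#include <stdio.h>

int main(void)
{
	unsigned long long hz = 1000;			/* assumed CONFIG_HZ */
	unsigned long long rate = 1ULL << 30;		/* driver claims ~1 GB/s */
	unsigned long long size = 4096ULL << 10;	/* order-10 block: 4 MiB */
	unsigned long long ticks = (size * hz) / rate;

	printf("sleep %llu jiffies (~%llu ms) before polling check()\n",
	       ticks, ticks * 1000 / hz);
	return 0;
}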
Index: linux-2.6.9/include/linux/scrub.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9/include/linux/scrub.h	2004-12-22 14:24:56.000000000 -0800
@@ -0,0 +1,48 @@
+#ifndef _LINUX_SCRUB_H
+#define _LINUX_SCRUB_H
+
+/*
+ * Definitions for scrubbing of memory include an interface
+ * for drivers that may allow the zeroing of memory
+ * without invalidating the caches.
+ *
+ * Christoph Lameter, December 2004.
+ */
+
+struct zero_driver {
+	int (*start)(void *, unsigned long);	/* Start bzero transfer */
+	int (*check)(void);		/* Check if bzero is complete */
+	unsigned long rate;		/* zeroing rate in bytes/sec */
+	struct list_head list;
+};
+
+extern struct list_head zero_drivers;
+
+extern unsigned int sysctl_scrub_start;
+extern unsigned int sysctl_scrub_stop;
+
+/* Registering and unregistering zero drivers */
+static inline void register_zero_driver(struct zero_driver *z)
+{
+	list_add(&z->list, &zero_drivers);
+}
+
+static inline void unregister_zero_driver(struct zero_driver *z)
+{
+	list_del(&z->list);
+}
+
+extern struct page *scrubd_rmpage(struct zone *zone, struct free_area *area, int order);
+
+static inline void wakeup_kscrubd(struct zone *zone)
+{
+	if (!waitqueue_active(&zone->zone_pgdat->kscrubd_wait))
+		return;
+	wake_up_interruptible(&zone->zone_pgdat->kscrubd_wait);
+}
+
+int scrub_start_handler(struct ctl_table *, int, struct file *,
+	void __user *, size_t *, loff_t *);
+
+extern void end_zero_page(struct page *page);
+#endif
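For driver authors, the whole contract is the three members of struct zero_driver. Here is a hypothetical skeleton against the interface above; the example_* names and the hardware they pretend to program are invented, and only the scrub.h types and calls come from this patch. Note that register_zero_driver() takes no lock, so registration is presumably meant to happen before scrubbing is enabled.

/* Hypothetical zeroing-engine module against the scrub.h interface
 * above.  All example_* identifiers are invented. */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/scrub.h>

static int example_zero_start(void *addr, unsigned long size)
{
	/* Would program a DMA engine to write zeros over addr..addr+size.
	 * Return 0 once the transfer is started; nonzero declines and
	 * lets kscrubd try the next driver or fall back to clear_page(). */
	return -1;			/* stub: always decline */
}

static int example_zero_check(void)
{
	/* Would poll transfer-complete status; nonzero means done. */
	return 1;
}

static struct zero_driver example_zero_driver = {
	.start	= example_zero_start,
	.check	= example_zero_check,
	.rate	= 100 * 1024 * 1024,	/* claimed 100 MB/s */
};

static int __init example_zero_init(void)
{
	register_zero_driver(&example_zero_driver);
	return 0;
}

static void __exit example_zero_exit(void)
{
	unregister_zero_driver(&example_zero_driver);
}

module_init(example_zero_init);
module_exit(example_zero_exit);
MODULE_LICENSE("GPL");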
Index: linux-2.6.9/kernel/sysctl.c
===================================================================
--- linux-2.6.9.orig/kernel/sysctl.c	2004-12-17 14:40:17.000000000 -0800
+++ linux-2.6.9/kernel/sysctl.c	2004-12-22 14:24:56.000000000 -0800
@@ -40,6 +40,7 @@
 #include <linux/times.h>
 #include <linux/limits.h>
 #include <linux/dcache.h>
+#include <linux/scrub.h>
 #include <linux/syscalls.h>
 
 #include <asm/uaccess.h>
@@ -816,6 +817,24 @@
 		.strategy	= &sysctl_jiffies,
 	},
 #endif
+	{
+		.ctl_name	= VM_SCRUB_START,
+		.procname	= "scrub_start",
+		.data		= &sysctl_scrub_start,
+		.maxlen		= sizeof(sysctl_scrub_start),
+		.mode		= 0644,
+		.proc_handler	= &scrub_start_handler,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= VM_SCRUB_STOP,
+		.procname	= "scrub_stop",
+		.data		= &sysctl_scrub_stop,
+		.maxlen		= sizeof(sysctl_scrub_stop),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
 	{ .ctl_name = 0 }
 };
 
Index: linux-2.6.9/include/linux/sysctl.h
===================================================================
--- linux-2.6.9.orig/include/linux/sysctl.h	2004-12-17 14:40:16.000000000 -0800
+++ linux-2.6.9/include/linux/sysctl.h	2004-12-22 14:24:56.000000000 -0800
@@ -168,6 +168,8 @@
 	VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
 	VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
 	VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+	VM_SCRUB_START=30,	/* page order at which kscrubd starts zeroing */
+	VM_SCRUB_STOP=31,	/* lowest page order kscrubd will zero */
 };

^ permalink raw reply	[flat|nested] 9+ messages in thread
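As posted, scrubbing is off by default: sysctl_scrub_start is initialized to MAX_ORDER, and the handler only wakes kscrubd once a smaller order is written. A trivial way to switch it on from user space (the order 4 here is an arbitrary example, and the file only exists on a patched kernel):

/* Enable background scrubbing by lowering scrub_start. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/scrub_start", "w");

	if (!f) {
		perror("/proc/sys/vm/scrub_start");
		return 1;
	}
	fprintf(f, "4\n");
	return fclose(f) ? 1 : 0;
}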
end of thread, other threads:[~2005-01-03 20:30 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <fa.n0l29ap.1nqg39@ifi.uio.no>
     [not found] ` <fa.n04s9ar.17sg3f@ifi.uio.no>
2004-12-24 21:10   ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Bodo Eggert
2004-12-26 23:02     ` Florian Weimer
2004-12-26 23:12       ` Linus Torvalds
2004-12-26 23:24         ` Florian Weimer
2004-12-27  1:37           ` Ingo Oeser
2004-12-27  4:33             ` Zwane Mwaikambo
2004-12-27  0:01         ` Chris Wedgwood
2005-01-03 20:30         ` Christoph Lameter
     [not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>
     [not found] ` <41C20E3E.3070209@yahoo.com.au>
2004-12-21 19:55   ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:33       ` Prezeroing V2 [1/4]: __GFP_ZERO / clear_page() removal Christoph Lameter
2004-12-23 19:34         ` Prezeroing V2 [3/4]: Add support for ZEROED and NOT_ZEROED free maps Christoph Lameter