* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  [not found] ` <Pine.LNX.4.58.0412211155340.1313@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
@ 2004-12-21 22:40   ` Andi Kleen
  2004-12-21 22:54     ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2004-12-21 22:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> @@ -0,0 +1,52 @@
> +/*
> + * Zero a page.
> + * rdi	page
> + */
> +	.globl zero_page
> +	.p2align 4
> +zero_page:
> +	xorl %eax,%eax
> +	movl $4096/64,%ecx
> +	shl %ecx, %esi

Surely must be shl %esi,%ecx

> +zero_page_c:
> +	movl $4096/8,%ecx
> +	shl %ecx, %esi

Same. Haven't tested.

But for the one instruction it seems overkill to me to have a new
function. How about you just extend clear_page with the order argument?

BTW I think Andrea has been playing with prezeroing on x86 and
he found no benefit at all. So it's doubtful it makes any sense
on x86/x86-64.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread
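Andi's suggestion, extending clear_page with an order argument, boils down to clearing 2^order contiguous pages in one call. Here is a minimal userspace C sketch of the intended semantics; the clear_pages name and the memset fallback are illustrative only, since the real kernel versions are per-architecture assembly:

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Illustrative C fallback: clear 2^order contiguous pages.
 * The kernel implements this per architecture in assembly. */
static void clear_pages(void *page, unsigned int order)
{
	memset(page, 0, (size_t)PAGE_SIZE << order);
}
```

With this signature the single-page case is simply clear_pages(p, 0), which is what makes the `#define clear_page(__p) zero_page(__p, 0)` wrapper discussed in the thread sufficient.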
* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-21 22:40 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Andi Kleen
@ 2004-12-21 22:54   ` Christoph Lameter
  2004-12-22 10:53     ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2004-12-21 22:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Tue, 21 Dec 2004, Andi Kleen wrote:

> Christoph Lameter <clameter@sgi.com> writes:
> > @@ -0,0 +1,52 @@
> > +/*
> > + * Zero a page.
> > + * rdi	page
> > + */
> > +	.globl zero_page
> > +	.p2align 4
> > +zero_page:
> > +	xorl %eax,%eax
> > +	movl $4096/64,%ecx
> > +	shl %ecx, %esi
>
> Surely must be shl %esi,%ecx

Ahh. Thanks.

> But for the one instruction it seems overkill to me to have a new
> function. How about you just extend clear_page with the order argument?

We can just

#define clear_page(__p) zero_page(__p, 0)

and remove clear_page?

> BTW I think Andrea has been playing with prezeroing on x86 and
> he found no benefit at all. So it's doubtful it makes any sense
> on x86/x86-64.

Andrea's approach was:

1. Zero hot pages
2. Zero single pages

which simply results in shifting the processing time somewhere else.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-21 22:54 ` Christoph Lameter
@ 2004-12-22 10:53   ` Andi Kleen
  2004-12-22 19:54     ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2004-12-22 10:53 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel

On Tue, Dec 21, 2004 at 02:54:46PM -0800, Christoph Lameter wrote:
> On Tue, 21 Dec 2004, Andi Kleen wrote:
>
> > Christoph Lameter <clameter@sgi.com> writes:
> > > @@ -0,0 +1,52 @@
> > > +/*
> > > + * Zero a page.
> > > + * rdi	page
> > > + */
> > > +	.globl zero_page
> > > +	.p2align 4
> > > +zero_page:
> > > +	xorl %eax,%eax
> > > +	movl $4096/64,%ecx
> > > +	shl %ecx, %esi
> >
> > Surely must be shl %esi,%ecx
>
> Ahh. Thanks.
>
> > But for the one instruction it seems overkill to me to have a new
> > function. How about you just extend clear_page with the order argument?
>
> We can just
>
> #define clear_page(__p) zero_page(__p, 0)
>
> and remove clear_page?

It depends. If you plan to do really big zero_page then it
may be worth experimenting with cache bypassing clears
(movntq) or even SSE2 16 byte stores (movntdq %xmm..,..)
and take out the rep ; stosq optimization. I tried it all
long ago and it wasn't a win for only 4K.

For normal 4K clear_page that's definitely not a win (tested)
and especially cache bypassing is a loss.

> > BTW I think Andrea has been playing with prezeroing on x86 and
> > he found no benefit at all. So it's doubtful it makes any sense
> > on x86/x86-64.
>
> Andrea's approach was:
>
> 1. Zero hot pages
> 2. Zero single pages
>
> which simply results in shifting the processing time somewhere else.

Yours too, at least on non-Altix, no? Can you demonstrate any benefit?
Where are the numbers?

I'm sceptical for example that there will be enough higher orders
to make the batch clearing worthwhile after the system has been up
for a few days. Normally memory tends to fragment rather badly in
Linux.

I suspect after some time your approach will just degenerate to be
the same as Andrea's, even if it should be a win at the beginning
(is it?)

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread
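The cache-bypassing clears Andi refers to (movntq, or SSE2 movntdq through %xmm registers) can be sketched in userspace with SSE2 intrinsics. This is only an illustration of the non-temporal store idea under discussion, not the kernel code; it assumes an x86 CPU with SSE2 and a 16-byte-aligned buffer:

```c
#include <emmintrin.h>	/* SSE2: _mm_stream_si128 compiles to movntdq */
#include <stddef.h>

#define PAGE_SIZE 4096

/* Clear 2^order pages with 16-byte non-temporal stores that bypass
 * the cache, so a large clear does not evict useful cache lines. */
static void clear_pages_nt(void *page, unsigned int order)
{
	__m128i zero = _mm_setzero_si128();
	__m128i *p = page;
	size_t n = ((size_t)PAGE_SIZE << order) / sizeof(__m128i);

	for (size_t i = 0; i < n; i++)
		_mm_stream_si128(&p[i], zero);
	_mm_sfence();	/* order the non-temporal stores before reuse */
}
```

As Andi notes, this tends to lose for a single hot 4K page and pays off, if at all, only for large batches of cache-cold memory.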
* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-22 10:53 ` Andi Kleen
@ 2004-12-22 19:54   ` Christoph Lameter
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-22 19:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 22 Dec 2004, Andi Kleen wrote:

> It depends. If you plan to do really big zero_page then it
> may be worth experimenting with cache bypassing clears
> (movntq) or even SSE2 16 byte stores (movntdq %xmm..,..)
> and take out the rep ; stosq optimization. I tried it all
> long ago and it wasn't a win for only 4K.
>
> For normal 4K clear_page that's definitely not a win (tested)
> and especially cache bypassing is a loss.

This may be better realized using a zeroing driver then.

> Yours too at least on non Altix no? Can you demonstrate any benefit?
> Where are the numbers?

In the initial discussion, see V1 [0/3].

> I'm sceptical for example that there will be enough higher orders
> to make the batch clearing worthwhile after the system is up for a days.
> Normally memory tends to fragment rather badly in Linux.
> I suspect after some time your approach will just degenerate to be
> the same as Andrea's, even if it should be a win at the beginning (is it?)

I have tried it and the numbers show clearly that this continues to be
a win, although the initial 7-8 fold speed increase degenerates into
3-4 fold over time (single thread performance).

^ permalink raw reply	[flat|nested] 21+ messages in thread
[parent not found: <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>]
* Re: Prezeroing V2 [0/3]: Why and When it works
  [not found] ` <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
@ 2004-12-23 20:27   ` Andi Kleen
  2004-12-23 21:02     ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2004-12-23 20:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> and why other approaches have not worked.
> o Instead of zero_page(p,order) extend clear_page to take second argument
> o Update all architectures to accept second argument for clear_pages

Sorry if there was a miscommunication, but ...

> 1. Aggregating zeroing operations to only apply to pages of higher order,
>    which results in many pages that will later become order 0 to be
>    zeroed in one go. For that purpose the existing clear_page function is
>    extended and made to take an additional argument specifying the order of
>    the page to be cleared.

But if you do that you should really use a separate function that
can use cache bypassing stores. Normal clear_page cannot use that
because it would be a loss when the data is soon used. So the two
changes don't really make sense.

Also I must say I'm still suspicious regarding your heuristic to
trigger gang faulting - with bad luck it could lead to a lot more
memory usage for specific applications that do very sparse usage of
memory. There should be at least an madvise flag to turn it off and a
sysctl, and it would be better to trigger only on a longer sequence of
consecutive faulted pages.

> 2. Hardware support for offloading zeroing from the cpu. This avoids
>    the invalidation of the cpu caches by extensive zeroing operations.
>
> The result is a significant increase of the page fault performance even for
> single threaded applications:
[...]

How about some numbers on i386?

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 20:27 ` Prezeroing V2 [0/3]: Why and When it works Andi Kleen
@ 2004-12-23 21:02   ` Christoph Lameter
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-23 21:02 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thu, 23 Dec 2004, Andi Kleen wrote:

> > 1. Aggregating zeroing operations to only apply to pages of higher order,
> >    which results in many pages that will later become order 0 to be
> >    zeroed in one go. For that purpose the existing clear_page function is
> >    extended and made to take an additional argument specifying the order of
> >    the page to be cleared.
>
> But if you do that you should really use a separate function that
> can use cache bypassing stores.
>
> Normal clear_page cannot use that because it would be a loss
> when the data is soon used.

clear_page is now used both in the cache hot and the no-cache-wanted
case.

> So the two changes don't really make sense.

Which two changes? If an arch can do zeroing without touching the cpu
caches then that can be done with a zero driver.

> Also I must say I'm still suspicious regarding your heuristic
> to trigger gang faulting - with bad luck it could lead to a lot
> more memory usage to specific applications that do very sparse
> usage of memory.

Gang faulting is not part of this patch. Please keep the issues
separate.

> There should be at least an madvise flag to turn it off and a sysctl
> and it would be better to trigger only on a longer sequence of
> consecutive faulted pages.

Again, this is not related to this patchset. Look at V13 of the page
fault scalability patch and you will find a /proc/sys/vm setting to
manipulate things. This is V2 of the prezeroing patch.

> How about some numbers on i386?

Umm. Yeah. I only have smallish i386 machines here. Maybe next year ;-)

^ permalink raw reply	[flat|nested] 21+ messages in thread
[parent not found: <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>]
[parent not found: <41C20E3E.3070209@yahoo.com.au>]
* Increase page fault rate by prezeroing V1 [0/3]: Overview
  [not found] ` <41C20E3E.3070209@yahoo.com.au>
@ 2004-12-21 19:55   ` Christoph Lameter
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running
on SMP systems with a high number of cpus. Single thread performance
shows only minor increases; only the performance of multi-threaded
applications increases significantly.

The most expensive operation in the page fault handler is (apart from
the SMP locking overhead) the zeroing of the page, which is also done
in the page fault handler. Others have seen this too and have tried to
provide a way to supply zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

The problem so far has been that simple zeroing of pages just shifts
the time spent somewhere else. Plus one would not want to zero hot
pages. This patch addresses those issues by making it more effective
to zero pages by:

1. Aggregating zeroing operations to mainly apply to larger order
   pages, which results in many later order 0 pages being zeroed in
   one go. For that purpose a new architecture specific function
   zero_page(page, order) is introduced.

2. Hardware support for offloading zeroing from the cpu. This avoids
   the invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance
even for single threaded applications:

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  1   1    1      0.014s    0.110s   0.012s 524292.194 517665.538

This is a performance increase by a factor of 8!

The performance can only be upheld if enough zeroed pages are
available. In a heavy memory intensive benchmark the system will run
out of these very fast, but the efficient algorithm for page zeroing
still makes this a winner (8 way system with 6 GB RAM, no hardware
zeroing support):

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852
  4   3    2      0.170s   14.909s   7.097s  52150.369  98643.687
  4   3    4      0.181s   16.597s   5.079s  46869.167 135642.420
  4   3    8      0.166s   23.239s   4.037s  33599.215 179791.120

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.183s    2.750s   2.093s 268077.996 267952.890
  4   3    2      0.185s    4.876s   2.097s 155344.562 263967.292
  4   3    4      0.150s    6.617s   2.097s 116205.793 264774.080
  4   3    8      0.186s   13.693s   3.054s  56659.819 221701.073

The patch is composed of 3 parts:

[1/3] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO
	flag and return zeroed memory on request. Modifies locations
	throughout the linux sources that retrieve a page and then
	zero it to request a zeroed page instead. Adds new low level
	zero_page functions for i386, ia64 and x86_64 (x86_64
	untested).

[2/3] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a
	background daemon called scrubd. scrubd is disabled by default
	but can be enabled by writing an order number to
	/proc/sys/vm/scrub_start. If a page of that order is coalesced
	then the scrub daemon will start zeroing until all pages of
	order /proc/sys/vm/scrub_stop and higher are zeroed.

[3/3] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into
	hardware. With hardware support there will be minimal impact
	of zeroing on the performance of the system.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Prezeroing V2 [0/3]: Why and When it works
  2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
@ 2004-12-23 19:29   ` Christoph Lameter
  2004-12-23 19:49     ` Arjan van de Ven
                        ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:29 UTC (permalink / raw)
  Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Changes from V1 to V2:
o Add explanation--and some bench results--as to why and when this
  optimization works and why other approaches have not worked.
o Instead of zero_page(p,order) extend clear_page to take a second argument
o Update all architectures to accept the second argument for clear_pages
o Extensive removal of page alloc/clear_page combinations from all archs
o Blank / typo fixups
o SGI BTE zero driver update: Use node specific variables instead of
  cpu specific ones since a cpu may be responsible for multiple nodes.

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running
on SMP systems with a high number of cpus. Single thread performance
shows only minor increases; only the performance of multi-threaded
applications increases significantly.

The most expensive operation in the page fault handler is (apart from
the SMP locking overhead) the zeroing of the page. This zeroing means
that all cachelines of the faulted page (on Altix that means all 128
cachelines of 128 bytes each) must be loaded and later written back.
This patch makes it possible to avoid loading all cachelines if only a
part of the cachelines of that page is needed immediately after the
fault. Thus the patch will only be effective for sparsely accessed
memory, which is typical for anonymous memory and pte maps. Prezeroed
pages will be used for those purposes. Unzeroed pages will be used as
usual for the other purposes.

Others have also thought that prezeroing could be a benefit and have
tried to provide a way to supply zeroed pages to the page fault
handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

However, these attempts have tried to zero pages soon to be accessed
(and which may already have recently been accessed). Elements of these
pages are thus already in the cache. Approaches like that will only
shift processing a bit and not yield performance benefits. Prezeroing
only makes sense for pages that are not currently needed and that are
not in the cpu caches. Pages that have recently been touched and that
soon will be touched again are better hot zeroed, since the zeroing
will largely be done to cachelines already in the cpu caches.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher
   order, which results in many pages that will later become order 0
   being zeroed in one go. For that purpose the existing clear_page
   function is extended and made to take an additional argument
   specifying the order of the page to be cleared.

2. Hardware support for offloading zeroing from the cpu. This avoids
   the invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance
even for single threaded applications:

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  1   1    1      0.014s    0.110s   0.012s 524292.194 517665.538

The performance can only be upheld if enough zeroed pages are
available. In heavy memory intensive benchmarks the system could
potentially run out of zeroed pages, but the efficient algorithm for
page zeroing still shows this to be a winner (8 way system with 6 GB
RAM, no hardware zeroing support):

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852
  4   3    2      0.170s   14.909s   7.097s  52150.369  98643.687
  4   3    4      0.181s   16.597s   5.079s  46869.167 135642.420
  4   3    8      0.166s   23.239s   4.037s  33599.215 179791.120

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.183s    2.750s   2.093s 268077.996 267952.890
  4   3    2      0.185s    4.876s   2.097s 155344.562 263967.292
  4   3    4      0.150s    6.617s   2.097s 116205.793 264774.080
  4   3    8      0.186s   13.693s   3.054s  56659.819 221701.073

Note that zeroing of pages makes no sense if the application touches
all cache lines of an allocated page (for that reason prezeroing has
no influence on benchmarks like lmbench), since the extensive caching
of modern cpus means that the zeroes written to a hot zeroed page will
then be overwritten by the application in the cpu cache, and thus the
zeros will never make it to memory! The test program used above
touches only one 128 byte cache line of a 16k page (ia64). Here is
another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine  User     System   Wall   flt/cpu/s fault/wsec
  1   1   1    1   0.01s    0.12s   0.01s  500813.853 497925.891
  1   1   1    2   0.01s    0.11s   0.01s  493453.103 472877.725
  1   1   1    4   0.02s    0.10s   0.01s  479351.658 471507.415
  1   1   1    8   0.01s    0.13s   0.01s  424742.054 416725.013
  1   1   1   16   0.05s    0.12s   0.01s  347715.359 336983.834
  1   1   1   32   0.12s    0.13s   0.02s  258112.286 256246.731
  1   1   1   64   0.24s    0.14s   0.03s  169896.381 168189.283
  1   1   1  128   0.49s    0.14s   0.06s  102300.257 101674.435

The benefits of prezeroing become smaller the more cache lines of a
page are touched. Prezeroing can only be effective if memory is not
immediately touched after the anonymous page fault.

The patch is composed of 4 parts:

[1/4] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO
	flag and return zeroed memory on request. Modifies locations
	throughout the linux sources that retrieve a page and then
	zero it to request a zeroed page instead.

[2/4] Architecture specific clear_page updates
	Adds a second, order argument to clear_page and updates all
	arches. Note: The first two patches may be used alone if no
	zeroing engine is wanted.

[3/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a
	background daemon called scrubd. scrubd is disabled by default
	but can be enabled by writing an order number to
	/proc/sys/vm/scrub_start. If a page of that order or higher is
	coalesced then the scrub daemon will start zeroing until all
	pages of order /proc/sys/vm/scrub_stop and higher are zeroed,
	and then go back to sleep.

	In an SMP environment the scrub daemon typically runs on the
	most idle cpu. Thus a single threaded application running on
	one cpu may have the other cpu zeroing pages for it etc. The
	scrub daemon is hardly noticeable and usually finishes zeroing
	quickly since most processors are optimized for linear memory
	filling.

[4/4] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into
	hardware. With hardware support there will be minimal impact
	of zeroing on the performance of the system.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-23 19:49   ` Arjan van de Ven
  2004-12-23 20:57     ` Matt Mackall
                        ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Arjan van de Ven @ 2004-12-23 19:49 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.

eh why will all cachelines be loaded? Surely you can avoid the
write-allocate behavior for this case.....

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:49 ` Arjan van de Ven
@ 2004-12-23 20:57 ` Matt Mackall
  2004-12-23 21:01 ` Paul Mackerras
  2004-12-23 21:11 ` Paul Mackerras
  3 siblings, 0 replies; 21+ messages in thread
From: Matt Mackall @ 2004-12-23 20:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Thu, Dec 23, 2004 at 11:29:10AM -0800, Christoph Lameter wrote:
> 2. Hardware support for offloading zeroing from the cpu. This avoids
>    the invalidation of the cpu caches by extensive zeroing operations.

I'm wondering if it would be possible to use typical video cards for
hardware zeroing. We could set aside a page's worth of zeros in video
memory and then use the card's DMA engines to clear pages on the host.
This could be done in fbdev drivers, which would register a zeroer
with the core.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:49 ` Arjan van de Ven
  2004-12-23 20:57 ` Matt Mackall
@ 2004-12-23 21:01 ` Paul Mackerras
  2004-12-23 21:11 ` Paul Mackerras
  3 siblings, 0 replies; 21+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:01 UTC (permalink / raw)
  To: Christoph Lameter

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.

On ppc64 we avoid having to zero newly-allocated page table pages by
using a slab cache for them, with a constructor function that zeroes
them. Page table pages naturally end up being full of zeroes when
they are freed, since ptep_get_and_clear, pmd_clear or pgd_clear has
been used on every non-zero entry by that stage. Thus there is no
extra work required either when allocating them or freeing them.

I don't see any point in your patches for systems which don't have
some magic hardware for zeroing pages. Your patch seems like a lot of
extra code that only benefits a very small number of machines.

Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
                    ` (2 preceding siblings ...)
  2004-12-23 21:01 ` Paul Mackerras
@ 2004-12-23 21:11 ` Paul Mackerras
  2004-12-23 21:37   ` Andrew Morton
  2004-12-23 21:48   ` Linus Torvalds
  1 sibling, 2 replies; 21+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:11 UTC (permalink / raw)
  To: Christoph Lameter

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page.

Re-reading this I see that you mean the zeroing of the page that is
mapped into the process address space, not the page table pages. So
ignore my previous reply.

Do you have any statistics on how often a page fault needs to supply a
page of zeroes versus supplying a copy of an existing page, for real
applications?

In any case, unless you have magic page-zeroing hardware, I am still
inclined to think that zeroing the page at the time of the fault is
the most efficient, since that means the page will be hot in the cache
for the process to use. If you zero it earlier using CPU stores, it
can only cause more overall memory traffic, as far as I can see.

I did some measurements once on my G5 powermac (running a ppc64 linux
kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
page. This is real-life elapsed time in the kernel, not just some
cache-hot benchmark measurement. Thus I don't think your patch will
gain us anything on ppc64.

Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11 ` Paul Mackerras
@ 2004-12-23 21:37   ` Andrew Morton
  2004-12-23 23:00     ` Paul Mackerras
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2004-12-23 21:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> Christoph Lameter writes:
>
> > The most expensive operation in the page fault handler is (apart of SMP
> > locking overhead) the zeroing of the page.
>
> Re-reading this I see that you mean the zeroing of the page that is
> mapped into the process address space, not the page table pages. So
> ignore my previous reply.
>
> Do you have any statistics on how often a page fault needs to supply a
> page of zeroes versus supplying a copy of an existing page, for real
> applications?

When the workload is a gcc run, the pagefault handler dominates the
system time. That's the page zeroing.

> In any case, unless you have magic page-zeroing hardware, I am still
> inclined to think that zeroing the page at the time of the fault is
> the most efficient, since that means the page will be hot in the cache
> for the process to use. If you zero it earlier using CPU stores, it
> can only cause more overall memory traffic, as far as I can see.

x86's movnta instructions provide a way of initialising memory without
trashing the caches and it has pretty good bandwidth, I believe. We
should wire that up to these patches and see if it speeds things up.

> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.

40GB/s. Is that straight into L1 or does the measurement include
writeback?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:37 ` Andrew Morton
@ 2004-12-23 23:00   ` Paul Mackerras
  0 siblings, 0 replies; 21+ messages in thread
From: Paul Mackerras @ 2004-12-23 23:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Andrew Morton writes:

> When the workload is a gcc run, the pagefault handler dominates the system
> time. That's the page zeroing.

For a program which uses a lot of heap and doesn't fork, that sounds
reasonable.

> x86's movnta instructions provide a way of initialising memory without
> trashing the caches and it has pretty good bandwidth, I believe. We should
> wire that up to these patches and see if it speeds things up.

Yes. I don't know the movnta instruction, but surely, whatever scheme
is used, there has to be a snoop for every cache line's worth of
memory that is zeroed.

The other point is that having the page hot in the cache may well be a
benefit to the program. Using any sort of cache-bypassing zeroing
might not actually make things faster, when the user time as well as
the system time is taken into account.

> > I did some measurements once on my G5 powermac (running a ppc64 linux
> > kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> > page.
>
> 40GB/s. Is that straight into L1 or does the measurement include writeback?

It is the average elapsed time in clear_page, so it would include the
writeback of any cache lines displaced by the zeroing, but not the
writeback of the newly-zeroed cache lines (which we hope will be
modified by the program before they get written back anyway). This is
using the dcbz (data cache block zero) instruction, which establishes
a cache line in modified state with zero contents without any memory
traffic other than a cache line kill transaction sent to the other
CPUs and possible writeback of a dirty cache line displaced by the
newly-zeroed cache line.

The new cache line is established in the L2 cache, because the L1 is
write-through on the G5, and all stores and dcbz instructions have to
go to the L2 cache. Thus, on the G5 (and POWER4, which is similar) I
don't think there will be much if any benefit from having pre-zeroed
cache-cold pages. We can establish the zero lines in cache much
faster using dcbz than we can by reading them in from main memory. If
the program uses only a few cache lines out of each new page, then
reading them from memory might be faster, but that seems unlikely.

Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11 ` Paul Mackerras
  2004-12-23 21:37 ` Andrew Morton
@ 2004-12-23 21:48 ` Linus Torvalds
  2004-12-23 22:34 ` Zwane Mwaikambo
  ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-12-23 21:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Christoph Lameter, Andrew Morton, linux-ia64, torvalds, linux-mm,
	Kernel Mailing List

On Fri, 24 Dec 2004, Paul Mackerras wrote:
>
> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.  This is real-life elapsed time in the kernel, not just some
> cache-hot benchmark measurement.  Thus I don't think your patch will
> gain us anything on ppc64.

Well, the thing is, if we really _know_ the machine is idle (and not
just waiting for something like disk IO), it might be a good idea to
just pre-zero everything we can.

The question to me is whether we can have a good enough heuristic to
notice that it triggers often enough to matter, but seldom enough that
it really won't disturb anybody.  And "disturb" very much includes
things like laptop battery life, scheduling latencies, memory bus
traffic _and_ cache contents.

And I really don't see a very good heuristic.  Maybe it might literally
be something like "five-second load average goes down to zero" (we've
got fixed-point arithmetic with eleven fractional bits, so we can tune
just how close to "zero" we want to get).  The load average is
system-wide and takes disk load (which tends to imply latency-critical
work) into account, so that might actually work out reasonably well as
a "the system really is quiescent" test.

So if we make the "what load is considered low" tunable, a system
administrator can use that to make it more aggressive.  And indeed, you
might have a cron-job that says "be more aggressive at clearing pages
between 2AM and 4AM in the morning" or something - if you have so much
memory that it actually matters if you clear the memory just
occasionally.

And the tunable load-average check has another advantage: if you want
to benchmark it, you can first set it to true zero (basically never),
and run the benchmark, and then you can set it to something very
aggressive ("clear pages every five seconds regardless of load") and
re-run.

Does this sound sane?  Christoph - can you try making the "scrub
daemon" do that?  Instead of the "scrub-low" and "scrub-high" (or in
_addition_ to them), do a "scrub-load" thing that takes a scaled
integer, and compares it with "avenrun[0]" in kernel/timer.c:
calc_load() when the average is updated every five seconds.

Personally, at least for desktop usage, I think that the load average
would work wonderfully well.  I know my machines are often at basically
zero load, and then having low-latency zero-pages when I sit down
sounds like a good idea.  Whether there is _enough_ free memory around
for a 5-second thing to work out well, I have no idea..

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
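The load check Linus describes can be sketched in a few lines. FSHIFT and FIXED_1 below match the kernel's fixed-point load-average representation (eleven fractional bits, so a load of 1.0 is stored as 2048); the threshold variable and helper function are invented for this sketch.

```c
/* Fixed-point load-average check, modeled on kernel/timer.c.
 * avenrun[0] is updated every five seconds in calc_load(). */
#define FSHIFT   11                 /* eleven fractional bits */
#define FIXED_1  (1 << FSHIFT)      /* a load of 1.0 == 2048 */

/* Tunable scaled integer, as Linus suggests: scrub only when the
 * load average is below roughly 0.1 (204/2048). */
static unsigned long scrub_load = FIXED_1 / 10;

/* Would be called from the same place calc_load() refreshes avenrun[]. */
static int scrub_should_run(unsigned long avenrun0)
{
	return avenrun0 < scrub_load;
}
```

Setting scrub_load to 0 never scrubs, and a very large value scrubs on every five-second update regardless of load, which are exactly the two benchmark endpoints described above.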
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48 ` Linus Torvalds
@ 2004-12-23 22:34 ` Zwane Mwaikambo
  2004-12-24  9:14 ` Arjan van de Ven
  2004-12-24 16:17 ` Christoph Lameter
  2 siblings, 0 replies; 21+ messages in thread
From: Zwane Mwaikambo @ 2004-12-23 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

Isn't the basic premise very similar to the following paper:

http://www.usenix.org/publications/library/proceedings/osdi99/full_papers/dougan/dougan_html/dougan.html

In fact I thought ppc32 did something akin to this.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48 ` Linus Torvalds
  2004-12-23 22:34 ` Zwane Mwaikambo
@ 2004-12-24  9:14 ` Arjan van de Ven
  2004-12-24 18:21 ` Linus Torvalds
  2004-12-24 16:17 ` Christoph Lameter
  2 siblings, 1 reply; 21+ messages in thread
From: Arjan van de Ven @ 2004-12-24  9:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

Problem is: will it buy you anything if you use the page again anyway,
since such pages will be cold cached now?  So for sure some of it is
only shifting latency from the kernel side to the userspace side, but
readprofile doesn't measure the latter so it *looks* better...

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24  9:14 ` Arjan van de Ven
@ 2004-12-24 18:21 ` Linus Torvalds
  2004-12-24 18:57 ` Arjan van de Ven
  2004-12-27 22:50 ` David S. Miller
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-12-24 18:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Fri, 24 Dec 2004, Arjan van de Ven wrote:
>
> problem is.. will it buy you anything if you use the page again
> anyway... since such pages will be cold cached now. So for sure some of
> it is only shifting latency from kernel side to userspace side, but
> readprofile doesn't measure the latter so it *looks* better...

Absolutely.  I would want to see some real benchmarks before we do
this.  Not just some microbenchmark of "how many page faults can we
take without _using_ the page at all".

I agree 100% with you that we shouldn't shift the costs around.  Having
a nice hot-spot that we know about is a good thing, and it means that
performance profiles show what the time is really spent on.  Often,
getting rid of the hotspot just smears out the work over a wider area,
making other optimizations (like trying to make the memory footprint
_smaller_ and removing the work entirely that way) totally impossible,
because now the performance profile just has a constant background
noise and you can't tell what the real problem is.

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24 18:21 ` Linus Torvalds
@ 2004-12-24 18:57 ` Arjan van de Ven
  2004-12-27 22:50 ` David S. Miller
  1 sibling, 0 replies; 21+ messages in thread
From: Arjan van de Ven @ 2004-12-24 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Fri, 2004-12-24 at 10:21 -0800, Linus Torvalds wrote:
> On Fri, 24 Dec 2004, Arjan van de Ven wrote:
> >
> > problem is.. will it buy you anything if you use the page again
> > anyway... since such pages will be cold cached now. So for sure some of
> > it is only shifting latency from kernel side to userspace side, but
> > readprofile doesn't measure the latter so it *looks* better...
>
> Absolutely. I would want to see some real benchmarks before we do this.
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".
>
> I agree 100% with you that we shouldn't shift the costs around. Having a
> nice hot-spot that we know about is a good thing, and it means that
> performance profiles show what the time is really spent on. Often getting
> rid of the hotspot just smears out the work over a wider area, making
> other optimizations (like trying to make the memory footprint _smaller_
> and removing the work entirely that way) totally impossible because now
> the performance profile just has a constant background noise and you can't
> tell what the real problem is.

I suspect it's even worse.  Think about it: you can spew 4k of zeroes
into your L1 cache really fast (assuming your cpu is smart enough to
avoid write-allocate for rep stosl; not sure which cpus are).  I
suspect you can do that faster than a cache miss or two.  And at that
point the page is cache hot, so reads don't miss either.

All this makes me wonder if there is any scenario where this thing will
be a gain, other than on cpus that aren't smart enough to avoid the
write-allocate.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works 2004-12-24 18:21 ` Linus Torvalds 2004-12-24 18:57 ` Arjan van de Ven @ 2004-12-27 22:50 ` David S. Miller 2004-12-28 11:53 ` Marcelo Tosatti 1 sibling, 1 reply; 21+ messages in thread From: David S. Miller @ 2004-12-27 22:50 UTC (permalink / raw) To: Linus Torvalds Cc: arjan, paulus, clameter, akpm, linux-ia64, linux-mm, linux-kernel On Fri, 24 Dec 2004 10:21:24 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > Absolutely. I would want to see some real benchmarks before we do this. > Not just some microbenchmark of "how many page faults can we take without > _using_ the page at all". Here's my small contribution. I did three "make -j3 vmlinux" timed runs, one running a kernel without the pre-zeroing stuff applied, one with it applied. It did shave a few seconds off the build consistently. Here is the before: real 8m35.248s user 15m54.132s sys 1m1.098s real 8m32.202s user 15m54.329s sys 1m0.229s real 8m31.932s user 15m54.160s sys 1m0.245s and here is the after: real 8m29.375s user 15m43.296s sys 0m59.549s real 8m28.213s user 15m39.819s sys 0m58.790s real 8m26.140s user 15m44.145s sys 0m58.872s ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-27 22:50 ` David S. Miller
@ 2004-12-28 11:53 ` Marcelo Tosatti
  0 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2004-12-28 11:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, arjan, paulus, clameter, akpm, linux-ia64,
	linux-mm, linux-kernel

On Mon, Dec 27, 2004 at 02:50:57PM -0800, David S. Miller wrote:
> On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
>
> > Absolutely. I would want to see some real benchmarks before we do this.
> > Not just some microbenchmark of "how many page faults can we take without
> > _using_ the page at all".
>
> Here's my small contribution.  I did three "make -j3 vmlinux" timed
> runs, one running a kernel without the pre-zeroing stuff applied,
> one with it applied.  It did shave a few seconds off the build
> consistently.  Here is the before:
>
> real    8m35.248s
> user    15m54.132s
> sys     1m1.098s
>
> real    8m32.202s
> user    15m54.329s
> sys     1m0.229s
>
> real    8m31.932s
> user    15m54.160s
> sys     1m0.245s
>
> and here is the after:
>
> real    8m29.375s
> user    15m43.296s
> sys     0m59.549s
>
> real    8m28.213s
> user    15m39.819s
> sys     0m58.790s
>
> real    8m26.140s
> user    15m44.145s
> sys     0m58.872s

Christoph and other SGI fellows,

Get your patch into STP; once it's there we can do some wider x86
benchmarking easily.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48 ` Linus Torvalds
  2004-12-23 22:34 ` Zwane Mwaikambo
  2004-12-24  9:14 ` Arjan van de Ven
@ 2004-12-24 16:17 ` Christoph Lameter
  2 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Andrew Morton, linux-ia64, linux-mm,
	Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> So if we make the "what load is considered low" tunable, a system
> administrator can use that to make it more aggressive. And indeed, you
> might have a cron-job that says "be more aggressive at clearing pages
> between 2AM and 4AM in the morning" or something - if you have so much
> memory that it actually matters if you clear the memory just occasionally.
>
> And the tunable load-average check has another advantage: if you want to
> benchmark it, you can first set it to true zero (basically never), and run
> the benchmark, and then you can set it to something very aggressive ("clear
> pages every five seconds regardless of load") and re-run.
>
> Does this sound sane? Christoph - can you try making the "scrub daemon" do
> that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to
> them), do a "scrub-load" thing that takes a scaled integer, and compares it
> with "avenrun[0]" in kernel/timer.c: calc_load() when the average is
> updated every five seconds..

Sure, V3 will have that.  So far the impact of zeroing is quite minimal
on IA64 (even without using hardware); the big zeroing happens
immediately after activating it anyway.  I have not seen any measurable
effect on benchmarks, even with 4G allocations on a 6G machine.

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

The CPU can do a couple of gigabytes of zeroing per second per CPU, and
the zeroing zeros local RAM.  On my 6G machine with 8 CPUs it can only
take a fraction of a second to zero all RAM.

Merry Christmas, I am off till next year.  SGI mandatory holiday
shutdown, so all addicts have to go cold turkey ;-)

^ permalink raw reply	[flat|nested] 21+ messages in thread
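Christoph's "fraction of a second" figure checks out as a back-of-the-envelope calculation. The 2 GB/s per-CPU rate below is this sketch's assumption, standing in for his rough "couple of gigs per second" estimate.

```c
/* Time to zero all RAM when each CPU zeroes its own local memory in
 * parallel.  The per-CPU bandwidth is an assumed round number, not a
 * measurement. */
static double seconds_to_zero(double ram_gb, int cpus, double gb_per_s)
{
	return ram_gb / (cpus * gb_per_s);
}
/* 6 GB across 8 CPUs at 2 GB/s each: 6 / 16 = 0.375 s. */
```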
end of thread, other threads:[~2004-12-28 14:30 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com.suse.lists.linux.kernel>
     [not found] ` <41C20E3E.3070209@yahoo.com.au.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.58.0412211154100.1313@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.58.0412211155340.1313@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
2004-12-21 22:40 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Andi Kleen
2004-12-21 22:54 ` Christoph Lameter
2004-12-22 10:53 ` Andi Kleen
2004-12-22 19:54 ` Christoph Lameter
     [not found] ` <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
2004-12-23 20:27 ` Prezeroing V2 [0/3]: Why and When it works Andi Kleen
2004-12-23 21:02 ` Christoph Lameter
     [not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>
     [not found] ` <41C20E3E.3070209@yahoo.com.au>
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:49 ` Arjan van de Ven
2004-12-23 20:57 ` Matt Mackall
2004-12-23 21:01 ` Paul Mackerras
2004-12-23 21:11 ` Paul Mackerras
2004-12-23 21:37 ` Andrew Morton
2004-12-23 23:00 ` Paul Mackerras
2004-12-23 21:48 ` Linus Torvalds
2004-12-23 22:34 ` Zwane Mwaikambo
2004-12-24  9:14 ` Arjan van de Ven
2004-12-24 18:21 ` Linus Torvalds
2004-12-24 18:57 ` Arjan van de Ven
2004-12-27 22:50 ` David S. Miller
2004-12-28 11:53 ` Marcelo Tosatti
2004-12-24 16:17 ` Christoph Lameter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).