* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  [not found] ` <Pine.LNX.4.58.0412211155340.1313@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
@ 2004-12-21 22:40   ` Andi Kleen
  2004-12-21 22:54     ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2004-12-21 22:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> @@ -0,0 +1,52 @@
> +/*
> + * Zero a page.
> + * rdi	page
> + */
> +	.globl zero_page
> +	.p2align 4
> +zero_page:
> +	xorl %eax,%eax
> +	movl $4096/64,%ecx
> +	shl %ecx, %esi

Surely must be shl %esi,%ecx

> +zero_page_c:
> +	movl $4096/8,%ecx
> +	shl %ecx, %esi

Same. Haven't tested.

But for the one instruction it seems overkill to me to have a new
function. How about you just extend clear_page with the order argument?

BTW I think Andrea has been playing with prezeroing on x86 and
he found no benefit at all. So it's doubtful it makes any sense
on x86/x86-64.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread
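Andi's suggestion, extending clear_page with an order argument, boils down to clearing 2^order contiguous pages in one call. Here is a minimal userspace C sketch of the intended semantics; the clear_pages name and the memset fallback are illustrative only, since the real kernel versions are per-architecture assembly:

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Illustrative C fallback: clear 2^order contiguous pages.
 * The kernel implements this per architecture in assembly. */
static void clear_pages(void *page, unsigned int order)
{
	memset(page, 0, (size_t)PAGE_SIZE << order);
}
```

With this signature the single-page case is simply clear_pages(p, 0), which is what makes the `#define clear_page(__p) zero_page(__p, 0)` wrapper discussed in the thread sufficient.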
* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-21 22:40 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Andi Kleen
@ 2004-12-21 22:54   ` Christoph Lameter
  2004-12-22 10:53     ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2004-12-21 22:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Tue, 21 Dec 2004, Andi Kleen wrote:

> Christoph Lameter <clameter@sgi.com> writes:
> > @@ -0,0 +1,52 @@
> > +/*
> > + * Zero a page.
> > + * rdi	page
> > + */
> > +	.globl zero_page
> > +	.p2align 4
> > +zero_page:
> > +	xorl %eax,%eax
> > +	movl $4096/64,%ecx
> > +	shl %ecx, %esi
>
> Surely must be shl %esi,%ecx

Ahh. Thanks.

> But for the one instruction it seems overkill to me to have a new
> function. How about you just extend clear_page with the order argument?

We can just

#define clear_page(__p) zero_page(__p, 0)

and remove clear_page?

> BTW I think Andrea has been playing with prezeroing on x86 and
> he found no benefit at all. So it's doubtful it makes any sense
> on x86/x86-64.

Andrea's approach was:

1. Zero hot pages
2. Zero single pages

which simply results in shifting the processing time somewhere else.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-21 22:54 ` Christoph Lameter
@ 2004-12-22 10:53   ` Andi Kleen
  2004-12-22 19:54     ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2004-12-22 10:53 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andi Kleen, linux-kernel

On Tue, Dec 21, 2004 at 02:54:46PM -0800, Christoph Lameter wrote:
> On Tue, 21 Dec 2004, Andi Kleen wrote:
>
> > Christoph Lameter <clameter@sgi.com> writes:
> > > @@ -0,0 +1,52 @@
> > > +/*
> > > + * Zero a page.
> > > + * rdi	page
> > > + */
> > > +	.globl zero_page
> > > +	.p2align 4
> > > +zero_page:
> > > +	xorl %eax,%eax
> > > +	movl $4096/64,%ecx
> > > +	shl %ecx, %esi
> >
> > Surely must be shl %esi,%ecx
>
> Ahh. Thanks.
>
> > But for the one instruction it seems overkill to me to have a new
> > function. How about you just extend clear_page with the order argument?
>
> We can just
>
> #define clear_page(__p) zero_page(__p, 0)
>
> and remove clear_page?

It depends. If you plan to do really big zero_page then it
may be worth experimenting with cache bypassing clears
(movntq) or even SSE2 16 byte stores (movntdq %xmm..,..)
and take out the rep ; stosq optimization. I tried it all
long ago and it wasn't a win for only 4K.

For normal 4K clear_page that's definitely not a win (tested)
and especially cache bypassing is a loss.

> > BTW I think Andrea has been playing with prezeroing on x86 and
> > he found no benefit at all. So it's doubtful it makes any sense
> > on x86/x86-64.
>
> Andrea's approach was:
>
> 1. Zero hot pages
> 2. Zero single pages
>
> which simply results in shifting the processing time somewhere else.

Yours too, at least on non-Altix, no? Can you demonstrate any benefit?
Where are the numbers?

I'm sceptical for example that there will be enough higher orders
to make the batch clearing worthwhile after the system has been up
for a few days. Normally memory tends to fragment rather badly in
Linux.

I suspect after some time your approach will just degenerate to be
the same as Andrea's, even if it should be a win at the beginning
(is it?)

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread
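The cache-bypassing clears Andi refers to (movntq, or SSE2 movntdq through %xmm registers) can be sketched in userspace with SSE2 intrinsics. This is only an illustration of the non-temporal store idea under discussion, not the kernel code; it assumes an x86 CPU with SSE2 and a 16-byte-aligned buffer:

```c
#include <emmintrin.h>	/* SSE2: _mm_stream_si128 compiles to movntdq */
#include <stddef.h>

#define PAGE_SIZE 4096

/* Clear 2^order pages with 16-byte non-temporal stores that bypass
 * the cache, so a large clear does not evict useful cache lines. */
static void clear_pages_nt(void *page, unsigned int order)
{
	__m128i zero = _mm_setzero_si128();
	__m128i *p = page;
	size_t n = ((size_t)PAGE_SIZE << order) / sizeof(__m128i);

	for (size_t i = 0; i < n; i++)
		_mm_stream_si128(&p[i], zero);
	_mm_sfence();	/* order the non-temporal stores before reuse */
}
```

As Andi notes, this tends to lose for a single hot 4K page and pays off, if at all, only for large batches of cache-cold memory.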
* Re: Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO
  2004-12-22 10:53 ` Andi Kleen
@ 2004-12-22 19:54   ` Christoph Lameter
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-22 19:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, 22 Dec 2004, Andi Kleen wrote:

> It depends. If you plan to do really big zero_page then it
> may be worth experimenting with cache bypassing clears
> (movntq) or even SSE2 16 byte stores (movntdq %xmm..,..)
> and take out the rep ; stosq optimization. I tried it all
> long ago and it wasn't a win for only 4K.
>
> For normal 4K clear_page that's definitely not a win (tested)
> and especially cache bypassing is a loss.

This may be better realized using a zeroing driver then.

> Yours too at least on non Altix no? Can you demonstrate any benefit?
> Where are the numbers?

In the initial discussion, see V1 [0/3].

> I'm sceptical for example that there will be enough higher orders
> to make the batch clearing worthwhile after the system is up for a days.
> Normally memory tends to fragment rather badly in Linux.
> I suspect after some time your approach will just degenerate to be
> the same as Andrea's, even if it should be a win at the beginning (is it?)

I have tried it and the numbers show clearly that this continues to be
a win, although the initial 7-8 fold speed increase degenerates into
3-4 fold over time (single thread performance).

^ permalink raw reply	[flat|nested] 21+ messages in thread
[parent not found: <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>]
* Re: Prezeroing V2 [0/3]: Why and When it works
  [not found] ` <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
@ 2004-12-23 20:27   ` Andi Kleen
  2004-12-23 21:02     ` Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2004-12-23 20:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel

Christoph Lameter <clameter@sgi.com> writes:

> and why other approaches have not worked.
> o Instead of zero_page(p,order) extend clear_page to take second argument
> o Update all architectures to accept second argument for clear_pages

Sorry if there was a miscommunication, but ...

> 1. Aggregating zeroing operations to only apply to pages of higher order,
>    which results in many pages that will later become order 0 to be
>    zeroed in one go. For that purpose the existing clear_page function is
>    extended and made to take an additional argument specifying the order of
>    the page to be cleared.

But if you do that you should really use a separate function that
can use cache bypassing stores. Normal clear_page cannot use that
because it would be a loss when the data is soon used. So the two
changes don't really make sense.

Also I must say I'm still suspicious regarding your heuristic to
trigger gang faulting - with bad luck it could lead to a lot more
memory usage for specific applications that do very sparse usage of
memory. There should be at least an madvise flag to turn it off and a
sysctl, and it would be better to trigger only on a longer sequence of
consecutive faulted pages.

> 2. Hardware support for offloading zeroing from the cpu. This avoids
>    the invalidation of the cpu caches by extensive zeroing operations.
>
> The result is a significant increase of the page fault performance even for
> single threaded applications:
[...]

How about some numbers on i386?

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 20:27 ` Prezeroing V2 [0/3]: Why and When it works Andi Kleen
@ 2004-12-23 21:02   ` Christoph Lameter
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-23 21:02 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thu, 23 Dec 2004, Andi Kleen wrote:

> > 1. Aggregating zeroing operations to only apply to pages of higher order,
> >    which results in many pages that will later become order 0 to be
> >    zeroed in one go. For that purpose the existing clear_page function is
> >    extended and made to take an additional argument specifying the order of
> >    the page to be cleared.
>
> But if you do that you should really use a separate function that
> can use cache bypassing stores.
>
> Normal clear_page cannot use that because it would be a loss
> when the data is soon used.

clear_page is now used both in the cache hot and the no-cache-wanted
case.

> So the two changes don't really make sense.

Which two changes? If an arch can do zeroing without touching the cpu
caches then that can be done with a zero driver.

> Also I must say I'm still suspicious regarding your heuristic
> to trigger gang faulting - with bad luck it could lead to a lot
> more memory usage to specific applications that do very sparse
> usage of memory.

Gang faulting is not part of this patch. Please keep the issues
separate.

> There should be at least an madvise flag to turn it off and a sysctl
> and it would be better to trigger only on a longer sequence of
> consecutive faulted pages.

Again, this is not related to this patchset. Look at V13 of the page
fault scalability patch and you will find a /proc/sys/vm setting to
manipulate things. This is V2 of the prezeroing patch.

> How about some numbers on i386?

Umm. Yeah. I only have smallish i386 machines here. Maybe next year ;-)

^ permalink raw reply	[flat|nested] 21+ messages in thread
[parent not found: <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>]
[parent not found: <41C20E3E.3070209@yahoo.com.au>]
* Increase page fault rate by prezeroing V1 [0/3]: Overview
  [not found] ` <41C20E3E.3070209@yahoo.com.au>
@ 2004-12-21 19:55   ` Christoph Lameter
  2004-12-23 19:29     ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2004-12-21 19:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Luck, Tony, Robin Holt, Adam Litke, linux-ia64, torvalds,
	linux-mm, linux-kernel

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running
on SMP systems with a high number of cpus. Single thread performance
shows only minor increases; only the performance of multi-threaded
applications increases significantly.

The most expensive operation in the page fault handler is (apart from
the SMP locking overhead) the zeroing of the page, which is also done
in the page fault handler. Others have seen this too and have tried to
provide a way to supply zeroed pages to the page fault handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

The problem so far has been that simple zeroing of pages just shifts
the time spent somewhere else. Plus one would not want to zero hot
pages. This patch addresses those issues by making it more effective
to zero pages by:

1. Aggregating zeroing operations to mainly apply to larger order
   pages, which results in many later order 0 pages being zeroed in
   one go. For that purpose a new architecture specific function
   zero_page(page, order) is introduced.

2. Hardware support for offloading zeroing from the cpu. This avoids
   the invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance
even for single threaded applications:

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  1   1    1      0.014s    0.110s   0.012s 524292.194 517665.538

This is a performance increase by a factor of 8!

The performance can only be upheld if enough zeroed pages are
available. In a heavy memory intensive benchmark the system will run
out of these very fast, but the efficient algorithm for page zeroing
still makes this a winner (8 way system with 6 GB RAM, no hardware
zeroing support):

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852
  4   3    2      0.170s   14.909s   7.097s  52150.369  98643.687
  4   3    4      0.181s   16.597s   5.079s  46869.167 135642.420
  4   3    8      0.166s   23.239s   4.037s  33599.215 179791.120

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.183s    2.750s   2.093s 268077.996 267952.890
  4   3    2      0.185s    4.876s   2.097s 155344.562 263967.292
  4   3    4      0.150s    6.617s   2.097s 116205.793 264774.080
  4   3    8      0.186s   13.693s   3.054s  56659.819 221701.073

The patch is composed of 3 parts:

[1/3] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO
	flag and return zeroed memory on request. Modifies locations
	throughout the linux sources that retrieve a page and then
	zero it to request a zeroed page instead. Adds new low level
	zero_page functions for i386, ia64 and x86_64 (x86_64
	untested).

[2/3] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a
	background daemon called scrubd. scrubd is disabled by default
	but can be enabled by writing an order number to
	/proc/sys/vm/scrub_start. If a page of that order is coalesced
	then the scrub daemon will start zeroing until all pages of
	order /proc/sys/vm/scrub_stop and higher are zeroed.

[3/3] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into
	hardware. With hardware support there will be minimal impact
	of zeroing on the performance of the system.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Prezeroing V2 [0/3]: Why and When it works
  2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
@ 2004-12-23 19:29   ` Christoph Lameter
  2004-12-23 19:49     ` Arjan van de Ven
                        ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-23 19:29 UTC (permalink / raw)
  Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

Changes from V1 to V2:
o Add explanation--and some bench results--as to why and when this
  optimization works and why other approaches have not worked.
o Instead of zero_page(p,order) extend clear_page to take a second argument
o Update all architectures to accept the second argument for clear_pages
o Extensive removal of page alloc/clear_page combinations from all archs
o Blank / typo fixups
o SGI BTE zero driver update: Use node specific variables instead of
  cpu specific ones since a cpu may be responsible for multiple nodes.

The patches increasing the page fault rate (introduction of atomic pte
operations and anticipatory prefaulting) do so by reducing the locking
overhead and are therefore mainly of interest for applications running
on SMP systems with a high number of cpus. Single thread performance
shows only minor increases; only the performance of multi-threaded
applications increases significantly.

The most expensive operation in the page fault handler is (apart from
the SMP locking overhead) the zeroing of the page. This zeroing means
that all cachelines of the faulted page (on Altix that means all 128
cachelines of 128 bytes each) must be loaded and later written back.
This patch makes it possible to avoid loading all cachelines if only a
part of the cachelines of that page is needed immediately after the
fault. Thus the patch will only be effective for sparsely accessed
memory, which is typical for anonymous memory and pte maps. Prezeroed
pages will be used for those purposes. Unzeroed pages will be used as
usual for the other purposes.

Others have also thought that prezeroing could be a benefit and have
tried to provide a way to supply zeroed pages to the page fault
handler:

http://marc.theaimsgroup.com/?t=109914559100004&r=1&w=2
http://marc.theaimsgroup.com/?t=109777267500005&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=104931944213955&w=2

However, these attempts have tried to zero pages soon to be accessed
(and which may already have recently been accessed). Elements of these
pages are thus already in the cache. Approaches like that will only
shift processing a bit and not yield performance benefits. Prezeroing
only makes sense for pages that are not currently needed and that are
not in the cpu caches. Pages that have recently been touched and that
soon will be touched again are better hot zeroed, since the zeroing
will largely be done to cachelines already in the cpu caches.

The patch makes prezeroing very effective by:

1. Aggregating zeroing operations to only apply to pages of higher
   order, which results in many pages that will later become order 0
   being zeroed in one go. For that purpose the existing clear_page
   function is extended and made to take an additional argument
   specifying the order of the page to be cleared.

2. Hardware support for offloading zeroing from the cpu. This avoids
   the invalidation of the cpu caches by extensive zeroing operations.

The result is a significant increase of the page fault performance
even for single threaded applications:

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  1   1    1      0.014s    0.110s   0.012s 524292.194 517665.538

The performance can only be upheld if enough zeroed pages are
available. In heavy memory intensive benchmarks the system could
potentially run out of zeroed pages, but the efficient algorithm for
page zeroing still shows this to be a winner (8 way system with 6 GB
RAM, no hardware zeroing support):

w/o patch:
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.146s   11.155s  11.030s  69584.896  69566.852
  4   3    2      0.170s   14.909s   7.097s  52150.369  98643.687
  4   3    4      0.181s   16.597s   5.079s  46869.167 135642.420
  4   3    8      0.166s   23.239s   4.037s  33599.215 179791.120

w/patch
 Gb Rep Threads   User      System   Wall   flt/cpu/s fault/wsec
  4   3    1      0.183s    2.750s   2.093s 268077.996 267952.890
  4   3    2      0.185s    4.876s   2.097s 155344.562 263967.292
  4   3    4      0.150s    6.617s   2.097s 116205.793 264774.080
  4   3    8      0.186s   13.693s   3.054s  56659.819 221701.073

Note that zeroing of pages makes no sense if the application touches
all cache lines of an allocated page (for that reason prezeroing has
no influence on benchmarks like lmbench), since the extensive caching
of modern cpus means that the zeroes written to a hot zeroed page will
then be overwritten by the application in the cpu cache, and thus the
zeros will never make it to memory! The test program used above
touches only one 128 byte cache line of a 16k page (ia64). Here is
another test in order to gauge the influence of the number of cache
lines touched on the performance of the prezero enhancements:

 Gb Rep Thr CLine  User     System   Wall   flt/cpu/s fault/wsec
  1   1   1    1   0.01s    0.12s   0.01s  500813.853 497925.891
  1   1   1    2   0.01s    0.11s   0.01s  493453.103 472877.725
  1   1   1    4   0.02s    0.10s   0.01s  479351.658 471507.415
  1   1   1    8   0.01s    0.13s   0.01s  424742.054 416725.013
  1   1   1   16   0.05s    0.12s   0.01s  347715.359 336983.834
  1   1   1   32   0.12s    0.13s   0.02s  258112.286 256246.731
  1   1   1   64   0.24s    0.14s   0.03s  169896.381 168189.283
  1   1   1  128   0.49s    0.14s   0.06s  102300.257 101674.435

The benefits of prezeroing become smaller the more cache lines of a
page are touched. Prezeroing can only be effective if memory is not
immediately touched after the anonymous page fault.

The patch is composed of 4 parts:

[1/4] Introduce __GFP_ZERO
	Modifies the page allocator to be able to take the __GFP_ZERO
	flag and return zeroed memory on request. Modifies locations
	throughout the linux sources that retrieve a page and then
	zero it to request a zeroed page instead.

[2/4] Architecture specific clear_page updates
	Adds a second, order argument to clear_page and updates all
	arches. Note: The first two patches may be used alone if no
	zeroing engine is wanted.

[3/4] Page Zeroing
	Adds management of ZEROED and NOT_ZEROED pages and a
	background daemon called scrubd. scrubd is disabled by default
	but can be enabled by writing an order number to
	/proc/sys/vm/scrub_start. If a page of that order or higher is
	coalesced then the scrub daemon will start zeroing until all
	pages of order /proc/sys/vm/scrub_stop and higher are zeroed,
	and then go back to sleep.

	In an SMP environment the scrub daemon typically runs on the
	most idle cpu. Thus a single threaded application running on
	one cpu may have the other cpu zeroing pages for it etc. The
	scrub daemon is hardly noticeable and usually finishes zeroing
	quickly since most processors are optimized for linear memory
	filling.

[4/4] SGI Altix Block Transfer Engine Support
	Implements a driver to shift the zeroing off the cpu into
	hardware. With hardware support there will be minimal impact
	of zeroing on the performance of the system.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
@ 2004-12-23 19:49   ` Arjan van de Ven
  2004-12-23 20:57     ` Matt Mackall
                        ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Arjan van de Ven @ 2004-12-23 19:49 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.

eh why will all cachelines be loaded? Surely you can avoid the
write-allocate behavior for this case.....

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:49 ` Arjan van de Ven
@ 2004-12-23 20:57 ` Matt Mackall
  2004-12-23 21:01 ` Paul Mackerras
  2004-12-23 21:11 ` Paul Mackerras
  3 siblings, 0 replies; 21+ messages in thread
From: Matt Mackall @ 2004-12-23 20:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-ia64, torvalds, linux-mm, linux-kernel

On Thu, Dec 23, 2004 at 11:29:10AM -0800, Christoph Lameter wrote:
> 2. Hardware support for offloading zeroing from the cpu. This avoids
>    the invalidation of the cpu caches by extensive zeroing operations.

I'm wondering if it would be possible to use typical video cards for
hardware zeroing. We could set aside a page's worth of zeros in video
memory and then use the card's DMA engines to clear pages on the host.
This could be done in fbdev drivers, which would register a zeroer
with the core.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
  2004-12-23 19:49 ` Arjan van de Ven
  2004-12-23 20:57 ` Matt Mackall
@ 2004-12-23 21:01 ` Paul Mackerras
  2004-12-23 21:11 ` Paul Mackerras
  3 siblings, 0 replies; 21+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:01 UTC (permalink / raw)
  To: Christoph Lameter

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page. This zeroing means that all
> cachelines of the faulted page (on Altix that means all 128 cachelines of
> 128 byte each) must be loaded and later written back. This patch allows to
> avoid having to load all cachelines if only a part of the cachelines of
> that page is needed immediately after the fault.

On ppc64 we avoid having to zero newly-allocated page table pages by
using a slab cache for them, with a constructor function that zeroes
them. Page table pages naturally end up being full of zeroes when
they are freed, since ptep_get_and_clear, pmd_clear or pgd_clear has
been used on every non-zero entry by that stage. Thus there is no
extra work required either when allocating them or freeing them.

I don't see any point in your patches for systems which don't have
some magic hardware for zeroing pages. Your patch seems like a lot of
extra code that only benefits a very small number of machines.

Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
                    ` (2 preceding siblings ...)
  2004-12-23 21:01 ` Paul Mackerras
@ 2004-12-23 21:11 ` Paul Mackerras
  2004-12-23 21:37   ` Andrew Morton
  2004-12-23 21:48   ` Linus Torvalds
  1 sibling, 2 replies; 21+ messages in thread
From: Paul Mackerras @ 2004-12-23 21:11 UTC (permalink / raw)
  To: Christoph Lameter

Christoph Lameter writes:

> The most expensive operation in the page fault handler is (apart of SMP
> locking overhead) the zeroing of the page.

Re-reading this I see that you mean the zeroing of the page that is
mapped into the process address space, not the page table pages. So
ignore my previous reply.

Do you have any statistics on how often a page fault needs to supply a
page of zeroes versus supplying a copy of an existing page, for real
applications?

In any case, unless you have magic page-zeroing hardware, I am still
inclined to think that zeroing the page at the time of the fault is
the most efficient, since that means the page will be hot in the cache
for the process to use. If you zero it earlier using CPU stores, it
can only cause more overall memory traffic, as far as I can see.

I did some measurements once on my G5 powermac (running a ppc64 linux
kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
page. This is real-life elapsed time in the kernel, not just some
cache-hot benchmark measurement. Thus I don't think your patch will
gain us anything on ppc64.

Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11 ` Paul Mackerras
@ 2004-12-23 21:37   ` Andrew Morton
  2004-12-23 23:00     ` Paul Mackerras
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2004-12-23 21:37 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> Christoph Lameter writes:
>
> > The most expensive operation in the page fault handler is (apart of SMP
> > locking overhead) the zeroing of the page.
>
> Re-reading this I see that you mean the zeroing of the page that is
> mapped into the process address space, not the page table pages. So
> ignore my previous reply.
>
> Do you have any statistics on how often a page fault needs to supply a
> page of zeroes versus supplying a copy of an existing page, for real
> applications?

When the workload is a gcc run, the pagefault handler dominates the
system time. That's the page zeroing.

> In any case, unless you have magic page-zeroing hardware, I am still
> inclined to think that zeroing the page at the time of the fault is
> the most efficient, since that means the page will be hot in the cache
> for the process to use. If you zero it earlier using CPU stores, it
> can only cause more overall memory traffic, as far as I can see.

x86's movnta instructions provide a way of initialising memory without
trashing the caches and it has pretty good bandwidth, I believe. We
should wire that up to these patches and see if it speeds things up.

> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.

40GB/s. Is that straight into L1 or does the measurement include
writeback?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:37 ` Andrew Morton
@ 2004-12-23 23:00   ` Paul Mackerras
  0 siblings, 0 replies; 21+ messages in thread
From: Paul Mackerras @ 2004-12-23 23:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: clameter, linux-ia64, torvalds, linux-mm, linux-kernel

Andrew Morton writes:

> When the workload is a gcc run, the pagefault handler dominates the system
> time. That's the page zeroing.

For a program which uses a lot of heap and doesn't fork, that sounds
reasonable.

> x86's movnta instructions provide a way of initialising memory without
> trashing the caches and it has pretty good bandwidth, I believe. We should
> wire that up to these patches and see if it speeds things up.

Yes. I don't know the movnta instruction, but surely, whatever scheme
is used, there has to be a snoop for every cache line's worth of
memory that is zeroed.

The other point is that having the page hot in the cache may well be a
benefit to the program. Using any sort of cache-bypassing zeroing
might not actually make things faster, when the user time as well as
the system time is taken into account.

> > I did some measurements once on my G5 powermac (running a ppc64 linux
> > kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> > page.
>
> 40GB/s. Is that straight into L1 or does the measurement include writeback?

It is the average elapsed time in clear_page, so it would include the
writeback of any cache lines displaced by the zeroing, but not the
writeback of the newly-zeroed cache lines (which we hope will be
modified by the program before they get written back anyway). This is
using the dcbz (data cache block zero) instruction, which establishes
a cache line in modified state with zero contents without any memory
traffic other than a cache line kill transaction sent to the other
CPUs and possible writeback of a dirty cache line displaced by the
newly-zeroed cache line.

The new cache line is established in the L2 cache, because the L1 is
write-through on the G5, and all stores and dcbz instructions have to
go to the L2 cache. Thus, on the G5 (and POWER4, which is similar) I
don't think there will be much if any benefit from having pre-zeroed
cache-cold pages. We can establish the zero lines in cache much
faster using dcbz than we can by reading them in from main memory. If
the program uses only a few cache lines out of each new page, then
reading them from memory might be faster, but that seems unlikely.

Paul.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:11 ` Paul Mackerras
  2004-12-23 21:37 ` Andrew Morton
@ 2004-12-23 21:48 ` Linus Torvalds
  2004-12-23 22:34 ` Zwane Mwaikambo
  ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-12-23 21:48 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Christoph Lameter, Andrew Morton, linux-ia64, torvalds, linux-mm,
	Kernel Mailing List

On Fri, 24 Dec 2004, Paul Mackerras wrote:
>
> I did some measurements once on my G5 powermac (running a ppc64 linux
> kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> page.  This is real-life elapsed time in the kernel, not just some
> cache-hot benchmark measurement.  Thus I don't think your patch will
> gain us anything on ppc64.

Well, the thing is, if we really _know_ the machine is idle (and not
just waiting for something like disk IO), it might be a good idea to
just pre-zero everything we can.

The question to me is whether we can have a good enough heuristic to
notice that it triggers often enough to matter, but seldom enough that
it really won't disturb anybody.  And "disturb" very much includes
things like laptop battery life, scheduling latencies, memory bus
traffic _and_ cache contents.

And I really don't see a very good heuristic.  Maybe it might literally
be something like "five-second load average goes down to zero" (we've
got fixed-point arithmetic with eleven fractional bits, so we can tune
just how close to "zero" we want to get).  The load average is
system-wide and takes disk load (which tends to imply latency-critical
work) into account, so that might actually work out reasonably well as
a "the system really is quiescent" test.

So if we make the "what load is considered low" tunable, a system
administrator can use that to make it more aggressive.  And indeed, you
might have a cron-job that says "be more aggressive at clearing pages
between 2AM and 4AM in the morning" or something - if you have so much
memory that it actually matters if you clear the memory just
occasionally.

And the tunable load-average check has another advantage: if you want
to benchmark it, you can first set it to true zero (basically never),
and run the benchmark, and then you can set it to something very
aggressive ("clear pages every five seconds regardless of load") and
re-run.

Does this sound sane?  Christoph - can you try making the "scrub
daemon" do that?  Instead of the "scrub-low" and "scrub-high" (or in
_addition_ to them), do a "scrub-load" thing that takes a scaled
integer, and compares it with "avenrun[0]" in kernel/timer.c:
calc_load() when the average is updated every five seconds.

Personally, at least for desktop usage, I think that the load average
would work wonderfully well.  I know my machines are often at basically
zero load, and then having low-latency zero-pages when I sit down
sounds like a good idea.  Whether there is _enough_ free memory around
for a 5-second thing to work out well, I have no idea..

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
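The load check Linus describes can be sketched in a few lines. FSHIFT and FIXED_1 below match the kernel's fixed-point load-average representation (eleven fractional bits, so a load of 1.0 is stored as 2048); the threshold variable and helper function are invented for this sketch.

```c
/* Fixed-point load-average check, modeled on kernel/timer.c.
 * avenrun[0] is updated every five seconds in calc_load(). */
#define FSHIFT   11                 /* eleven fractional bits */
#define FIXED_1  (1 << FSHIFT)      /* a load of 1.0 == 2048 */

/* Tunable scaled integer, as Linus suggests: scrub only when the
 * load average is below roughly 0.1 (204/2048). */
static unsigned long scrub_load = FIXED_1 / 10;

/* Would be called from the same place calc_load() refreshes avenrun[]. */
static int scrub_should_run(unsigned long avenrun0)
{
	return avenrun0 < scrub_load;
}
```

Setting scrub_load to 0 never scrubs, and a very large value scrubs on every five-second update regardless of load, which are exactly the two benchmark endpoints described above.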
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48 ` Linus Torvalds
@ 2004-12-23 22:34 ` Zwane Mwaikambo
  2004-12-24  9:14 ` Arjan van de Ven
  2004-12-24 16:17 ` Christoph Lameter
  2 siblings, 0 replies; 21+ messages in thread
From: Zwane Mwaikambo @ 2004-12-23 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

Isn't the basic premise very similar to the following paper:

http://www.usenix.org/publications/library/proceedings/osdi99/full_papers/dougan/dougan_html/dougan.html

In fact I thought ppc32 did something akin to this.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48 ` Linus Torvalds
  2004-12-23 22:34 ` Zwane Mwaikambo
@ 2004-12-24  9:14 ` Arjan van de Ven
  2004-12-24 18:21 ` Linus Torvalds
  2004-12-24 16:17 ` Christoph Lameter
  2 siblings, 1 reply; 21+ messages in thread
From: Arjan van de Ven @ 2004-12-24  9:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

Problem is: will it buy you anything if you use the page again anyway,
since such pages will be cold cached now?  So for sure some of it is
only shifting latency from the kernel side to the userspace side, but
readprofile doesn't measure the latter so it *looks* better...

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24  9:14 ` Arjan van de Ven
@ 2004-12-24 18:21 ` Linus Torvalds
  2004-12-24 18:57 ` Arjan van de Ven
  2004-12-27 22:50 ` David S. Miller
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-12-24 18:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Fri, 24 Dec 2004, Arjan van de Ven wrote:
>
> problem is.. will it buy you anything if you use the page again
> anyway... since such pages will be cold cached now. So for sure some of
> it is only shifting latency from kernel side to userspace side, but
> readprofile doesn't measure the latter so it *looks* better...

Absolutely.  I would want to see some real benchmarks before we do
this.  Not just some microbenchmark of "how many page faults can we
take without _using_ the page at all".

I agree 100% with you that we shouldn't shift the costs around.  Having
a nice hot-spot that we know about is a good thing, and it means that
performance profiles show what the time is really spent on.  Often,
getting rid of the hotspot just smears out the work over a wider area,
making other optimizations (like trying to make the memory footprint
_smaller_ and removing the work entirely that way) totally impossible,
because now the performance profile just has a constant background
noise and you can't tell what the real problem is.

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-24 18:21 ` Linus Torvalds
@ 2004-12-24 18:57 ` Arjan van de Ven
  2004-12-27 22:50 ` David S. Miller
  1 sibling, 0 replies; 21+ messages in thread
From: Arjan van de Ven @ 2004-12-24 18:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Christoph Lameter, Andrew Morton, linux-ia64,
	linux-mm, Kernel Mailing List

On Fri, 2004-12-24 at 10:21 -0800, Linus Torvalds wrote:
> On Fri, 24 Dec 2004, Arjan van de Ven wrote:
> >
> > problem is.. will it buy you anything if you use the page again
> > anyway... since such pages will be cold cached now. So for sure some of
> > it is only shifting latency from kernel side to userspace side, but
> > readprofile doesn't measure the latter so it *looks* better...
>
> Absolutely. I would want to see some real benchmarks before we do this.
> Not just some microbenchmark of "how many page faults can we take without
> _using_ the page at all".
>
> I agree 100% with you that we shouldn't shift the costs around. Having a
> nice hot-spot that we know about is a good thing, and it means that
> performance profiles show what the time is really spent on. Often getting
> rid of the hotspot just smears out the work over a wider area, making
> other optimizations (like trying to make the memory footprint _smaller_
> and removing the work entirely that way) totally impossible because now
> the performance profile just has a constant background noise and you can't
> tell what the real problem is.

I suspect it's even worse.  Think about it: you can spew 4k of zeroes
into your L1 cache really fast (assuming your cpu is smart enough to
avoid write-allocate for rep stosl; not sure which cpus are).  I
suspect you can do that faster than a cache miss or two.  And at that
point the page is cache hot, so reads don't miss either.

All this makes me wonder if there is any scenario where this thing will
be a gain, other than on cpus that aren't smart enough to avoid the
write-allocate.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works 2004-12-24 18:21 ` Linus Torvalds 2004-12-24 18:57 ` Arjan van de Ven @ 2004-12-27 22:50 ` David S. Miller 2004-12-28 11:53 ` Marcelo Tosatti 1 sibling, 1 reply; 21+ messages in thread From: David S. Miller @ 2004-12-27 22:50 UTC (permalink / raw) To: Linus Torvalds Cc: arjan, paulus, clameter, akpm, linux-ia64, linux-mm, linux-kernel On Fri, 24 Dec 2004 10:21:24 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > Absolutely. I would want to see some real benchmarks before we do this. > Not just some microbenchmark of "how many page faults can we take without > _using_ the page at all". Here's my small contribution. I did three "make -j3 vmlinux" timed runs, one running a kernel without the pre-zeroing stuff applied, one with it applied. It did shave a few seconds off the build consistently. Here is the before: real 8m35.248s user 15m54.132s sys 1m1.098s real 8m32.202s user 15m54.329s sys 1m0.229s real 8m31.932s user 15m54.160s sys 1m0.245s and here is the after: real 8m29.375s user 15m43.296s sys 0m59.549s real 8m28.213s user 15m39.819s sys 0m58.790s real 8m26.140s user 15m44.145s sys 0m58.872s ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-27 22:50 ` David S. Miller
@ 2004-12-28 11:53 ` Marcelo Tosatti
  0 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2004-12-28 11:53 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linus Torvalds, arjan, paulus, clameter, akpm, linux-ia64,
	linux-mm, linux-kernel

On Mon, Dec 27, 2004 at 02:50:57PM -0800, David S. Miller wrote:
> On Fri, 24 Dec 2004 10:21:24 -0800 (PST)
> Linus Torvalds <torvalds@osdl.org> wrote:
>
> > Absolutely. I would want to see some real benchmarks before we do this.
> > Not just some microbenchmark of "how many page faults can we take without
> > _using_ the page at all".
>
> Here's my small contribution.  I did three "make -j3 vmlinux" timed
> runs, one running a kernel without the pre-zeroing stuff applied,
> one with it applied.  It did shave a few seconds off the build
> consistently.  Here is the before:
>
> real    8m35.248s
> user    15m54.132s
> sys     1m1.098s
>
> real    8m32.202s
> user    15m54.329s
> sys     1m0.229s
>
> real    8m31.932s
> user    15m54.160s
> sys     1m0.245s
>
> and here is the after:
>
> real    8m29.375s
> user    15m43.296s
> sys     0m59.549s
>
> real    8m28.213s
> user    15m39.819s
> sys     0m58.790s
>
> real    8m26.140s
> user    15m44.145s
> sys     0m58.872s

Christoph and other SGI fellows,

Get your patch into STP; once it's there we can do some wider x86
benchmarking easily.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Prezeroing V2 [0/3]: Why and When it works
  2004-12-23 21:48 ` Linus Torvalds
  2004-12-23 22:34 ` Zwane Mwaikambo
  2004-12-24  9:14 ` Arjan van de Ven
@ 2004-12-24 16:17 ` Christoph Lameter
  2 siblings, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2004-12-24 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Andrew Morton, linux-ia64, linux-mm,
	Kernel Mailing List

On Thu, 23 Dec 2004, Linus Torvalds wrote:

> So if we make the "what load is considered low" tunable, a system
> administrator can use that to make it more aggressive. And indeed, you
> might have a cron-job that says "be more aggressive at clearing pages
> between 2AM and 4AM in the morning" or something - if you have so much
> memory that it actually matters if you clear the memory just occasionally.
>
> And the tunable load-average check has another advantage: if you want to
> benchmark it, you can first set it to true zero (basically never), and run
> the benchmark, and then you can set it to something very aggressive ("clear
> pages every five seconds regardless of load") and re-run.
>
> Does this sound sane? Christoph - can you try making the "scrub daemon" do
> that? Instead of the "scrub-low" and "scrub-high" (or in _addition_ to
> them), do a "scrub-load" thing that takes a scaled integer, and compares it
> with "avenrun[0]" in kernel/timer.c: calc_load() when the average is
> updated every five seconds..

Sure, V3 will have that.  So far the impact of zeroing is quite minimal
on IA64 (even without using hardware); the big zeroing happens
immediately after activating it anyway.  I have not seen any measurable
effect on benchmarks, even with 4G allocations on a 6G machine.

> Personally, at least for a desktop usage, I think that the load average
> would work wonderfully well. I know my machines are often at basically
> zero load, and then having low-latency zero-pages when I sit down sounds
> like a good idea. Whether there is _enough_ free memory around for a
> 5-second thing to work out well, I have no idea..

The CPU can do a couple of gigabytes of zeroing per second per CPU, and
the zeroing zeros local RAM.  On my 6G machine with 8 CPUs it can only
take a fraction of a second to zero all RAM.

Merry Christmas, I am off till next year.  SGI mandatory holiday
shutdown, so all addicts have to go cold turkey ;-)

^ permalink raw reply	[flat|nested] 21+ messages in thread
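Christoph's "fraction of a second" figure checks out as a back-of-the-envelope calculation. The 2 GB/s per-CPU rate below is this sketch's assumption, standing in for his rough "couple of gigs per second" estimate.

```c
/* Time to zero all RAM when each CPU zeroes its own local memory in
 * parallel.  The per-CPU bandwidth is an assumed round number, not a
 * measurement. */
static double seconds_to_zero(double ram_gb, int cpus, double gb_per_s)
{
	return ram_gb / (cpus * gb_per_s);
}
/* 6 GB across 8 CPUs at 2 GB/s each: 6 / 16 = 0.375 s. */
```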
end of thread, other threads:[~2004-12-28 14:30 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com.suse.lists.linux.kernel>
     [not found] ` <41C20E3E.3070209@yahoo.com.au.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.58.0412211154100.1313@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.58.0412211155340.1313@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
2004-12-21 22:40 ` Increase page fault rate by prezeroing V1 [1/3]: Introduce __GFP_ZERO Andi Kleen
2004-12-21 22:54 ` Christoph Lameter
2004-12-22 10:53 ` Andi Kleen
2004-12-22 19:54 ` Christoph Lameter
     [not found] ` <Pine.LNX.4.58.0412231119540.31791@schroedinger.engr.sgi.com.suse.lists.linux.kernel>
2004-12-23 20:27 ` Prezeroing V2 [0/3]: Why and When it works Andi Kleen
2004-12-23 21:02 ` Christoph Lameter
     [not found] <B8E391BBE9FE384DAA4C5C003888BE6F02900FBD@scsmsx401.amr.corp.intel.com>
     [not found] ` <41C20E3E.3070209@yahoo.com.au>
2004-12-21 19:55 ` Increase page fault rate by prezeroing V1 [0/3]: Overview Christoph Lameter
2004-12-23 19:29 ` Prezeroing V2 [0/3]: Why and When it works Christoph Lameter
2004-12-23 19:49 ` Arjan van de Ven
2004-12-23 20:57 ` Matt Mackall
2004-12-23 21:01 ` Paul Mackerras
2004-12-23 21:11 ` Paul Mackerras
2004-12-23 21:37 ` Andrew Morton
2004-12-23 23:00 ` Paul Mackerras
2004-12-23 21:48 ` Linus Torvalds
2004-12-23 22:34 ` Zwane Mwaikambo
2004-12-24  9:14 ` Arjan van de Ven
2004-12-24 18:21 ` Linus Torvalds
2004-12-24 18:57 ` Arjan van de Ven
2004-12-27 22:50 ` David S. Miller
2004-12-28 11:53 ` Marcelo Tosatti
2004-12-24 16:17 ` Christoph Lameter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).