linux-kernel.vger.kernel.org archive mirror
* Performance regression in write() syscall
@ 2009-02-24  2:03 Salman Qazi
  2009-02-24  4:10 ` Nick Piggin
  2009-02-24 10:09 ` Andi Kleen
  0 siblings, 2 replies; 50+ messages in thread
From: Salman Qazi @ 2009-02-24  2:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Andi Kleen

While the introduction of __copy_from_user_nocache (see commit:
0812a579c92fefa57506821fa08e90f47cb6dbdd) may have been an improvement
for sufficiently large writes, there is evidence to show that it is
detrimental for small writes.  Unixbench's fstime test gives the
following results for 256 byte writes with MAX_BLOCK of 2000:
    
    2.6.29-rc6 ( 5 samples, each in KB/sec ):
    283750, 295200, 294500, 293000, 293300
    
    2.6.29-rc6 + this patch (5 samples, each in KB/sec):
    313050, 3106750, 293350, 306300, 307900

    2.6.18
    395700, 342000, 399100, 366050, 359850

    See w_test() in src/fstime.c in unixbench version 4.1.0.  Basically, the above test
    consists of counting how much we can write in this manner:

    alarm(10);
    while (!sigalarm) {
            for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
                   write(f, buf, 256);
            }
            lseek(f, 0L, 0);
    }
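
    As a rough illustration, that loop can be wrapped into a self-contained
    harness along these lines (hypothetical harness; the real w_test() in
    unixbench's src/fstime.c differs in its setup and bookkeeping):

```c
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t sigalarm;

static void on_alarm(int sig)
{
	(void)sig;
	sigalarm = 1;
}

/*
 * Repeat 2000-block passes of 256-byte write()s until the alarm
 * fires; returns the total number of bytes written.
 */
static long long run_w_test(int fd, unsigned int seconds)
{
	char buf[256] = { 0 };
	long long total = 0;
	int f_blocks;

	sigalarm = 0;
	signal(SIGALRM, on_alarm);
	alarm(seconds);
	while (!sigalarm) {
		for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
			ssize_t ret = write(fd, buf, sizeof(buf));

			if (ret < 0)	/* e.g. interrupted by the alarm */
				break;
			total += ret;
		}
		lseek(fd, 0L, SEEK_SET);
	}
	alarm(0);
	return total;
}
```

    Dividing the returned byte count by the elapsed time gives KB/sec
    figures comparable to the ones quoted above.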

I realised that there are other components to the write syscall regression
that are not addressed here.  I will send another email shortly stating the
source of another one.

Signed-off-by: Salman Qazi <sqazi@google.com>
---
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 84210c4..efe7315 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -192,14 +192,20 @@ static inline int __copy_from_user_nocache(void *dst, const void __user *src,
 					   unsigned size)
 {
 	might_sleep();
-	return __copy_user_nocache(dst, src, size, 1);
+	if (likely(size >= PAGE_SIZE))
+		return __copy_user_nocache(dst, src, size, 1);
+	else
+		return __copy_from_user(dst, src, size);
 }
 
 static inline int __copy_from_user_inatomic_nocache(void *dst,
 						    const void __user *src,
 						    unsigned size)
 {
-	return __copy_user_nocache(dst, src, size, 0);
+	if (likely(size >= PAGE_SIZE))
+		return __copy_user_nocache(dst, src, size, 0);
+	else
+		return __copy_from_user_inatomic(dst, src, size);
 }
 
 unsigned long

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: Performance regression in write() syscall
  2009-02-24  2:03 Performance regression in write() syscall Salman Qazi
@ 2009-02-24  4:10 ` Nick Piggin
  2009-02-24  4:28   ` Linus Torvalds
  2009-02-24  5:43   ` Performance regression in write() syscall Salman Qazi
  2009-02-24 10:09 ` Andi Kleen
  1 sibling, 2 replies; 50+ messages in thread
From: Nick Piggin @ 2009-02-24  4:10 UTC (permalink / raw)
  To: Salman Qazi, Linus Torvalds, davem
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Andi Kleen

added some ccs

On Tuesday 24 February 2009 13:03:04 Salman Qazi wrote:
> While the introduction of __copy_from_user_nocache (see commit:
> 0812a579c92fefa57506821fa08e90f47cb6dbdd) may have been an improvement
> for sufficiently large writes, there is evidence to show that it is
> detrimental for small writes.  Unixbench's fstime test gives the
> following results for 256 byte writes with MAX_BLOCK of 2000:
>
>     2.6.29-rc6 ( 5 samples, each in KB/sec ):
>     283750, 295200, 294500, 293000, 293300
>
>     2.6.29-rc6 + this patch (5 samples, each in KB/sec):
>     313050, 3106750, 293350, 306300, 307900
>
>     2.6.18
>     395700, 342000, 399100, 366050, 359850

What does unixbench's fstime test do? If it is just writing to the
pagecache, then this would be unexpected. If it is reading and writing,
then perhaps this could be a problem, but how realistic is it for a
performance critical application to read data out of the pagecache that
it has recently written? Do you have something at google actually doing
real work that speeds up with this patch?


>     See w_test() in src/fstime.c in unixbench version 4.1.0.  Basically,
> the above test consists of counting how much we can write in this manner:
>
>     alarm(10);
>     while (!sigalarm) {
>             for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
>                    write(f, buf, 256);
>             }
>             lseek(f, 0L, 0);
>     }
>
> I realised that there are other components to the write syscall regression
> that are not addressed here.  I will send another email shortly stating the
> source of another one.
>
> Signed-off-by: Salman Qazi <sqazi@google.com>
> ---
> diff --git a/arch/x86/include/asm/uaccess_64.h
> b/arch/x86/include/asm/uaccess_64.h index 84210c4..efe7315 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -192,14 +192,20 @@ static inline int __copy_from_user_nocache(void *dst,
> const void __user *src, unsigned size)
>  {
>  	might_sleep();
> -	return __copy_user_nocache(dst, src, size, 1);
> +	if (likely(size >= PAGE_SIZE))
> +		return __copy_user_nocache(dst, src, size, 1);
> +	else
> +		return __copy_from_user(dst, src, size);
>  }
>
>  static inline int __copy_from_user_inatomic_nocache(void *dst,
>  						    const void __user *src,
>  						    unsigned size)
>  {
> -	return __copy_user_nocache(dst, src, size, 0);
> +	if (likely(size >= PAGE_SIZE))
> +		return __copy_user_nocache(dst, src, size, 0);
> +	else
> +		return __copy_from_user_inatomic(dst, src, size);
>  }
>
>  unsigned long




* Re: Performance regression in write() syscall
  2009-02-24  4:10 ` Nick Piggin
@ 2009-02-24  4:28   ` Linus Torvalds
  2009-02-24  9:02     ` Nick Piggin
  2009-02-24  5:43   ` Performance regression in write() syscall Salman Qazi
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2009-02-24  4:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Salman Qazi, davem, linux-kernel, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen



On Tue, 24 Feb 2009, Nick Piggin wrote:
> 
> What does unixbench's fstime test do? If it is just writing to the
> pagecache, then this would be unexpected.

Hmm. Not necessarily. Even just plain writes may be slowed down, since the 
nontemporal loads and stores are generally slower than the normal case. So 
it does make some kind of sense to try to avoid the noncached versions for 
small writes - because small writes tend to be for temp-files.

I don't know if PAGE_SIZE is the right thing to test, and I also don't 
know if this is necessarily the best place to test it in, but I don't 
think it's necessarily wrong to do something like this.

In fact, I think we might also just check alignment. Doing the nontemporal 
crud makes little sense for a non-8-byte-aligned destination, since it 
will have to do the alignment stores cached anyway - and mixing them 
across a cacheline is just crazy.

(The test program doesn't seem to be testing that particular case)

			Linus
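
A sketch of the combined size-plus-alignment gate Linus describes
(hypothetical helper name and threshold; nothing of this exact form was
committed here):

```c
#include <stddef.h>
#include <stdint.h>

#define NOCACHE_MIN_SIZE 4096	/* PAGE_SIZE on x86-64 */

/*
 * Heuristic: only use non-temporal stores when the copy is at least a
 * page and the destination is 8-byte aligned -- a misaligned head or
 * tail has to be done with cached stores anyway, and mixing cached and
 * non-temporal stores within a cacheline defeats the point.
 */
static inline int use_nocache_copy(const void *dst, size_t size)
{
	return size >= NOCACHE_MIN_SIZE && (((uintptr_t)dst & 7) == 0);
}
```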


* Re: Performance regression in write() syscall
  2009-02-24  4:10 ` Nick Piggin
  2009-02-24  4:28   ` Linus Torvalds
@ 2009-02-24  5:43   ` Salman Qazi
  1 sibling, 0 replies; 50+ messages in thread
From: Salman Qazi @ 2009-02-24  5:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, davem, linux-kernel, Ingo Molnar,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Mon, Feb 23, 2009 at 8:10 PM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> added some ccs
>
> On Tuesday 24 February 2009 13:03:04 Salman Qazi wrote:
> > While the introduction of __copy_from_user_nocache (see commit:
> > 0812a579c92fefa57506821fa08e90f47cb6dbdd) may have been an improvement
> > for sufficiently large writes, there is evidence to show that it is
> > detrimental for small writes.  Unixbench's fstime test gives the
> > following results for 256 byte writes with MAX_BLOCK of 2000:
> >
> >     2.6.29-rc6 ( 5 samples, each in KB/sec ):
> >     283750, 295200, 294500, 293000, 293300
> >
> >     2.6.29-rc6 + this patch (5 samples, each in KB/sec):
> >     313050, 3106750, 293350, 306300, 307900
> >
> >     2.6.18
> >     395700, 342000, 399100, 366050, 359850
>
> What does unixbench's fstime test do? If it is just writing to the
> pagecache, then this would be unexpected. If it is reading and writing,
> then perhaps this could be a problem, but how realistic is it for a
> performance critical application to read data out of the pagecache that
> it has recently written? Do you have something at google actually doing
> real work that speeds up with this patch?

It has 3 parts, each producing a number corresponding to write, read
and copy.  The first one only does writes and lseeks.  This produces
the numbers that I have provided.  We are actually not sure at this
point whether this slows down any of our real applications.  We noticed
that Unixbench fstime, which is part of our automated testing, was
slower, and upon investigation this was one of the causes.  We will be
forthcoming with at least one other regression in this exact usage of
the write() system call shortly (I am gathering relevant numbers for
upstream kernels now).  In both cases, the regression was noticed for
sub-page writes.

>
>
> >     See w_test() in src/fstime.c in unixbench version 4.1.0.  Basically,
> > the above test consists of counting how much we can write in this manner:
> >
> >     alarm(10);
> >     while (!sigalarm) {
> >             for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
> >                    write(f, buf, 256);
> >             }
> >             lseek(f, 0L, 0);
> >     }
> >
> > I realised that there are other components to the write syscall regression
> > that are not addressed here.  I will send another email shortly stating the
> > source of another one.
> >
> > Signed-off-by: Salman Qazi <sqazi@google.com>
> > ---
> > diff --git a/arch/x86/include/asm/uaccess_64.h
> > b/arch/x86/include/asm/uaccess_64.h index 84210c4..efe7315 100644
> > --- a/arch/x86/include/asm/uaccess_64.h
> > +++ b/arch/x86/include/asm/uaccess_64.h
> > @@ -192,14 +192,20 @@ static inline int __copy_from_user_nocache(void *dst,
> > const void __user *src, unsigned size)
> >  {
> >       might_sleep();
> > -     return __copy_user_nocache(dst, src, size, 1);
> > +     if (likely(size >= PAGE_SIZE))
> > +             return __copy_user_nocache(dst, src, size, 1);
> > +     else
> > +             return __copy_from_user(dst, src, size);
> >  }
> >
> >  static inline int __copy_from_user_inatomic_nocache(void *dst,
> >                                                   const void __user *src,
> >                                                   unsigned size)
> >  {
> > -     return __copy_user_nocache(dst, src, size, 0);
> > +     if (likely(size >= PAGE_SIZE))
> > +             return __copy_user_nocache(dst, src, size, 0);
> > +     else
> > +             return __copy_from_user_inatomic(dst, src, size);
> >  }
> >
> >  unsigned long
>
>


* Re: Performance regression in write() syscall
  2009-02-24  4:28   ` Linus Torvalds
@ 2009-02-24  9:02     ` Nick Piggin
  2009-02-24 15:52       ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Nick Piggin @ 2009-02-24  9:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Salman Qazi, davem, linux-kernel, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen

On Tuesday 24 February 2009 15:28:54 Linus Torvalds wrote:
> On Tue, 24 Feb 2009, Nick Piggin wrote:
> > What does unixbench's fstime test do? If it is just writing to the
> > pagecache, then this would be unexpected.
>
> Hmm. Not necessarily. Even just plain writes may be slowed down, since the
> nontemporal loads and stores are generally slower than the normal case. So

Well, it does nontemporal stores only, i.e. store into the pagecache without
doubling the cache footprint (the source is possibly already in cache if we're
write(2)ing it anyway).

But yes, I could see that if it fills up the store queue and starts going
synchronous, then for sizes < CPU cache size, it will probably go slower.
On the other hand, cached writes still have to be sent back
to RAM at some point, but maybe that can be pipelined with other things.


> it does make some kind of sense to try to avoid the noncached versions for
> small writes - because small writes tend to be for temp-files.

I don't see the significance of a temp file. If the pagecache is truncated,
then the cachelines remain dirty and so you can't avoid an eventual store
back to RAM? 


> I don't know if PAGE_SIZE is the right thing to test, and I also don't
> know if this is necessarily the best place to test it in, but I don't
> think it's necessarily wrong to do something like this.

No, but I think it should be in arch code, and the "_nocache" suffix
should just be a hint to the architecture that the destination is not
so likely to be used.


> In fact, I think we might also just check alignment. Doing the nontemporal
> crud makes little sense for a non-8-byte-aligned destination, since it
> will have to do the alignment stores cached anyway - and mixing them
> across a cacheline is just crazy.

That is exactly the kind of thing that could probably be better optimised
in arch code IMO.

It would have been nice to have had some numbers to justify
0812a579c92fefa57506821fa08e90f47cb6dbdd in the first place, so you have
a point of reference to see what happens to your speed-up-case when you
change things like this. Sigh.



* Re: Performance regression in write() syscall
  2009-02-24  2:03 Performance regression in write() syscall Salman Qazi
  2009-02-24  4:10 ` Nick Piggin
@ 2009-02-24 10:09 ` Andi Kleen
  2009-02-24 16:13   ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2009-02-24 10:09 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Mon, Feb 23, 2009 at 06:03:04PM -0800, Salman Qazi wrote:
> While the introduction of __copy_from_user_nocache (see commit:
> 0812a579c92fefa57506821fa08e90f47cb6dbdd) may have been an improvement
> for sufficiently large writes, there is evidence to show that it is
> detrimental for small writes.  Unixbench's fstime test gives the
> following results for 256 byte writes with MAX_BLOCK of 2000:

Do you have some data on where the cycles are spent? 

In theory it should be neutral on small writes.

> @@ -192,14 +192,20 @@ static inline int __copy_from_user_nocache(void *dst, const void __user *src,
>  					   unsigned size)
>  {
>  	might_sleep();
> -	return __copy_user_nocache(dst, src, size, 1);
> +	if (likely(size >= PAGE_SIZE))
> +		return __copy_user_nocache(dst, src, size, 1);
> +	else
> +		return __copy_from_user(dst, src, size);

I think you disabled it completely, the kernel never really does
any copies larger than page size because all its internal objects
are page sized only.

That check would need to be higher up the VFS stack (above filemap.c code) 
before the copies are split up.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: Performance regression in write() syscall
  2009-02-24  9:02     ` Nick Piggin
@ 2009-02-24 15:52       ` Linus Torvalds
  2009-02-24 16:24         ` Andi Kleen
                           ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Linus Torvalds @ 2009-02-24 15:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Salman Qazi, davem, linux-kernel, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen



On Tue, 24 Feb 2009, Nick Piggin wrote:
>
> > it does make some kind of sense to try to avoid the noncached versions for
> > small writes - because small writes tend to be for temp-files.
> 
> I don't see the significance of a temp file. If the pagecache is truncated,
> then the cachelines remain dirty and so you can't avoid an eventual store
> back to RAM? 

No, because many small files end up being used as scratch-pads (think 
shell script sequences etc), and get read back immediately again. Doing 
non-temporal stores might just be bad simply because trying to play games 
with caching may simply do the wrong thing.


> > I don't know if PAGE_SIZE is the right thing to test, and I also don't
> > know if this is necessarily the best place to test it in, but I don't
> > think it's necessarily wrong to do something like this.
> 
> No, but I think it should be in arch code, and the "_nocache" suffix
> should just be a hint to the architecture that the destination is not
> so likely to be used.

Yes. Especially since arch code is likely to need various arch-specific 
checks anyway (like the x86 code does about aligning the destination).

> It would have been nice to have had some numbers to justify
> 0812a579c92fefa57506821fa08e90f47cb6dbdd in the first place, so you have
> a point of reference to see what happens to your speed-up-case when you
> change things like this. Sigh.

Well, there were no performance numbers for that commit, since it didn't 
actually tie it into anything, but I'm pretty sure we saw several 
performance numbers for the change.

Yes, and they are in the commit logs. See "x86: cache pollution aware 
__copy_from_user_ll()", commit c22ce143d15eb288543fe9873e1c5ac1c01b69a1.

But notice how that is iozone numbers. Very much about _big_ writes.

		Linus


* Re: Performance regression in write() syscall
  2009-02-24 10:09 ` Andi Kleen
@ 2009-02-24 16:13   ` Ingo Molnar
  2009-02-24 16:51     ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2009-02-24 16:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Salman Qazi, linux-kernel, Thomas Gleixner, H. Peter Anvin


* Andi Kleen <andi@firstfloor.org> wrote:

> On Mon, Feb 23, 2009 at 06:03:04PM -0800, Salman Qazi wrote:

> > -	return __copy_user_nocache(dst, src, size, 1);
> > +	if (likely(size >= PAGE_SIZE))
> > +		return __copy_user_nocache(dst, src, size, 1);
> > +	else
> > +		return __copy_from_user(dst, src, size);
> 
> I think you disabled it completely, the kernel never really 
> does any copies larger than page size because all its internal 
> objects are page sized only.

No, look again, it's not disabled completely - the check now 
basically special-cases 4K writes _only_, and makes them 
non-temporal. That still covers the big/midsize file case.

And that kind of 4K limit makes a lot of sense. A small file 
write is unlikely to have a perfect 4K sized copy. Big file 
writes (and raw/direct IO related copies, etc.) will be chunked 
down to 4K sized units.

	Ingo
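
The chunking referred to above happens higher up, in the filemap write
path: a large write() is split at page boundaries before the copy routine
ever sees it. Roughly (hypothetical helper, ignoring iovecs and
short-copy retries):

```c
#include <stddef.h>

#define PAGE_SIZE 4096

/*
 * Return the size of the next copy chunk for a write of 'remaining'
 * bytes at file position 'pos': filemap.c-style code copies at most
 * up to the end of the current page in each step.
 */
static size_t next_chunk(unsigned long long pos, size_t remaining)
{
	size_t in_page = PAGE_SIZE - (size_t)(pos & (PAGE_SIZE - 1));

	return remaining < in_page ? remaining : in_page;
}
```

So a page-aligned 64K write reaches the copy routine as sixteen full 4K
copies, each passing a size >= PAGE_SIZE test, while a 256-byte write
never does.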


* Re: Performance regression in write() syscall
  2009-02-24 15:52       ` Linus Torvalds
@ 2009-02-24 16:24         ` Andi Kleen
  2009-02-24 16:51         ` Ingo Molnar
  2009-02-25  3:23         ` Nick Piggin
  2 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2009-02-24 16:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Salman Qazi, davem, linux-kernel, Ingo Molnar,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Tue, Feb 24, 2009 at 07:52:34AM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 24 Feb 2009, Nick Piggin wrote:
> >
> > > it does make some kind of sense to try to avoid the noncached versions for
> > > small writes - because small writes tend to be for temp-files.
> > 
> > I don't see the significance of a temp file. If the pagecache is truncated,
> > then the cachelines remain dirty and so you can't avoid an eventual store
> > back to RAM? 
> 
> No, because many small files end up being used as scratch-pads (think 
> shell script sequences etc), and get read back immediately again. Doing 
> non-temporal stores might just be bad simply because trying to play games 
> with caching may simply do the wrong thing.

I usually recommend using shmfs for /tmp. Perhaps that should be done
more widely.

BTW, I have some upcoming patches to improve c_*_u() and memcpy
on modern Intel x86 CPUs.

-Andi



* Re: Performance regression in write() syscall
  2009-02-24 15:52       ` Linus Torvalds
  2009-02-24 16:24         ` Andi Kleen
@ 2009-02-24 16:51         ` Ingo Molnar
  2009-02-25  3:23         ` Nick Piggin
  2 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2009-02-24 16:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Salman Qazi, davem, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > No, but I think it should be in arch code, and the 
> > "_nocache" suffix should just be a hint to the architecture 
> > that the destination is not so likely to be used.
> 
> Yes. Especially since arch code is likely to need various 
> arch-specific checks anyway (like the x86 code does about 
> aligning the destination).

I'm inclined to do this in two or three phases: first apply the 
fix from Salman in the form below.

In practice, if we get a 4K copy request the likelihood is large 
that this is for a larger write and for a full pagecache page. 
The target is very unlikely to be misaligned; the source might 
be, as it comes from user-space.

This portion of __copy_user_nocache() becomes largely 
unnecessary:

ENTRY(__copy_user_nocache)
        CFI_STARTPROC
        cmpl $8,%edx
        jb 20f          /* less then 8 bytes, go to byte copy loop */
        ALIGN_DESTINATION
        movl %edx,%ecx
        andl $63,%edx
        shrl $6,%ecx
        jz 17f

And the tail portion becomes unnecessary too. Those are over a 
dozen instructions so probably worth optimizing out.

But i'd rather express this in terms of a separate 
__copy_user_page_nocache function and keep the generic 
implementation too.

I.e. like the second patch below. (not tested)

With this __copy_user_nocache() becomes unused - and once we are 
happy with the performance characteristics of 4K non-temporal 
copies, we can remove this more generic implementation.

Does this sound reasonable, or do you think we can be smarter 
than this?

	Ingo

-------------------->
From 30d697fa3a25fed809a873b17531a00282dc1234 Mon Sep 17 00:00:00 2001
From: Salman Qazi <sqazi@google.com>
Date: Mon, 23 Feb 2009 18:03:04 -0800
Subject: [PATCH] x86: fix performance regression in write() syscall

While the introduction of __copy_from_user_nocache (see commit:
0812a579c92fefa57506821fa08e90f47cb6dbdd) may have been an improvement
for sufficiently large writes, there is evidence to show that it is
detrimental for small writes.  Unixbench's fstime test gives the
following results for 256 byte writes with MAX_BLOCK of 2000:

    2.6.29-rc6 ( 5 samples, each in KB/sec ):
    283750, 295200, 294500, 293000, 293300

    2.6.29-rc6 + this patch (5 samples, each in KB/sec):
    313050, 3106750, 293350, 306300, 307900

    2.6.18
    395700, 342000, 399100, 366050, 359850

    See w_test() in src/fstime.c in unixbench version 4.1.0.  Basically, the above test
    consists of counting how much we can write in this manner:

    alarm(10);
    while (!sigalarm) {
            for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
                   write(f, buf, 256);
            }
            lseek(f, 0L, 0);
    }

Note, there are other components to the write syscall regression
that are not addressed here.

Signed-off-by: Salman Qazi <sqazi@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/uaccess_64.h |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 84210c4..987a2c1 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -192,14 +192,26 @@ static inline int __copy_from_user_nocache(void *dst, const void __user *src,
 					   unsigned size)
 {
 	might_sleep();
-	return __copy_user_nocache(dst, src, size, 1);
+	/*
+	 * In practice this limit means that large file write()s
+	 * which get chunked to 4K copies get handled via
+	 * non-temporal stores here. Smaller writes get handled
+	 * via regular __copy_from_user():
+	 */
+	if (likely(size >= PAGE_SIZE))
+		return __copy_user_nocache(dst, src, size, 1);
+	else
+		return __copy_from_user(dst, src, size);
 }
 
 static inline int __copy_from_user_inatomic_nocache(void *dst,
 						    const void __user *src,
 						    unsigned size)
 {
-	return __copy_user_nocache(dst, src, size, 0);
+	if (likely(size >= PAGE_SIZE))
+		return __copy_user_nocache(dst, src, size, 0);
+	else
+		return __copy_from_user_inatomic(dst, src, size);
 }
 
 unsigned long

From 577e10ab54dacde4c9fc2d09adfb7e817cadac83 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 24 Feb 2009 17:38:12 +0100
Subject: [PATCH] x86: add __copy_user_page_nocache() optimized memcpy

Add a hardcoded 4K copy __copy_user_page_nocache() implementation,
used for pagecache copies.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/uaccess_64.h   |   10 +++--
 arch/x86/lib/copy_user_nocache_64.S |   84 +++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 987a2c1..71cbcda 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -188,6 +188,8 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
+extern long __copy_user_page_nocache(void *dst, const void __user *src);
+
 static inline int __copy_from_user_nocache(void *dst, const void __user *src,
 					   unsigned size)
 {
@@ -198,8 +200,8 @@ static inline int __copy_from_user_nocache(void *dst, const void __user *src,
 	 * non-temporal stores here. Smaller writes get handled
 	 * via regular __copy_from_user():
 	 */
-	if (likely(size >= PAGE_SIZE))
-		return __copy_user_nocache(dst, src, size, 1);
+	if (likely(size == PAGE_SIZE))
+		return __copy_user_page_nocache(dst, src);
 	else
 		return __copy_from_user(dst, src, size);
 }
@@ -208,8 +210,8 @@ static inline int __copy_from_user_inatomic_nocache(void *dst,
 						    const void __user *src,
 						    unsigned size)
 {
-	if (likely(size >= PAGE_SIZE))
-		return __copy_user_nocache(dst, src, size, 0);
+	if (likely(size == PAGE_SIZE))
+		return __copy_user_page_nocache(dst, src);
 	else
 		return __copy_from_user_inatomic(dst, src, size);
 }
diff --git a/arch/x86/lib/copy_user_nocache_64.S b/arch/x86/lib/copy_user_nocache_64.S
index cb0c112..387f08e 100644
--- a/arch/x86/lib/copy_user_nocache_64.S
+++ b/arch/x86/lib/copy_user_nocache_64.S
@@ -135,3 +135,87 @@ ENTRY(__copy_user_nocache)
 	.previous
 	CFI_ENDPROC
 ENDPROC(__copy_user_nocache)
+
+/*
+ * copy_user_page_nocache - Uncached memory copy of a single page using
+ * non-temporal stores.
+ *
+ * Used for big 4K sized writes - where the chance of repeat access to
+ * the same data is low.
+ */
+ENTRY(__copy_user_page_nocache)
+	CFI_STARTPROC
+
+	/*
+	 * We'll do 64 iterations of 64 bytes each == 4096 bytes,
+	 * hardcoded:
+	 */
+	movl $64, %ecx
+
+1:	movq	0*8(%rsi),	     %r8
+2:	movq	1*8(%rsi),	     %r9
+3:	movq	2*8(%rsi),	    %r10
+4:	movq	3*8(%rsi),	    %r11
+
+5:	movnti	      %r8,	0*8(%rdi)
+6:	movnti	      %r9,	1*8(%rdi)
+7:	movnti	     %r10,	2*8(%rdi)
+8:	movnti	     %r11,	3*8(%rdi)
+
+9:	movq	4*8(%rsi),	     %r8
+10:	movq	5*8(%rsi),	     %r9
+11:	movq	6*8(%rsi),	    %r10
+12:	movq	7*8(%rsi),	    %r11
+
+13:	movnti	      %r8,	4*8(%rdi)
+14:	movnti	      %r9,	5*8(%rdi)
+15:	movnti	     %r10,	6*8(%rdi)
+16:	movnti	     %r11,	7*8(%rdi)
+
+	leaq 64(%rsi),%rsi
+	leaq 64(%rdi),%rdi
+
+	decl %ecx
+	jnz 1b
+
+	sfence
+
+	ret
+
+	.section .fixup,"ax"
+30:	shll $6,%ecx
+	movl %ecx,%edx
+	jmp 60f
+40:	xorl %edx, %edx
+	lea (%rdx,%rcx,8),%rdx
+	jmp 60f
+50:	movl %ecx,%edx
+60:	sfence
+	jmp copy_user_handle_tail
+	.previous
+
+	.section __ex_table,"a"
+	.quad 1b,30b
+	.quad 2b,30b
+	.quad 3b,30b
+	.quad 4b,30b
+	.quad 5b,30b
+	.quad 6b,30b
+	.quad 7b,30b
+	.quad 8b,30b
+	.quad 9b,30b
+	.quad 10b,30b
+	.quad 11b,30b
+	.quad 12b,30b
+	.quad 13b,30b
+	.quad 14b,30b
+	.quad 15b,30b
+	.quad 16b,30b
+	.previous
+
+	CFI_ENDPROC
+ENDPROC(__copy_user_page_nocache)
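
For illustration only, the copy structure above (64 iterations of 64
bytes, four quadwords loaded then stored per group) looks like this in
portable C; plain stores stand in for the non-temporal movnti stores,
and the user-access exception handling is omitted:

```c
#include <stdint.h>

/*
 * Copy one 4K page in 64 iterations of 64 bytes, mirroring the
 * register-blocked loop of __copy_user_page_nocache(). This version
 * uses ordinary cached stores; the asm uses movnti to bypass the cache.
 */
static void copy_page_blocked(void *dst, const void *src)
{
	uint64_t *d = dst;
	const uint64_t *s = src;
	int i;

	for (i = 0; i < 64; i++) {
		/* first four quadwords: load, then store */
		uint64_t r0 = s[0], r1 = s[1], r2 = s[2], r3 = s[3];

		d[0] = r0; d[1] = r1; d[2] = r2; d[3] = r3;

		/* second four quadwords */
		r0 = s[4]; r1 = s[5]; r2 = s[6]; r3 = s[7];
		d[4] = r0; d[5] = r1; d[6] = r2; d[7] = r3;

		s += 8;
		d += 8;
	}
}
```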


* Re: Performance regression in write() syscall
  2009-02-24 16:13   ` Ingo Molnar
@ 2009-02-24 16:51     ` Andi Kleen
  0 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2009-02-24 16:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Salman Qazi, linux-kernel, Thomas Gleixner, H. Peter Anvin


Thanks everyone for the correction. I indeed misread the patch.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: Performance regression in write() syscall
  2009-02-24 15:52       ` Linus Torvalds
  2009-02-24 16:24         ` Andi Kleen
  2009-02-24 16:51         ` Ingo Molnar
@ 2009-02-25  3:23         ` Nick Piggin
  2009-02-25  7:25           ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Ingo Molnar
  2 siblings, 1 reply; 50+ messages in thread
From: Nick Piggin @ 2009-02-25  3:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Salman Qazi, davem, linux-kernel, Ingo Molnar, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen

On Wednesday 25 February 2009 02:52:34 Linus Torvalds wrote:
> On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > it does make some kind of sense to try to avoid the noncached versions
> > > for small writes - because small writes tend to be for temp-files.
> >
> > I don't see the significance of a temp file. If the pagecache is
> > truncated, then the cachelines remain dirty and so you can't avoid an
> > eventual store back to RAM?
>
> No, because many small files end up being used as scratch-pads (think
> shell script sequences etc), and get read back immediately again. Doing
> non-temporal stores might just be bad simply because trying to play games
> with caching may simply do the wrong thing.

OK, for that angle it could make sense. Although as has been noted earlier,
at this point of the copy, we don't have much idea about the length of the
write passed into the vfs (and obviously will never know the higher level
intention of userspace).

I don't know if we can say a 1 page write is nontemporal, but anything
smaller is temporal. And having these kinds of behavioural cutoffs I
would worry will create strange performance boundary conditions in code.


> > > I don't know if PAGE_SIZE is the right thing to test, and I also don't
> > > know if this is necessarily the best place to test it in, but I don't
> > > think it's necessarily wrong to do something like this.
> >
> > No, but I think it should be in arch code, and the "_nocache" suffix
> > should just be a hint to the architecture that the destination is not
> > so likely to be used.
>
> Yes. Especially since arch code is likely to need various arch-specific
> checks anyway (like the x86 code does about aligning the destination).
>
> > It would have been nice to have had some numbers to justify
> > 0812a579c92fefa57506821fa08e90f47cb6dbdd in the first place, so you have
> > a point of reference to see what happens to your speed-up-case when you
> > change things like this. Sigh.
>
> Well, there were no performance numbers for that commit, since it didn't
> actually tie it into anything, but I'm pretty sure we saw several
> performance numbers for the change.
>
> Yes, and they are in the commit logs. See "x86: cache pollution aware
> __copy_from_user_ll()", commit c22ce143d15eb288543fe9873e1c5ac1c01b69a1.
>
> But notice how that is iozone numbers. Very much about _big_ writes.

Yeah I see, thanks.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25  3:23         ` Nick Piggin
@ 2009-02-25  7:25           ` Ingo Molnar
  2009-02-25  8:09             ` Nick Piggin
  2009-02-25 16:04             ` Linus Torvalds
  0 siblings, 2 replies; 50+ messages in thread
From: Ingo Molnar @ 2009-02-25  7:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Wednesday 25 February 2009 02:52:34 Linus Torvalds wrote:
> > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > > it does make some kind of sense to try to avoid the noncached versions
> > > > for small writes - because small writes tend to be for temp-files.
> > >
> > > I don't see the significance of a temp file. If the pagecache is
> > > truncated, then the cachelines remain dirty and so you can't avoid an
> > > eventual store back to RAM?
> >
> > No, because many small files end up being used as scratch-pads (think
> > shell script sequences etc), and get read back immediately again. Doing
> > non-temporal stores might just be bad simply because trying to play games
> > with caching may simply do the wrong thing.
> 
> OK, for that angle it could make sense. Although as has been 
> noted earlier, at this point of the copy, we don't have much 
> idea about the length of the write passed into the vfs (and 
> obviously will never know the higher level intention of 
> userspace).
> 
> I don't know if we can say a 1 page write is nontemporal, but 
> anything smaller is temporal. And having these kinds of 
> behavioural cutoffs I would worry will create strange 
> performance boundary conditions in code.

I agree in principle.

The main artifact would be the unaligned edges around a bigger 
write. In particular the tail portion of a big write will be 
cached.

For example, if we write a 100,000-byte file, we'll copy the
first 24 pages (98,304 bytes) uncached, while the final 1696
bytes are copied cached. But there is nothing that necessitates
this asymmetry.

For that reason it would be nice to pass down the total size of 
the write to the assembly code. These are single-usage-site APIs 
anyway so it should be easy.

I.e. something like the patch below (untested). I've extended 
the copy APIs to also pass down a 'total' size as well, and 
check for that instead of the chunk 'len'. Note that it's 
checked in the inlined portion so all the "total == len" special 
cases will optimize out the new parameter completely.

This should express the 'large vs. small' question adequately, 
with no artifacts. Agreed?

	Ingo

Index: tip/arch/x86/include/asm/uaccess_32.h
===================================================================
--- tip.orig/arch/x86/include/asm/uaccess_32.h
+++ tip/arch/x86/include/asm/uaccess_32.h
@@ -157,7 +157,7 @@ __copy_from_user(void *to, const void __
 }
 
 static __always_inline unsigned long __copy_from_user_nocache(void *to,
-				const void __user *from, unsigned long n)
+		const void __user *from, unsigned long n, unsigned long total)
 {
 	might_fault();
 	if (__builtin_constant_p(n)) {
@@ -180,7 +180,7 @@ static __always_inline unsigned long __c
 
 static __always_inline unsigned long
 __copy_from_user_inatomic_nocache(void *to, const void __user *from,
-				  unsigned long n)
+				  unsigned long n, unsigned long total)
 {
        return __copy_from_user_ll_nocache_nozero(to, from, n);
 }
Index: tip/arch/x86/include/asm/uaccess_64.h
===================================================================
--- tip.orig/arch/x86/include/asm/uaccess_64.h
+++ tip/arch/x86/include/asm/uaccess_64.h
@@ -189,7 +189,7 @@ extern long __copy_user_nocache(void *ds
 				unsigned size, int zerorest);
 
 static inline int __copy_from_user_nocache(void *dst, const void __user *src,
-					   unsigned size)
+				   unsigned size, unsigned long total)
 {
 	might_sleep();
 	/*
@@ -198,17 +198,16 @@ static inline int __copy_from_user_nocac
 	 * non-temporal stores here. Smaller writes get handled
 	 * via regular __copy_from_user():
 	 */
-	if (likely(size >= PAGE_SIZE))
+	if (likely(total >= PAGE_SIZE))
 		return __copy_user_nocache(dst, src, size, 1);
 	else
 		return __copy_from_user(dst, src, size);
 }
 
 static inline int __copy_from_user_inatomic_nocache(void *dst,
-						    const void __user *src,
-						    unsigned size)
+	    const void __user *src, unsigned size, unsigned total)
 {
-	if (likely(size >= PAGE_SIZE))
+	if (likely(total >= PAGE_SIZE))
 		return __copy_user_nocache(dst, src, size, 0);
 	else
 		return __copy_from_user_inatomic(dst, src, size);
Index: tip/drivers/gpu/drm/i915/i915_gem.c
===================================================================
--- tip.orig/drivers/gpu/drm/i915/i915_gem.c
+++ tip/drivers/gpu/drm/i915/i915_gem.c
@@ -211,7 +211,7 @@ fast_user_write(struct io_mapping *mappi
 
 	vaddr_atomic = io_mapping_map_atomic_wc(mapping, page_base);
 	unwritten = __copy_from_user_inatomic_nocache(vaddr_atomic + page_offset,
-						      user_data, length);
+						      user_data, length, length);
 	io_mapping_unmap_atomic(vaddr_atomic);
 	if (unwritten)
 		return -EFAULT;
Index: tip/include/linux/uaccess.h
===================================================================
--- tip.orig/include/linux/uaccess.h
+++ tip/include/linux/uaccess.h
@@ -41,13 +41,13 @@ static inline void pagefault_enable(void
 #ifndef ARCH_HAS_NOCACHE_UACCESS
 
 static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
-				const void __user *from, unsigned long n)
+		const void __user *from, unsigned long n, unsigned long total)
 {
 	return __copy_from_user_inatomic(to, from, n);
 }
 
 static inline unsigned long __copy_from_user_nocache(void *to,
-				const void __user *from, unsigned long n)
+		const void __user *from, unsigned long n, unsigned long total)
 {
 	return __copy_from_user(to, from, n);
 }
Index: tip/mm/filemap.c
===================================================================
--- tip.orig/mm/filemap.c
+++ tip/mm/filemap.c
@@ -1816,14 +1816,14 @@ EXPORT_SYMBOL(file_remove_suid);
 static size_t __iovec_copy_from_user_inatomic(char *vaddr,
 			const struct iovec *iov, size_t base, size_t bytes)
 {
-	size_t copied = 0, left = 0;
+	size_t copied = 0, left = 0, total = bytes;
 
 	while (bytes) {
 		char __user *buf = iov->iov_base + base;
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy);
+		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy, total);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;
@@ -1851,8 +1851,9 @@ size_t iov_iter_copy_from_user_atomic(st
 	if (likely(i->nr_segs == 1)) {
 		int left;
 		char __user *buf = i->iov->iov_base + i->iov_offset;
+
 		left = __copy_from_user_inatomic_nocache(kaddr + offset,
-							buf, bytes);
+							buf, bytes, bytes);
 		copied = bytes - left;
 	} else {
 		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
@@ -1880,7 +1881,8 @@ size_t iov_iter_copy_from_user(struct pa
 	if (likely(i->nr_segs == 1)) {
 		int left;
 		char __user *buf = i->iov->iov_base + i->iov_offset;
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+
+		left = __copy_from_user_nocache(kaddr + offset, buf, bytes, bytes);
 		copied = bytes - left;
 	} else {
 		copied = __iovec_copy_from_user_inatomic(kaddr + offset,


* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25  7:25           ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Ingo Molnar
@ 2009-02-25  8:09             ` Nick Piggin
  2009-02-25  8:29               ` Ingo Molnar
  2009-02-25 16:04             ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: Nick Piggin @ 2009-02-25  8:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Wednesday 25 February 2009 18:25:03 Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > On Wednesday 25 February 2009 02:52:34 Linus Torvalds wrote:
> > > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > > > it does make some kind of sense to try to avoid the noncached
> > > > > versions for small writes - because small writes tend to be for
> > > > > temp-files.
> > > >
> > > > I don't see the significance of a temp file. If the pagecache is
> > > > truncated, then the cachelines remain dirty and so you can't avoid an
> > > > eventual store back to RAM?
> > >
> > > No, because many small files end up being used as scratch-pads (think
> > > shell script sequences etc), and get read back immediately again. Doing
> > > non-temporal stores might just be bad simply because trying to play
> > > games with caching may simply do the wrong thing.
> >
> > OK, for that angle it could make sense. Although as has been
> > noted earlier, at this point of the copy, we don't have much
> > idea about the length of the write passed into the vfs (and
> > obviously will never know the higher level intention of
> > userspace).
> >
> > I don't know if we can say a 1 page write is nontemporal, but
> > anything smaller is temporal. And having these kinds of
> > behavioural cutoffs I would worry will create strange
> > performance boundary conditions in code.
>
> I agree in principle.
>
> The main artifact would be the unaligned edges around a bigger
> write. In particular the tail portion of a big write will be
> cached.
>
> For example if we write a 100,000 bytes file, we'll copy the
> first 24 pages (98304 bytes) uncached, while the final 1696
> bytes cached. But there is nothing that necessitates this
> asymmetry.
>
> For that reason it would be nice to pass down the total size of
> the write to the assembly code. These are single-usage-site APIs
> anyway so it should be easy.
>
> I.e. something like the patch below (untested). I've extended
> the copy APIs to also pass down a 'total' size as well, and
> check for that instead of the chunk 'len'. Note that it's
> checked in the inlined portion so all the "total == len" special
> cases will optimize out the new parameter completely.

This does give more information, though not all of it (it could
be a big total write made up of many smaller writes, especially
if the source is generated on the fly and will be in cache; or,
if the source is not in cache, then we would also want to do
nontemporal loads from there, etc.).


> This should express the 'large vs. small' question adequately,
> with no artifacts. Agreed?

Well, no artifacts, but it still has a boundary condition where
one might cut from temporal to nontemporal behaviour.

If it is a *really* important issue, maybe some flags should be
incorporated into an extended API? If not, then I wonder if it
is important enough to add such complexity for?



* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25  8:09             ` Nick Piggin
@ 2009-02-25  8:29               ` Ingo Molnar
  2009-02-25  8:59                 ` Nick Piggin
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2009-02-25  8:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Wednesday 25 February 2009 18:25:03 Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > On Wednesday 25 February 2009 02:52:34 Linus Torvalds wrote:
> > > > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > > > > it does make some kind of sense to try to avoid the noncached
> > > > > > versions for small writes - because small writes tend to be for
> > > > > > temp-files.
> > > > >
> > > > > I don't see the significance of a temp file. If the pagecache is
> > > > > truncated, then the cachelines remain dirty and so you can't avoid an
> > > > > eventual store back to RAM?
> > > >
> > > > No, because many small files end up being used as scratch-pads (think
> > > > shell script sequences etc), and get read back immediately again. Doing
> > > > non-temporal stores might just be bad simply because trying to play
> > > > games with caching may simply do the wrong thing.
> > >
> > > OK, for that angle it could make sense. Although as has been
> > > noted earlier, at this point of the copy, we don't have much
> > > idea about the length of the write passed into the vfs (and
> > > obviously will never know the higher level intention of
> > > userspace).
> > >
> > > I don't know if we can say a 1 page write is nontemporal, but
> > > anything smaller is temporal. And having these kinds of
> > > behavioural cutoffs I would worry will create strange
> > > performance boundary conditions in code.
> >
> > I agree in principle.
> >
> > The main artifact would be the unaligned edges around a bigger
> > write. In particular the tail portion of a big write will be
> > cached.
> >
> > For example if we write a 100,000 bytes file, we'll copy the
> > first 24 pages (98304 bytes) uncached, while the final 1696
> > bytes cached. But there is nothing that necessitates this
> > asymmetry.
> >
> > For that reason it would be nice to pass down the total size of
> > the write to the assembly code. These are single-usage-site APIs
> > anyway so it should be easy.
> >
> > I.e. something like the patch below (untested). I've extended
> > the copy APIs to also pass down a 'total' size as well, and
> > check for that instead of the chunk 'len'. Note that it's
> > checked in the inlined portion so all the "total == len" special
> > cases will optimize out the new parameter completely.
> 
> This does give more information, not exactly all (it could be 
> a big total write with many smaller writes especially if the 
> source is generated on the fly and will be in cache, or if the 
> source is not in cache, then we would also want to do 
> nontemporal loads from there etc etc).

A big total write with many smaller writes should already be 
handled in this codepath, as long as it's properly iovec-ed.

We can do little about user-space doing stupid things, such as
doing a big write as a series of many smaller-than-4K writes.

> > This should express the 'large vs. small' question 
> > adequately, with no artifacts. Agreed?
> 
> Well, no artifacts, but it still has a boundary condition 
> where one might cut from temporal to nontemporal behaviour.
> 
> If it is a *really* important issue, maybe some flags should 
> be incorporated into an extended API? If not, then I wonder if
> it is important enough to add such complexity for?

the iozone numbers in the old commits certainly are convincing. 
The new numbers from Salman are convincing too - and his fix 
should preserve the iozone [large-write] numbers as well.

I.e. i think this is a reasonable compromise, it should handle 
all the sane cases.

	Ingo


* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25  8:29               ` Ingo Molnar
@ 2009-02-25  8:59                 ` Nick Piggin
  2009-02-25 12:01                   ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Nick Piggin @ 2009-02-25  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Wednesday 25 February 2009 19:29:58 Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > On Wednesday 25 February 2009 18:25:03 Ingo Molnar wrote:
> > > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > > On Wednesday 25 February 2009 02:52:34 Linus Torvalds wrote:
> > > > > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > > > > > it does make some kind of sense to try to avoid the noncached
> > > > > > > versions for small writes - because small writes tend to be for
> > > > > > > temp-files.
> > > > > >
> > > > > > I don't see the significance of a temp file. If the pagecache is
> > > > > > truncated, then the cachelines remain dirty and so you can't
> > > > > > avoid an eventual store back to RAM?
> > > > >
> > > > > No, because many small files end up being used as scratch-pads
> > > > > (think shell script sequences etc), and get read back immediately
> > > > > again. Doing non-temporal stores might just be bad simply because
> > > > > trying to play games with caching may simply do the wrong thing.
> > > >
> > > > OK, for that angle it could make sense. Although as has been
> > > > noted earlier, at this point of the copy, we don't have much
> > > > idea about the length of the write passed into the vfs (and
> > > > obviously will never know the higher level intention of
> > > > userspace).
> > > >
> > > > I don't know if we can say a 1 page write is nontemporal, but
> > > > anything smaller is temporal. And having these kinds of
> > > > behavioural cutoffs I would worry will create strange
> > > > performance boundary conditions in code.
> > >
> > > I agree in principle.
> > >
> > > The main artifact would be the unaligned edges around a bigger
> > > write. In particular the tail portion of a big write will be
> > > cached.
> > >
> > > For example if we write a 100,000 bytes file, we'll copy the
> > > first 24 pages (98304 bytes) uncached, while the final 1696
> > > bytes cached. But there is nothing that necessitates this
> > > asymmetry.
> > >
> > > For that reason it would be nice to pass down the total size of
> > > the write to the assembly code. These are single-usage-site APIs
> > > anyway so it should be easy.
> > >
> > > I.e. something like the patch below (untested). I've extended
> > > the copy APIs to also pass down a 'total' size as well, and
> > > check for that instead of the chunk 'len'. Note that it's
> > > checked in the inlined portion so all the "total == len" special
> > > cases will optimize out the new parameter completely.
> >
> > This does give more information, not exactly all (it could be
> > a big total write with many smaller writes especially if the
> > source is generated on the fly and will be in cache, or if the
> > source is not in cache, then we would also want to do
> > nontemporal loads from there etc etc).
>
> A big total write with many smaller writes should already be
> handled in this codepath, as long as it's properly iovec-ed.

No I'm talking about this next case:

> We can do little about user-space doing stupid things, such as
> doing a big write as a series of many smaller-than-4K writes.

Not necessarily smaller than 4K writes, but even as a series of 4K
writes. It isn't a stupid thing to do if the source memory is always
in cache. But if your destination is unlikely to be used, then you
still would want nontemporal stores.


> > > This should express the 'large vs. small' question
> > > adequately, with no artifacts. Agreed?
> >
> > Well, no artifacts, but it still has a boundary condition
> > where one might cut from temporal to nontemporal behaviour.
> >
> > If it is a *really* important issue, maybe some flags should
> > be incorporated into an extended API? It not, then I wonder if
> > it is important enough to add such complexity for?
>
> the iozone numbers in the old commits certainly are convincing.

Yes. The common case should be not touching the page again.


> The new numbers from Salman are convincing too - and his fix

I'm not exactly convinced. The boundary behaviour condition is a
real negative. What I question is whether that benchmark is not
doing something stupid. He is quoting the write(2)-only portion
of the benchmark, so the speedup does not come from the app reading
back results from cache. It comes from either overwriting the same
dirty cachelines (a performance critical program should really
avoid doing this if possible anyway); or the cached stores simply
pipelining better with non-store operations (but in that case you
probably still want non-temporal stores anyway because if your
workload is doing any real work, you don't want to push its cache
out with these stores).

So, can we find something that is more realistic? Doesn't gcc
create several stages of temporary files?



* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25  8:59                 ` Nick Piggin
@ 2009-02-25 12:01                   ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2009-02-25 12:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> No I'm talking about this next case:
> 
> > We can do little about user-space doing stupid things, such
> > as doing a big write as a series of many smaller-than-4K
> > writes.
> 
> Not necessarily smaller than 4K writes, but even as a series 
> of 4K writes. It isn't a stupid thing to do if the source memory
> is always in cache. But if your destination is unlikely to be
> used, then you still would want nontemporal stores.

I dont disagree that it would be nice to handle that case too, i 
just dont see how.

Unless you suggest some new logic that tracks the length of a 
continuous write to a file, and whether it got read back 
recently, i dont see how this could be done sanely.

That's the deal generally: if an app gives the kernel enough 
information in a syscall, the kernel can act on it reasonably. 

Sometimes, for important cases we allow apps to set attributes 
that function across syscalls too - like here we could extend 
madvise() to hint the kind of access ... but i doubt it would be 
used widely.

Sometimes, for _really_ important cases the kernel will also try 
to auto-detect patterns of use. We do that for readahead and we 
do that for socket buffers - and a few other things. Do you 
suggest we should do it here too?

Anyway ... i wouldnt mind if the lowlevel code used more hints 
if they are present and useful.

And unlike the 'final tail' case, which indeed was quirky
behavior and was worth fixing (hence the 'total' patch), the
'should the kernel detect many small writes being one real big
write' question is not a quirk but a highlevel question that
neither the lowlevel copy code nor the lowlevel pagecache code
can answer. So it's all a bit different.

> > The new numbers from Salman are convincing too - and his fix
> 
> I'm not exactly convinced. The boundary behaviour condition is 
> a real negative. What I question is whether that benchmark is 
> not doing something stupid. He is quoting the write(2)-only 
> portion of the benchmark, so the speedup does not come from 
> the app reading back results from cache. It comes from either 
> overwriting the same dirty cachelines (a performance critical 
> program should really avoid doing this if possible anyway); or 
> the cached stores simply pipelining better with non-store 
> operations (but in that case you probably still want 
> non-temporal stores anyway because if your workload is doing 
> any real work, you don't want to push its cache out with these 
> stores).
> 
> So, can we find something that is more realistic? Doesn't gcc 
> create several stages of temporary files?

i dont think this is really about performance critical apps, and 
i suspect the numbers will be even more convincing if a read() 
is inserted in between.

Lets face it, 99% of the Linux apps out there are not coded with 
'performance critical' aspects in mind.

So what we have to do is to watch out for common and still sane 
patterns of kernel usage - and optimize them, not dismiss them 
with 'this could be done even faster with XYZ'. (As long as it 
does not hurt sane usages - which i think this does not.)

	Ingo


* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25  7:25           ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Ingo Molnar
  2009-02-25  8:09             ` Nick Piggin
@ 2009-02-25 16:04             ` Linus Torvalds
  2009-02-25 16:29               ` Ingo Molnar
  2009-02-27 12:05               ` Nick Piggin
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2009-02-25 16:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Salman Qazi, davem, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen



On Wed, 25 Feb 2009, Ingo Molnar wrote:
> 
> The main artifact would be the unaligned edges around a bigger 
> write. In particular the tail portion of a big write will be 
> cached.

.. but I don't really agree that this is a problem.

Sure, it's "wrong", but does it actually matter? No. Is it worth adding 
complexity to existing interfaces for? I think not.

In general, I think that software should not mess with nontemporal stores. 
The thing is, software almost never knows enough about the CPU cache to 
make an intelligent choice.

So I didn't want to apply the nocache patches in the first place, but the 
performance numbers were pretty clear. I'll take "real numbers" over my 
personal dislikes any day. But now we have real numbers going the other 
way for small writes, and a patch to fix that.

But we have no amount of real numbers for the edge cases, and I don't 
think they matter. In fact, I don't think they _can_ matter, because it is 
inevitably always going to be an issue of "which CPU and which memory 
subsystem".

In other words, there is no "right" answer. There is no "perfect". But 
there is "we can fix the real numbers".

At the same time, we also do know:
 - caches work
 - CPU designers will continue to worry about the normal (cached) case, 
   and will do reasonable things with cache replacement.
 - ergo: we should always consider the cached case to be the _normal_ mode,
   and it's the nontemporal loads/stores that need to explain themselves.

So I do think we should just apply the simple patch. Not make a big deal 
out of it. We have numbers. We use cached memory copies for everything 
else. It's always "safe".

And we pretty much know that the only time we will ever really care about 
the nontemporal case is with big writes - where the "edge effects" 
essentially become total noise. 

		Linus


* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25 16:04             ` Linus Torvalds
@ 2009-02-25 16:29               ` Ingo Molnar
  2009-02-27 12:05               ` Nick Piggin
  1 sibling, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2009-02-25 16:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Salman Qazi, davem, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So I do think we should just apply the simple patch. Not make 
> a big deal out of it. We have numbers. We use cached memory 
> copies for everything else. It's always "safe".

ok, 4K is definitely a simple, defensible, calculable rule.

The only thing that pushes me a bit towards the 'total' tweak is 
that it is still a simple rule, one that also applies to something 
the user app controls and is intimately familiar with (the total 
size of the write()), and it is very cheap to do.

The 64-bit x86 defconfig comparison is:

|
|    text	   data	    bss	    dec	    hex	filename
| 7669425	1138036	 842840	9650301	 93407d	vmlinux.before
| 7669425	1138036	 842840	9650301	 93407d	vmlinux.after
|

md5 shows it's different instructions:

|
|  ec6ba8ae17bdd3ae55062fc59f83fd25  vmlinux.before
|  1f7e9ea67524926cb92b96d5db97f9ff  vmlinux.after
|

i.e. it got almost all of it eliminated at the inlining level, 
and what we have are different but already existing parameters 
being passed in.

There was a single instruction added:

|
|  $ codiff vmlinux.before vmlinux.after
|
|  vmlinux.after:
|   2 functions changed, 6 bytes added, 1 bytes removed, diff: +5
|
|  mm/filemap.c:
|    __iovec_copy_from_user_inatomic |   +6
|    iov_iter_copy_from_user         |   -1
|   2 functions changed, 6 bytes added, 1 bytes removed, diff: +5
|

But it is hidden in the total size end result by lucky function 
size alignment.

If you are still not convinced i'll remove the two followup 
commits and will keep the first one only.

	Ingo


* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-25 16:04             ` Linus Torvalds
  2009-02-25 16:29               ` Ingo Molnar
@ 2009-02-27 12:05               ` Nick Piggin
  2009-02-28  8:29                 ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Nick Piggin @ 2009-02-27 12:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Salman Qazi, davem, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen

On Thursday 26 February 2009 03:04:22 Linus Torvalds wrote:
> On Wed, 25 Feb 2009, Ingo Molnar wrote:
> > The main artifact would be the unaligned edges around a bigger
> > write. In particular the tail portion of a big write will be
> > cached.
>
> .. but I don't really agree that this is a problem.
>
> Sure, it's "wrong", but does it actually matter? No. Is it worth adding
> complexity to existing interfaces for? I think not.
>
> In general, I think that software should not mess with nontemporal stores.
> The thing is, software almost never knows enough about the CPU cache to
> make an intelligent choice.
>
> So I didn't want to apply the nocache patches in the first place, but the
> performance numbers were pretty clear. I'll take "real numbers" over my
> personal dislikes any day. But now we have real numbers going the other
> way for small writes, and a patch to fix that.
>
> But we have no amount of real numbers for the edge cases, and I don't
> think they matter. In fact, I don't think they _can_ matter, because it is
> inevitably always going to be an issue of "which CPU and which memory
> subsystem".
>
> In other words, there is no "right" answer. There is no "perfect". But
> there is "we can fix the real numbers".

Well... these are "real" benchmark numbers, where the benchmark
is apparently performing an access pattern that seemingly should
favour nontemporal stores (the numbers are just measuring the
phase where write(2) is being done).


> At the same time, we also do know:
>  - caches work
>  - CPU designers will continue to worry about the normal (cached) case,
>    and will do reasonable things with cache replacement.
> >  - ergo: we should always consider the cached case to be the _normal_ mode,
> >    and it's the nontemporal loads/stores that need to explain themselves.
>
> So I do think we should just apply the simple patch. Not make a big deal
> out of it. We have numbers. We use cached memory copies for everything
> else. It's always "safe".
>
> And we pretty much know that the only time we will ever really care about
> the nontemporal case is with big writes - where the "edge effects"
> essentially become total noise.

I guess so. I wouldn't mind just doing cached stores all the time for
the reasons you say.

But whatever. If it ever becomes *really* important, I guess we can flag
this kind of behaviour from userspace.




* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-27 12:05               ` Nick Piggin
@ 2009-02-28  8:29                 ` Ingo Molnar
  2009-02-28 11:49                   ` Nick Piggin
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2009-02-28  8:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Thursday 26 February 2009 03:04:22 Linus Torvalds wrote:
> > On Wed, 25 Feb 2009, Ingo Molnar wrote:
> > > The main artifact would be the unaligned edges around a bigger
> > > write. In particular the tail portion of a big write will be
> > > cached.
> >
> > .. but I don't really agree that this is a problem.
> >
> > Sure, it's "wrong", but does it actually matter? No. Is it worth adding
> > complexity to existing interfaces for? I think not.
> >
> > In general, I think that software should not mess with nontemporal stores.
> > The thing is, software almost never knows enough about the CPU cache to
> > make an intelligent choice.
> >
> > So I didn't want to apply the nocache patches in the first place, but the
> > performance numbers were pretty clear. I'll take "real numbers" over my
> > personal dislikes any day. But now we have real numbers going the other
> > way for small writes, and a patch to fix that.
> >
> > But we have no amount of real numbers for the edge cases, and I don't
> > think they matter. In fact, I don't think they _can_ matter, because it is
> > inevitably always going to be an issue of "which CPU and which memory
> > subsystem".
> >
> > In other words, there is no "right" answer. There is no "perfect". But
> > there is "we can fix the real numbers".
> 
> Well... these are "real" benchmark numbers. Where the benchmark is
> actually apparently performing an access pattern that seemingly
> should favour nontemporal stores (the numbers are just measuring the
> phase where write(2) is being done).
> 
> 
> > At the same time, we also do know:
> >  - caches work
> >  - CPU designers will continue to worry about the normal (cached) case,
> >    and will do reasonable things with cache replacement.
> >  - ergo: we should always consider the cached case to be the _normal_ mode,
> >    and it's the nontemporal loads/stores that need to explain themselves.
> >
> > So I do think we should just apply the simple patch. Not make a big deal
> > out of it. We have numbers. We use cached memory copies for everything
> > else. It's always "safe".
> >
> > And we pretty much know that the only time we will ever really care about
> > the nontemporal case is with big writes - where the "edge effects"
> > essentially become total noise.
> 
> I guess so. I wouldn't mind just doing cached stores all the 
> time for the reasons you say.
> 
> But whatever. If it ever becomes *really* important, I guess 
> we can flag this kind of behaviour from userspace.

Important question: is there a standing NAK for the 'total' 
parameter addition patch i did? You requested it and Linus didn't 
like it ... and i've measured it and it's just a single 
instruction in the whole kernel so it did not seem to be too bad 
to me.

It might be wrong in principle though, so I will revert it if 
needed, before it spreads into too many topics.

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28  8:29                 ` Ingo Molnar
@ 2009-02-28 11:49                   ` Nick Piggin
  2009-02-28 12:58                     ` Ingo Molnar
  0 siblings, 1 reply; 50+ messages in thread
From: Nick Piggin @ 2009-02-28 11:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Saturday 28 February 2009 19:29:22 Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > On Thursday 26 February 2009 03:04:22 Linus Torvalds wrote:

> > > And we pretty much know that the only time we will ever really care
> > > about the nontemporal case is with big writes - where the "edge
> > > effects" essentially become total noise.
> >
> > I guess so. I wouldn't mind just doing cached stores all the
> > time for the reasons you say.
> >
> > But whatever. If it ever becomes *really* important, I guess
> > we can flag this kind of behaviour from userspace.
>
> Important question: is there a standing NAK for the 'total'
> parameter addition patch i did? You requested it and Linus didn't
> like it ... and i've measured it and it's just a single
> instruction in the whole kernel so it did not seem to be too bad
> to me.

Well I didn't request it so much as just being concerned about hard
transitions in behaviour in general, and also the lack of an overall
picture (which this patch improves a little, but doesn't solve --
only way to really do it is with explicit flags).


> It might be wrong on the principle though, so will revert it if
> needed, before it spreads into too many topics.

I would say if Linus didn't like it, revert it.

Do you use NFS with 1500 byte packets with a high speed network and
really fast disks (or ramdisk)? Then maybe you can measure vectored
write speedup with your patch over Salman's more basic one :)


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 11:49                   ` Nick Piggin
@ 2009-02-28 12:58                     ` Ingo Molnar
  2009-02-28 17:16                       ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2009-02-28 12:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > It might be wrong in principle though, so I will revert it 
> > if needed, before it spreads into too many topics.
> 
> I would say if Linus didn't like it, revert it.
> 
> Do you use NFS with 1500 byte packets with a high speed 
> network and really fast disks (or ramdisk)? Then maybe you can 
> measure vectored write speedup with your patch over Salman's 
> more basic one :)

Can you suggest some other workload that should show sensitivity 
to this detail too? Like a simple write() loop of non-4K-sized 
files or so?

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 12:58                     ` Ingo Molnar
@ 2009-02-28 17:16                       ` Linus Torvalds
  2009-02-28 17:24                         ` Arjan van de Ven
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2009-02-28 17:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Salman Qazi, davem, linux-kernel, Thomas Gleixner,
	H. Peter Anvin, Andi Kleen



On Sat, 28 Feb 2009, Ingo Molnar wrote:
> 
> Can you suggest some other workload that should show sensitivity 
> to this detail too? Like a simple write() loop of non-4K-sized 
> files or so?

I bet you can find it, but I also suspect that it will depend quite a bit 
on the microarchitecture. What does 'movntq' actually _do_ on different 
CPU's (bypass L1 or L2 or just turn the L1 cache policy to "write through 
and invalidate")? How expensive is the sfence when there are still stores 
in the write buffer? Does 'movntq' even use the write buffer as it does 
for cached stores, or is it doing some special path to the last-level cache?

If you want to be really subtle, ask questions like what are the 
implications for last-level caches that are inclusive? The last-level 
cache would take not just the new write, but it also has logic to make 
sure that it's a superset of the inner caches, so what does that do to 
replacement policy for that cache? Or does it cause invalidations in the 
inner caches?

Non-temporal stores are really quite different from normal stores. 
Depending on microarchitecture, that may be totally a non-issue (bypassing 
the L1 may be trivial and have no impact on anything else at all). Or it 
could be that a movntq is really expensive because it needs to do odd 
things.

So if you want to test this, I'd suggest using the same program that did 
the 256-byte writes (Unixbench's fstime thing), but just change the 
numbers, and just try different things. But I'd _also_ suggest that if 
you're going for anything more complicated (ie if you really want to 
have a good argument for that 'total_size' thing), then you should try out 
at least three different microarchitectures.

The "different" ones would be at a minimum P4, Core2 and Opteron. They 
really could have very different behavior. 

I suspect Core2 and Core i7 are fairly similar, but at the same time Ci7 
has that L3 cache thing, so it's quite possible that movntq is actually 
fundamentally different (does it bypass both L1 and L2? If so, latencies 
to the L3 are _much_ longer on Ci7 than the very cheap L2 latencies on 
C2).

			Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 17:16                       ` Linus Torvalds
@ 2009-02-28 17:24                         ` Arjan van de Ven
  2009-02-28 17:42                           ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Arjan van de Ven @ 2009-02-28 17:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Sat, 28 Feb 2009 09:16:21 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Sat, 28 Feb 2009, Ingo Molnar wrote:
> > 
> > Can you suggest some other workload that should show sensitivity 
> > to this detail too? Like a simple write() loop of non-4K-sized 
> > files or so?
> 
> I bet you can find it, but I also suspect that it will depend quite a
> bit on the microarchitecture. What does 'movntq' actually _do_ on
> different CPU's (bypass L1 or L2 or just turn the L1 cache policy to
> "write through and invalidate")? 

Afaik it's like a cache flush followed by the equivalent of a WC store

> How expensive is the sfence when
> there are still stores in the write buffer? Does 'movntq' even use
> the write buffer as it does for cached stores, or is it doing some
> special path to the last-level cache?

it's usually like a WC store
>
> If you want to be really subtle, ask questions like what are the 
> implications for last-level caches that are inclusive? The last-level 
> cache would take not just the new write, but it also has logic to
> make sure that it's a superset of the inner caches, so what does that
> do to replacement policy for that cache? Or does it cause
> invalidations in the inner caches?

it invalidates all caches in the hierarchy

afaik this is what Intel cpus do; but I also thought this behavior was
quite architectural as well...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 17:24                         ` Arjan van de Ven
@ 2009-02-28 17:42                           ` Linus Torvalds
  2009-02-28 17:53                             ` Arjan van de Ven
                                               ` (3 more replies)
  0 siblings, 4 replies; 50+ messages in thread
From: Linus Torvalds @ 2009-02-28 17:42 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Nick Piggin, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen



On Sat, 28 Feb 2009, Arjan van de Ven wrote:
> 
> it invalidates all caches in the hierarchy

Yeah, now that I look at the intel pdf's, I see that.

> afaik this is what Intel cpus do; but I also thought this behavior was
> quite architectural as well...

Ok, I really think we should definitely not use non-temporal stores for 
anything smaller than one full page in that case. In fact, I wonder if 
any of the old streaming benchmarks are even true. I thought it would 
still stay in the L3, but yes, it literally seems to make the access 
totally noncached and WC.

That's almost unacceptable in the long run. With a 8MB L3 cache - and a 
compile sequence, do we really want to go out to memory to write the .S 
file, and then have the assembler go out to memory to read it back? For a 
compile, that _probably_ is all fine (the compiler in particular will have 
enough data structures around that it's not going to fit in the cache 
anyway), but I'm seeing leaner compilers and other cases where forcing 
things out all the way on the bus is simply the wrong thing.

			Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 17:42                           ` Linus Torvalds
@ 2009-02-28 17:53                             ` Arjan van de Ven
  2009-02-28 18:05                             ` Andi Kleen
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 50+ messages in thread
From: Arjan van de Ven @ 2009-02-28 17:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Sat, 28 Feb 2009 09:42:18 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> That's almost unacceptable in the long run. With a 8MB L3 cache - and
> a compile sequence, do we really want to go out to memory to write
> the .S file, and then have the assembler go out to memory to read it
> back? For a compile, that _probably_ is all fine (the compiler in
> particular will have enough data structures around that it's not
> going to fit in the cache anyway), but I'm seeing leaner compilers
> and other cases where forcing things out all the way on the bus is
> simply the wrong thing.

non-temporal is almost always wrong

one of the valid cases is where you KNOW you'll be doing dma for all of
the memory very shortly (say, an O_SYNC/O_DIRECT write of 4Kb) since
that would tend to evict it from cache anyway.

but beyond that... not so much.

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 17:42                           ` Linus Torvalds
  2009-02-28 17:53                             ` Arjan van de Ven
@ 2009-02-28 18:05                             ` Andi Kleen
  2009-02-28 18:27                             ` Ingo Molnar
  2009-03-01  0:06                             ` David Miller
  3 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2009-02-28 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Ingo Molnar, Nick Piggin, Salman Qazi, davem,
	linux-kernel, Thomas Gleixner, H. Peter Anvin

Linus Torvalds <torvalds@linux-foundation.org> writes:

> That's almost unacceptable in the long run. With a 8MB L3 cache - and a 
> compile sequence, do we really want to go out to memory to write the .S 
> file, and then have the assembler go out to memory to read it back? For 

One possible way would be to make the threshold dependent on the LLC
cache size, so that let's say an Atom still prefers to not pollute
its caches.  But I agree with you that it's probably better to 
just drop the nocache copies completely.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 17:42                           ` Linus Torvalds
  2009-02-28 17:53                             ` Arjan van de Ven
  2009-02-28 18:05                             ` Andi Kleen
@ 2009-02-28 18:27                             ` Ingo Molnar
  2009-02-28 18:39                               ` Arjan van de Ven
  2009-02-28 18:52                               ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Linus Torvalds
  2009-03-01  0:06                             ` David Miller
  3 siblings, 2 replies; 50+ messages in thread
From: Ingo Molnar @ 2009-02-28 18:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Nick Piggin, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 28 Feb 2009, Arjan van de Ven wrote:
> > 
> > it invalidates all caches in the hierarchy
> 
> Yeah, now that I look at the intel pdf's, I see that.
> 
> > afaik this is what Intel cpus do; but I also thought this 
> > behavior was quite architectural as well...
> 
> Ok, I really think we should definitely not use non-temporal 
> stores for anything smaller than one full page in that case. 
> In fact, I wonder if any of the old streaming benchmarks 
> are even true. I thought it would still stay in the L3, but 
> yes, it literally seems to make the access totally noncached 
> and WC.
> 
> That's almost unacceptable in the long run. With a 8MB L3 
> cache - and a compile sequence, do we really want to go out to 
> memory to write the .S file, and then have the assembler go 
> out to memory to read it back? For a compile, that _probably_ 
> is all fine (the compiler in particular will have enough data 
> structures around that it's not going to fit in the cache 
> anyway), but I'm seeing leaner compilers and other cases where 
> forcing things out all the way on the bus is simply the wrong 
> thing.

with the 'total' cutoff we could cut off at a higher place, say 
64K. But that would be rather arbitrary, not backed up by real 
numbers.

OTOH, given how draconian non-temporal stores are, i'm leaning 
towards removing them from the x86 code altogether. If it matters 
to performance somewhere it can be reintroduced, based on really 
well backed up numbers.
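To be concrete, the 'total' based cutoff amounts to no more than this
(the 64K constant is exactly the arbitrary number mentioned above, not
a measured one, and the helper name is made up for illustration):

```c
#include <stddef.h>

/* Arbitrary threshold -- not backed up by numbers. */
#define NOCACHE_TOTAL_CUTOFF	(64 * 1024)

/*
 * Sketch: pick the nontemporal copy only when the *total* write is
 * large, regardless of how generic_perform_write() chunks it into
 * per-page copies.  Small writes always take the cached path.
 */
static int use_nocache(size_t total)
{
	return total >= NOCACHE_TOTAL_CUTOFF;
}
```

The whole dispute is about whether any such constant can be right
across microarchitectures.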

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 18:27                             ` Ingo Molnar
@ 2009-02-28 18:39                               ` Arjan van de Ven
  2009-03-02 10:39                                 ` [PATCH] x86, mm: dont use non-temporal stores in pagecache accesses Ingo Molnar
  2009-02-28 18:52                               ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: Arjan van de Ven @ 2009-02-28 18:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nick Piggin, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Sat, 28 Feb 2009 19:27:59 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> OTOH, given how draconian non-temporal stores are, i'm leaning 
> towards removing them from the x86 code altogether. If it matters 
> to performance somewhere it can be reintroduced, based on really 
> well backed up numbers.

I think that is mostly the right approach; O_DIRECT could be the
exception to that (because there we don't only wipe it from the cpu
cache due to DMA, it even gets wiped from the pagecache!).
But I can see that being too special of a case to care about in the
grand scheme of things (although the database people will now get upset
with me ;-)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 18:27                             ` Ingo Molnar
  2009-02-28 18:39                               ` Arjan van de Ven
@ 2009-02-28 18:52                               ` Linus Torvalds
  2009-03-01 14:19                                 ` Nick Piggin
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2009-02-28 18:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Nick Piggin, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen



On Sat, 28 Feb 2009, Ingo Molnar wrote:
> 
> OTOH, given how draconian non-temporal stores are, i'm leaning 
> towards removing them from the x86 code altogether. If it matters 
> to performance somewhere it can be reintroduced, based on really 
> well backed up numbers.

It would be interesting to see if we could instead base the decision on 
what we really do care about, namely going to do IO.

And the thing is, in this path we _do_ kind of know that. The caller 
(normally generic_perform_write) already does that whole 
balance_dirty_pages_ratelimited() thing.

So rather than passing in the "total_size" thing, we _could_ pass in 
something that is based on

 - are we O_DIRECT? If so: use uncached
 - perhaps: are we really _really_ large? If so: use uncached, we know the 
   caches aren't going to capture it.
 - are we starting writeout due to dirty page balancing: if so, use 
   uncached.

so we could just add a "nocached" flag to that whole 
__copy_from_user[_inatomic]_nocache thing, and we could make 
balance_dirty_pages_ratelimited() return a flag stating whether we're 
starting write-out or not.

Of course, the dirty writeout will always happen to _old_ pages, so we'll 
actually get notified too late, but assuming you have some steady state 
"heavy IO", it would presumably still be a reasonable heuristic.

But on the other hand, I could personally certainly also imagine just not 
doing that whole uncached thing at all. Myself, I tend to care about the 
performance of the cached case much more than some odd iozone thing. But 
others will have different priorities..

		Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 17:42                           ` Linus Torvalds
                                               ` (2 preceding siblings ...)
  2009-02-28 18:27                             ` Ingo Molnar
@ 2009-03-01  0:06                             ` David Miller
  2009-03-01  0:40                               ` Andi Kleen
  3 siblings, 1 reply; 50+ messages in thread
From: David Miller @ 2009-03-01  0:06 UTC (permalink / raw)
  To: torvalds; +Cc: arjan, mingo, nickpiggin, sqazi, linux-kernel, tglx, hpa, andi

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat, 28 Feb 2009 09:42:18 -0800 (PST)

> On Sat, 28 Feb 2009, Arjan van de Ven wrote:
> > 
> > it invalidates all caches in the hierarchy
> 
> Yeah, now that I look at the intel pdf's, I see that.
> 
> > afaik this is what Intel cpus do; but I also thought this behavior was
> > quite architectural as well...
> 
> Ok, I really think we should definitely not use non-temporal stores for 
> anything smaller than one full page in that case. In fact, I wonder if 
> any of the old streaming benchmarks are even true. I thought it would 
> still stay in the L3, but yes, it literally seems to make the access 
> totally noncached and WC.
> 
> That's almost unacceptable in the long run. With a 8MB L3 cache - and a 
> compile sequence, do we really want to go out to memory to write the .S 
> file, and then have the assembler go out to memory to read it back? For a 
> compile, that _probably_ is all fine (the compiler in particular will have 
> enough data structures around that it's not going to fit in the cache 
> anyway), but I'm seeing leaner compilers and other cases where forcing 
> things out all the way on the bus is simply the wrong thing.

I think this is an accurate analysis as well; it's really unfortunate
the non-temporal stuff on x86 doesn't preserve existing cache lines
when present.

I thought that was the whole point.  Don't pollute the caches, but
if cache lines are already loaded there, use them and don't purge!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  0:40                               ` Andi Kleen
@ 2009-03-01  0:28                                 ` H. Peter Anvin
  2009-03-01  0:38                                   ` Arjan van de Ven
  0 siblings, 1 reply; 50+ messages in thread
From: H. Peter Anvin @ 2009-03-01  0:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, torvalds, arjan, mingo, nickpiggin, sqazi,
	linux-kernel, tglx

Andi Kleen wrote:
>> I think this is an accurate analysis as well, it's really unfortunate
>> the non-temporal stuff on x86 doesn't preserve existing cache lines
>> when present.
>>
>> I thought that was the whole point.  Don't pollute the caches, but
>> if cache lines are already loaded there, use them and don't purge!
> 
> x86 actually supports that, it's just not done through movnt.
> 
> You can do that on x86 by using PREFETCHNTA (or T0/T1/T2 for specific
> cache levels). Typically this is implemented by forcing the cache line 
> to only a single way of the cache (so only using max 1/8 or so of your last 
> level cache) 
> 
> I'm not sure how it interacts with REP MOVS* though, this internally
> tends to do additional magic for larger copies.

The PREFETCHNTA stuff is really for reads rather than writes, however.
Yes, you can prefetch the cache line you're about to overwrite, but I
suspect (I haven't verified) that you lose out on whole-line
optimizations that way.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  0:28                                 ` H. Peter Anvin
@ 2009-03-01  0:38                                   ` Arjan van de Ven
  2009-03-01  1:48                                     ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Arjan van de Ven @ 2009-03-01  0:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, David Miller, torvalds, mingo, nickpiggin, sqazi,
	linux-kernel, tglx

On Sat, 28 Feb 2009 16:28:58 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:

> Andi Kleen wrote:
> >> I think this is an accurate analysis as well, it's really
> >> unfortunate the non-temporal stuff on x86 doesn't preserve
> >> existing cache lines when present.
> >>
> >> I thought that was the whole point.  Don't pollute the caches, but
> >> if cache lines are already loaded there, use them and don't purge!
> > 
> > x86 actually supports that, it's just not done through movnt.
> > 
> > You can do that on x86 by using PREFETCHNTA (or T0/T1/T2 for
> > specific cache levels). Typically this is implemented by forcing
> > the cache line to only a single way of the cache (so only using max
> > 1/8 or so of your last level cache) 
> > 
> > I'm not sure how it interacts with REP MOVS* though, this internally
> > tends to do additional magic for larger copies.
> 
> The PREFETCHNTA stuff is really for reads rather than writes, however.
> Yes, you can prefetch the cache line you're about to overwrite, but I
> suspect (I haven't verified) that you lose out on whole-line
> optimizations that way.

the entire point of using movntq and friends was to save half the
memory bandwidth to not pull it into the cache before writing...
.... so bad idea to do prefetch<anything>

> 
> 	-hpa
> 


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  0:06                             ` David Miller
@ 2009-03-01  0:40                               ` Andi Kleen
  2009-03-01  0:28                                 ` H. Peter Anvin
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2009-03-01  0:40 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, arjan, mingo, nickpiggin, sqazi, linux-kernel, tglx, hpa, andi

> I think this is an accurate analysis as well, it's really unfortunate
> the non-temporal stuff on x86 doesn't preserve existing cache lines
> when present.
> 
> I thought that was the whole point.  Don't pollute the caches, but
> if cache lines are already loaded there, use them and don't purge!

x86 actually supports that, it's just not done through movnt.

You can do that on x86 by using PREFETCHNTA (or T0/T1/T2 for specific
cache levels). Typically this is implemented by forcing the cache line 
to only a single way of the cache (so only using max 1/8 or so of your last 
level cache) 

I'm not sure how it interacts with REP MOVS* though, this internally
tends to do additional magic for larger copies.
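For reference, the compiler-visible way to ask for that kind of hint
is the portable prefetch builtin (this is only a sketch; the 64-byte
line size is an assumption, and GCC lowers locality 0 to PREFETCHNTA
on x86):

```c
#include <stddef.h>

/*
 * Sketch: copy with a non-temporal *read* hint on the source, as
 * opposed to movnt stores on the destination.  The third argument
 * (locality 0 == "no temporal locality") maps to the NTA hint.
 */
static void copy_nta_read(void *dst, const void *src, size_t n)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	size_t i;

	for (i = 0; i < n; i += 64)		/* assume 64-byte lines */
		__builtin_prefetch(s + i, 0, 0);	/* read, non-temporal */
	for (i = 0; i < n; i++)
		d[i] = s[i];
}
```

As hpa notes below, this addresses the read side; it does not avoid
the write-allocate on the destination the way movnt does.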

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  1:48                                     ` Andi Kleen
@ 2009-03-01  1:38                                       ` Arjan van de Ven
  2009-03-01  1:40                                         ` H. Peter Anvin
  2009-03-01  2:07                                         ` Andi Kleen
  0 siblings, 2 replies; 50+ messages in thread
From: Arjan van de Ven @ 2009-03-01  1:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Andi Kleen, David Miller, torvalds, mingo,
	nickpiggin, sqazi, linux-kernel, tglx

On Sun, 1 Mar 2009 02:48:22 +0100
Andi Kleen <andi@firstfloor.org> wrote:

> > the entire point of using movntq and friends was to save half the
> 
> I thought the point was to not pollute caches? At least that is 
> what I remember being told when I merged the patch.
> 

the reason that movntq and co are faster is because you avoid the
write-allocate behavior of the caches....

the cache polluting part of it I find hard to buy for general use (as
this discussion shows)... that will be extremely hard to measure as
a real huge thing, while the WA part is like a 1.5x to 2x thing.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  1:38                                       ` Arjan van de Ven
@ 2009-03-01  1:40                                         ` H. Peter Anvin
  2009-03-01 14:06                                           ` Nick Piggin
  2009-03-01  2:07                                         ` Andi Kleen
  1 sibling, 1 reply; 50+ messages in thread
From: H. Peter Anvin @ 2009-03-01  1:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, David Miller, torvalds, mingo, nickpiggin, sqazi,
	linux-kernel, tglx

Arjan van de Ven wrote:
> 
> the reason that movntq and co are faster is because you avoid the
> write-allocate behavior of the caches....
> 
> the cache polluting part of it I find hard to buy for general use (as
> this discussion shows)... that will be extremely hard to measure as
> a real huge thing, while the WA part is like a 1.5x to 2x thing.
> 

Note that hardware *can* (which is not the same thing as hardware
*will*) elide the write-allocate behavior.  We did that at Transmeta for
rep movs and certain other instructions which provably filled in entire
cache lines.  I haven't investigated if newer Intel CPUs do that in the
"fast rep movs" case.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  0:38                                   ` Arjan van de Ven
@ 2009-03-01  1:48                                     ` Andi Kleen
  2009-03-01  1:38                                       ` Arjan van de Ven
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2009-03-01  1:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: H. Peter Anvin, Andi Kleen, David Miller, torvalds, mingo,
	nickpiggin, sqazi, linux-kernel, tglx

> the entire point of using movntq and friends was to save half the

I thought the point was to not pollute caches? At least that is 
what I remember being told when I merged the patch.

-andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  1:38                                       ` Arjan van de Ven
  2009-03-01  1:40                                         ` H. Peter Anvin
@ 2009-03-01  2:07                                         ` Andi Kleen
  1 sibling, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2009-03-01  2:07 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, H. Peter Anvin, David Miller, torvalds, mingo,
	nickpiggin, sqazi, linux-kernel, tglx

On Sat, Feb 28, 2009 at 05:38:13PM -0800, Arjan van de Ven wrote:
> On Sun, 1 Mar 2009 02:48:22 +0100
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > the entire point of using movntq and friends was to save half the
> > 
> > I thought the point was to not pollute caches? At least that is 
> > what I remember being told when I merged the patch.
> > 
> 
> the reason that movntq and co are faster is because you avoid the
> write-allocate behavior of the caches....

Not faster than rep ; movs, which does similar magic anyway. Of course, it being
magic, it can vary a lot and is somewhat unpredictable.

Also, in my experience movnt is not actually that much faster for small
transfers (as in what the kernel mostly does).

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01  1:40                                         ` H. Peter Anvin
@ 2009-03-01 14:06                                           ` Nick Piggin
  2009-03-02  4:46                                             ` H. Peter Anvin
  2009-03-02 21:16                                             ` Linus Torvalds
  0 siblings, 2 replies; 50+ messages in thread
From: Nick Piggin @ 2009-03-01 14:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Arjan van de Ven, Andi Kleen, David Miller, torvalds, mingo,
	sqazi, linux-kernel, tglx

On Sunday 01 March 2009 12:40:51 H. Peter Anvin wrote:
> Arjan van de Ven wrote:
> > the reason that movntq and co are faster is because you avoid the
> > write-allocate behavior of the caches....
> >
> > the cache polluting part of it I find hard to buy for general use (as
> > this discussion shows)... that will be extremely hard to measure as
> > a real huge thing, while the WA part is like a 1.5x to 2x thing.
>
> Note that hardware *can* (which is not the same thing as hardware
> *will*) elide the write-allocate behavior.  We did that at Transmeta for
> rep movs and certain other instructions which provably filled in entire
> cache lines.  I haven't investigated if newer Intel CPUs do that in the
> "fast rep movs" case.

I would expect any high performance CPU these days to combine entries
in the store queue, even for normal store instructions (especially for
linear memcpy patterns). Isn't this likely to be the case?


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-02-28 18:52                               ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Linus Torvalds
@ 2009-03-01 14:19                                 ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-03-01 14:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Arjan van de Ven, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen

On Sunday 01 March 2009 05:52:19 Linus Torvalds wrote:
> On Sat, 28 Feb 2009, Ingo Molnar wrote:
> > OTOH, given how draconian non-temporal stores are, i'm leaning
> > towards removing them from the x86 code altogether. If it matters
> > to performance somewhere it can be reintroduced, based on really
> > well backed up numbers.
>
> It would be interesting to see if we could instead base the decision on
> what we really do care about, namely going to do IO.
>
> And the thing is, in this path we _do_ kind of know that. The caller
> (normally generic_perform_write) already does that whole
> balance_dirty_pages_ratelimited() thing.

>
> So rather than passing in the "total_size" thing, we _could_ pass in
> something that is based on
>
>  - are we O_DIRECT? If so: use uncached

Zero copies in that case :) O_SYNC, you mean.


>  - perhaps: are we really _really_ large? If so: use uncached, we know the
>    caches aren't going to capture it.

But how large, and which caches? I wouldn't expect very many apps at all
to pass buffers larger than even quite small LLC sizes.


>  - are we starting writeout due to dirty page balancing: if so, use
>    uncached.

Although that should tend to write out oldest written data, whereas
the newly written data might still benefit from being warm in cache
(eg. in the cpp|cc|as|ld case).


> But on the other hand, I could personally certainly also imagine just not
> doing that whole uncached thing at all. Myself, I tend to care about the
> performance of the cached case much more than some odd iozone thing. But
> others will have different priorities..

FWIW I agree with you. rep mov I think gives the CPU a good window into
proceedings and usually should do a good job.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01 14:06                                           ` Nick Piggin
@ 2009-03-02  4:46                                             ` H. Peter Anvin
  2009-03-02  6:18                                               ` Nick Piggin
  2009-03-02 21:16                                             ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: H. Peter Anvin @ 2009-03-02  4:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Arjan van de Ven, Andi Kleen, David Miller, torvalds, mingo,
	sqazi, linux-kernel, tglx

Nick Piggin wrote:
> 
> I would expect any high performance CPU these days to combine entries
> in the store queue, even for normal store instructions (especially for
> linear memcpy patterns). Isn't this likely to be the case?
> 

Actually, that is often not the case simply because it doesn't buy that
much.  The big win comes when you don't read a whole cache line in from
memory, but that is a property of the cache, not the store queue.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-02  4:46                                             ` H. Peter Anvin
@ 2009-03-02  6:18                                               ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-03-02  6:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Arjan van de Ven, Andi Kleen, David Miller, torvalds, mingo,
	sqazi, linux-kernel, tglx

On Monday 02 March 2009 15:46:03 H. Peter Anvin wrote:
> Nick Piggin wrote:
> > I would expect any high performance CPU these days to combine entries
> > in the store queue, even for normal store instructions (especially for
> > linear memcpy patterns). Isn't this likely to be the case?
>
> Actually, that is often not the case simply because it doesn't buy that
> much.  The big win comes when you don't read a whole cache line in from
> memory, but that is a property of the cache, not the store queue.

Hm, maybe I'm confused. As far as I thought, you could avoid the
RMW write allocate behaviour by bypassing the cache on a store
miss, or combining stores into cacheline blocks before they leave
the store queue.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH] x86, mm: dont use non-temporal stores in pagecache accesses
  2009-02-28 18:39                               ` Arjan van de Ven
@ 2009-03-02 10:39                                 ` Ingo Molnar
  0 siblings, 0 replies; 50+ messages in thread
From: Ingo Molnar @ 2009-03-02 10:39 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Nick Piggin, Salman Qazi, davem, linux-kernel,
	Thomas Gleixner, H. Peter Anvin, Andi Kleen


* Arjan van de Ven <arjan@infradead.org> wrote:

> On Sat, 28 Feb 2009 19:27:59 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > OTOH, given how draconian non-temporal stores are, i'm leaning 
> > towards removing them from the x86 code altogether. If it matters 
> > to performance somewhere it can be reintroduced, based on really 
> > well backed up numbers.
> 
> I think that is mostly the right approach; O_DIRECT could be 
> the exception to that (because there we don't only wipe it 
> from the cpu cache due to DMA, it even gets wiped from the 
> pagecache!). But I can see that being too special of a case to 
> care about in the grand scheme of things (although the 
> database people will now get upset with me ;-)

ok, i've removed most of the non-temporal stores hacks from the 
pagecache code on x86 - see the commit below, queued up in 
tip:x86/mm. (Other architectures did not make use of the 
_nocache() primitives at all.)

The _nocache() primitives are still there and are also still 
used by the Intel 3D driver code - passing buffers from the CPU 
to the GPU is an arguably correct special-case use of MOVNT.

O_SYNC (or any other thing iozone benchmarks might hit) will 
need to try a cleaner approach that does not affect the cached 
path. But, i'd not be surprised if it was all fine as-is as well 
- O_DIRECT IO wont normally get copied anyway.

	Ingo

----------------->
>From f180053694b43d5714bf56cb95499a3c32ff155c Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 2 Mar 2009 11:00:57 +0100
Subject: [PATCH] x86, mm: dont use non-temporal stores in pagecache accesses

Impact: standardize IO on cached ops

On modern CPUs it is almost always a bad idea to use non-temporal stores,
as the regression fixed in this commit has shown:

  30d697f: x86: fix performance regression in write() syscall

The kernel simply has no good information about whether using non-temporal
stores is a good idea or not - and trying to add heuristics only increases
complexity and inserts fragility.

The regression on cached write()s took very long to be found - over two
years. So dont take any chances and let the hardware decide how it makes
use of its caches.

The only exception is drivers/gpu/drm/i915/i915_gem.c: there we are
absolutely sure that another entity (the GPU) will pick up the dirty
data immediately and that the CPU will not touch that data before the
GPU will.

Also, keep the _nocache() primitives to make it easier for people to
experiment with these details. There may be more clear-cut cases where
non-cached copies can be used, outside of filemap.c.

Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/uaccess_32.h |    4 ++--
 arch/x86/include/asm/uaccess_64.h |   25 +++++++------------------
 drivers/gpu/drm/i915/i915_gem.c   |    2 +-
 include/linux/uaccess.h           |    4 ++--
 mm/filemap.c                      |   11 ++++-------
 mm/filemap_xip.c                  |    2 +-
 6 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_32.h b/arch/x86/include/asm/uaccess_32.h
index a0ba613..5e06259 100644
--- a/arch/x86/include/asm/uaccess_32.h
+++ b/arch/x86/include/asm/uaccess_32.h
@@ -157,7 +157,7 @@ __copy_from_user(void *to, const void __user *from, unsigned long n)
 }
 
 static __always_inline unsigned long __copy_from_user_nocache(void *to,
-		const void __user *from, unsigned long n, unsigned long total)
+				const void __user *from, unsigned long n)
 {
 	might_fault();
 	if (__builtin_constant_p(n)) {
@@ -180,7 +180,7 @@ static __always_inline unsigned long __copy_from_user_nocache(void *to,
 
 static __always_inline unsigned long
 __copy_from_user_inatomic_nocache(void *to, const void __user *from,
-				  unsigned long n, unsigned long total)
+				  unsigned long n)
 {
        return __copy_from_user_ll_nocache_nozero(to, from, n);
 }
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index dcaa040..8cc6873 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -188,29 +188,18 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
 extern long __copy_user_nocache(void *dst, const void __user *src,
 				unsigned size, int zerorest);
 
-static inline int __copy_from_user_nocache(void *dst, const void __user *src,
-				   unsigned size, unsigned long total)
+static inline int
+__copy_from_user_nocache(void *dst, const void __user *src, unsigned size)
 {
 	might_sleep();
-	/*
-	 * In practice this limit means that large file write()s
-	 * which get chunked to 4K copies get handled via
-	 * non-temporal stores here. Smaller writes get handled
-	 * via regular __copy_from_user():
-	 */
-	if (likely(total >= PAGE_SIZE))
-		return __copy_user_nocache(dst, src, size, 1);
-	else
-		return __copy_from_user(dst, src, size);
+	return __copy_user_nocache(dst, src, size, 1);
 }
 
-static inline int __copy_from_user_inatomic_nocache(void *dst,
-	    const void __user *src, unsigned size, unsigned total)
+static inline int
+__copy_from_user_inatomic_nocache(void *dst, const void __user *src,
+				  unsigned size)
 {
-	if (likely(total >= PAGE_SIZE))
-		return __copy_user_nocache(dst, src, size, 0);
-	else
-		return __copy_from_user_inatomic(dst, src, size);
+	return __copy_user_nocache(dst, src, size, 0);
 }
 
 unsigned long
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6b209db..8185766 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -215,7 +215,7 @@ fast_user_write(struct io_mapping *mapping,
 
 	vaddr_atomic = io_mapping_map_atomic_wc(mapping, page_base);
 	unwritten = __copy_from_user_inatomic_nocache(vaddr_atomic + page_offset,
-						      user_data, length, length);
+						      user_data, length);
 	io_mapping_unmap_atomic(vaddr_atomic);
 	if (unwritten)
 		return -EFAULT;
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 6f3c603..6b58367 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -41,13 +41,13 @@ static inline void pagefault_enable(void)
 #ifndef ARCH_HAS_NOCACHE_UACCESS
 
 static inline unsigned long __copy_from_user_inatomic_nocache(void *to,
-		const void __user *from, unsigned long n, unsigned long total)
+				const void __user *from, unsigned long n)
 {
 	return __copy_from_user_inatomic(to, from, n);
 }
 
 static inline unsigned long __copy_from_user_nocache(void *to,
-		const void __user *from, unsigned long n, unsigned long total)
+				const void __user *from, unsigned long n)
 {
 	return __copy_from_user(to, from, n);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 60fd567..126d397 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1816,14 +1816,14 @@ EXPORT_SYMBOL(file_remove_suid);
 static size_t __iovec_copy_from_user_inatomic(char *vaddr,
 			const struct iovec *iov, size_t base, size_t bytes)
 {
-	size_t copied = 0, left = 0, total = bytes;
+	size_t copied = 0, left = 0;
 
 	while (bytes) {
 		char __user *buf = iov->iov_base + base;
 		int copy = min(bytes, iov->iov_len - base);
 
 		base = 0;
-		left = __copy_from_user_inatomic_nocache(vaddr, buf, copy, total);
+		left = __copy_from_user_inatomic(vaddr, buf, copy);
 		copied += copy;
 		bytes -= copy;
 		vaddr += copy;
@@ -1851,9 +1851,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 	if (likely(i->nr_segs == 1)) {
 		int left;
 		char __user *buf = i->iov->iov_base + i->iov_offset;
-
-		left = __copy_from_user_inatomic_nocache(kaddr + offset,
-							buf, bytes, bytes);
+		left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
 		copied = bytes - left;
 	} else {
 		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
@@ -1881,8 +1879,7 @@ size_t iov_iter_copy_from_user(struct page *page,
 	if (likely(i->nr_segs == 1)) {
 		int left;
 		char __user *buf = i->iov->iov_base + i->iov_offset;
-
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes, bytes);
+		left = __copy_from_user(kaddr + offset, buf, bytes);
 		copied = bytes - left;
 	} else {
 		copied = __iovec_copy_from_user_inatomic(kaddr + offset,
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index bf54f8a..0c04615 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -354,7 +354,7 @@ __xip_file_write(struct file *filp, const char __user *buf,
 			break;
 
 		copied = bytes -
-			__copy_from_user_nocache(xip_mem + offset, buf, bytes, bytes);
+			__copy_from_user_nocache(xip_mem + offset, buf, bytes);
 
 		if (likely(copied > 0)) {
 			status = copied;

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-01 14:06                                           ` Nick Piggin
  2009-03-02  4:46                                             ` H. Peter Anvin
@ 2009-03-02 21:16                                             ` Linus Torvalds
  2009-03-02 21:25                                               ` Ingo Molnar
  2009-03-03  4:20                                               ` Nick Piggin
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2009-03-02 21:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: H. Peter Anvin, Arjan van de Ven, Andi Kleen, David Miller,
	mingo, sqazi, linux-kernel, tglx



On Mon, 2 Mar 2009, Nick Piggin wrote:
> 
> I would expect any high performance CPU these days to combine entries
> in the store queue, even for normal store instructions (especially for
> linear memcpy patterns). Isn't this likely to be the case?

None of this really matters.

The big issue is that before you can do any write to any cacheline, if the 
memory is cacheable, it needs the cache coherency protocol to synchronize 
with any other CPU's that may have that line in the cache.

The _only_ time a write is "free" is when you already have that cacheline 
in your own cache, and in an "exclusive" state. If that is the case, then 
you know that you don't need to do anything else.

In _any_ other case, before you do the write, you need to make sure that 
no other CPU in the system has that line in its cache. Whether you do that 
with a "write and invalidate" model (which would be how a store buffer 
would do it or a write-through cache would work), or whether you do it 
with a "acquire exclusive cacheline" (which is how the cache coherency 
protocol would do it), it's going to end up using cache coherency 
bandwidth.

Of course, what will be the limiting factor is unclear. On a single-socket 
thing, you don't have any cache coherency issues, and the only bandwidth 
you'd end up using is the actual memory write at the memory controller 
(which may be on-die, and entirely separate from the cache coherency 
protocol). It may be idle and the write queue may be deep enough that you 
reach memory speeds and the write buffer is the optimal approach.

On many sockets, the limiting factor will almost certainly be the cache 
coherency overhead (since the cache coherency traffic needs to go to _all_ 
sockets, rather than just one stream to memory), at least unless you have 
a good cache coherency filter that can filter out part of the traffic 
based on whether it could be cached or not on some socket(s).

IOW, it's almost impossible to tell what is the best approach. It will 
depend on number of sockets, it will depend on size of cache, and it will 
depend on the capabilities and performance of the memory controllers vs 
the cache coherency protocol.

On a "single shared bus" model, the "write with invalidate" is fine, and 
it basically ends up working a lot like a single socket even if you 
actually have multiple sockets - it just won't scale much beyond two 
sockets. With HT or QPI, things are different, and the presence or absence 
of a snoop filter could make a big difference for 4+ socket situations.

There simply is no single answer. 

And we really should keep that in mind. There is no right answer, and the 
right answer will depend on hardware. Playing cache games in software is 
almost always futile. It can be a huge improvement, but it can be a huge 
deprovement too, and it really tends to be only worth it if you (a) know 
your hardware really quite well and (b) know your _load_ pretty well too.

We can play games in the kernel. We do know how many sockets there are. We 
do know the cache size. We _could_ try to make an educated guess at 
whether the next user of the data will be DMA or not. So there are 
unquestionably heuristics we could apply, but I also do suspect that 
they'd inevitably be pretty arbitrary.

I suspect that we could make some boot-time (or maybe CPU hotplug time) 
decision that simply just sets a threshold value for when it is worth 
using non-temporal stores. With smaller caches, and with a single socket 
(or a single bus), it likely makes sense to use non-temporal stores 
earlier. 

But even with some rough heuristic, it will be wrong part of the time. So 
I think "simple and predictable" in the end tends to be better than 
"complex and still known to be broken".

Btw, the "simple and predictable" could literally look at _where_ in the 
file the IO is. Because I know there are papers on the likelihood of 
re-use of data depending on where in the file it is written. Data written 
to low offsets is more likely to be accessed again (think temp-files), 
while data written to big offsets are much more likely to be random or to 
be written out (think databases or simply just large streaming files).

So I suspect a "simple and predictable" algorithm could literally be 
something like

 - use nontemporal stores only if you are writing a whole page, and the 
   byte offset of the page is larger than 'x', where 'x' may optionally 
   even depend on size of cache.

But removing it entirely may be fine too.

What I _don't_ think is fine is to think that you've "solved" it, or that 
you should even try!

			Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-02 21:16                                             ` Linus Torvalds
@ 2009-03-02 21:25                                               ` Ingo Molnar
  2009-03-03  4:30                                                 ` Nick Piggin
  2009-03-03  4:20                                               ` Nick Piggin
  1 sibling, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2009-03-02 21:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, H. Peter Anvin, Arjan van de Ven, Andi Kleen,
	David Miller, sqazi, linux-kernel, tglx


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> We can play games in the kernel. We do know how many sockets 
> there are. We do know the cache size. We _could_ try to make 
> an educated guess at whether the next user of the data will be 
> DMA or not. So there are unquestionably heuristics we could 
> apply, but I also do suspect that they'd inevitably be pretty 
> arbitrary.

There's a higher-level meta-code argument to consider as well, 
and i find it equally important: finding this rather obvious and 
easy to measure performance regression caused by the 
introduction of MOVNT took us two years.

That's _way_ too long - and adding more heuristics and more 
complexity will just increase this latency. (as we will create 
small niches of special cases with special workloads where we 
might or might not regress)

So, if any such change is done, we can only do it if we have the 
latency of performance-regression-finding on the order of days 
or at most weeks - not years.

	Ingo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-02 21:16                                             ` Linus Torvalds
  2009-03-02 21:25                                               ` Ingo Molnar
@ 2009-03-03  4:20                                               ` Nick Piggin
  2009-03-03  9:02                                                 ` Ingo Molnar
  1 sibling, 1 reply; 50+ messages in thread
From: Nick Piggin @ 2009-03-03  4:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Arjan van de Ven, Andi Kleen, David Miller,
	mingo, sqazi, linux-kernel, tglx

On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> On Mon, 2 Mar 2009, Nick Piggin wrote:
> > I would expect any high performance CPU these days to combine entries
> > in the store queue, even for normal store instructions (especially for
> > linear memcpy patterns). Isn't this likely to be the case?
>
> None of this really matters.

Well that's just what I was replying to. Of course nontemporal/uncached
stores can't avoid cc operations either, but somebody was hoping that
they would avoid the write-allocate / RMW behaviour. I just replied because
I think that modern CPUs can combine stores in their store queues to get
the same result for cacheable stores.

Of course it doesn't make it free especially if it is a cc protocol that
has to go on the interconnect anyway. But avoiding the RAM read is a
good thing anyway.


> The big issue is that before you can do any write to any cacheline, if the
> memory is cacheable, it needs the cache coherency protocol to synchronize
> with any other CPU's that may have that line in the cache.
>
> The _only_ time a write is "free" is when you already have that cacheline
> in your own cache, and in an "exclusive" state. If that is the case, then
> you know that you don't need to do anything else.
>
> In _any_ other case, before you do the write, you need to make sure that
> no other CPU in the system has that line in its cache. Whether you do that
> with a "write and invalidate" model (which would be how a store buffer
> would do it or a write-through cache would work), or whether you do it
> with a "acquire exclusive cacheline" (which is how the cache coherency
> protocol would do it), it's going to end up using cache coherency
> bandwidth.
>
> Of course, what will be the limiting factor is unclear. On a single-socket
> thing, you don't have any cache coherency issues, and the only bandwidth
> you'd end up using is the actual memory write at the memory controller
> (which may be on-die, and entirely separate from the cache coherency
> protocol). It may be idle and the write queue may be deep enough that you
> reach memory speeds and the write buffer is the optimal approach.
>
> On many sockets, the limiting factor will almost certainly be the cache
> coherency overhead (since the cache coherency traffic needs to go to _all_
> sockets, rather than just one stream to memory), at least unless you have
> a good cache coherency filter that can filter out part of the traffic
> based on whether it could be cached or not on some socket(s).
>
> IOW, it's almost impossible to tell what is the best approach. It will
> depend on number of sockets, it will depend on size of cache, and it will
> depend on the capabilities and performance of the memory controllers vs
> the cache coherency protocol.
>
> On a "single shared bus" model, the "write with invalidate" is fine, and
> it basically ends up working a lot like a single socket even if you
> actually have multiple sockets - it just won't scale much beyond two
> sockets. With HT or QPI, things are different, and the presence or absence
> of a snoop filter could make a big difference for 4+ socket situations.
>
> There simply is no single answer.
>
> And we really should keep that in mind. There is no right answer, and the
> right answer will depend on hardware. Playing cache games in software is
> almost always futile. It can be a huge improvement, but it can be a huge
> deprovement too, and it really tends to be only worth it if you (a) know
> your hardware really quite well and (b) know your _load_ pretty well too.
>
> We can play games in the kernel. We do know how many sockets there are. We
> do know the cache size. We _could_ try to make an educated guess at
> whether the next user of the data will be DMA or not. So there are
> unquestionably heuristics we could apply, but I also do suspect that
> they'd inevitably be pretty arbitrary.
>
> I suspect that we could make some boot-time (or maybe CPU hotplug time)
> decision that simply just sets a threshold value for when it is worth
> using non-temporal stores. With smaller caches, and with a single socket
> (or a single bus), it likely makes sense to use non-temporal stores
> earlier.
>
> But even with some rough heuristic, it will be wrong part of the time. So
> I think "simple and predictable" in the end tends to be better than
> "complex and still known to be broken".
>
> Btw, the "simple and predictable" could literally look at _where_ in the
> file the IO is. Because I know there are papers on the likelihood of
> re-use of data depending on where in the file it is written. Data written
> to low offsets is more likely to be accessed again (think temp-files),
> while data written to big offsets are much more likely to be random or to
> be written out (think databases or simply just large streaming files).
>
> So I suspect a "simple and predictable" algorithm could literally be
> something like
>
>  - use nontemporal stores only if you are writing a whole page, and the
>    byte offset of the page is larger than 'x', where 'x' may optionally
>    even depend on size of cache.
>
> But removing it entirely may be fine too.
>
> What I _don't_ think is fine is to think that you've "solved" it, or that
> you should even try!

Right. I don't know if you misunderstood me or aimed this post at the
general discussion rather than my reply specifically.

I know even if a CPU does write combining in the store buffer and even
if it does have "big-hammer" nontemporal stores like x86 apparently does,
then there are still cases where nontemporal stores will win if the data
doesn't get used by the CPU again.

I agree that if a heuristic can't get it right a *significant* amount of
time, then it is not worthwhile. Even if it gets it right a little more
often than wrong, the unpredictability is a negative factor. I agree
completely with you there :)

I would like to remove it, as in Ingo's last patch, FWIW. But I can see
obviously there are cases where nontemporal helps, so there will never be
a "right" answer.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-02 21:25                                               ` Ingo Molnar
@ 2009-03-03  4:30                                                 ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-03-03  4:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, H. Peter Anvin, Arjan van de Ven, Andi Kleen,
	David Miller, sqazi, linux-kernel, tglx

On Tuesday 03 March 2009 08:25:43 Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > We can play games in the kernel. We do know how many sockets
> > there are. We do know the cache size. We _could_ try to make
> > an educated guess at whether the next user of the data will be
> > DMA or not. So there are unquestionably heuristics we could
> > apply, but I also do suspect that they'd inevitably be pretty
> > arbitrary.
>
> There's a higher-level meta-code argument to consider as well,
> and i find it equally important: finding this rather obvious and
> easy to measure performance regression caused by the
> introduction of MOVNT took us two years.
>
> That's _way_ too long - and adding more heuristics and more
> complexity will just increase this latency. (as we will create
> small niches of special cases with special workloads where we
> might or might not regress)
>
> So, if any such change is done, we can only do it if we have the
> latency of performance-regression-finding on the order of days
> or at most weeks - not years.

Something like temporal vs nontemporal stores, into the pagecache,
is a fundamental tradeoff that will depend on userspace access
pattern that the kernel can't know about. For the kernel side of
the equation, we could query the underlying backing store to see
whether it is going to do any CPU operations on the data before
writeout, e.g. checksumming or encryption. But even then, the
latency between dirtying the pagecache and initiating the writeout
means that the data is probably not in cache any longer at the time of
writeout anyway. So the primary factor really is userspace access
patterns, I think.

In situations like that, I think the only way to really "solve"
the problem is to provide a way for userspace to ask for temporal
or nontemporal access. Yeah this is just passing the buck, and
probably most apps that try to use this will do no better job than
the kernel :) But it allows ones that really care to give the
kernel much better information.



* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-03  4:20                                               ` Nick Piggin
@ 2009-03-03  9:02                                                 ` Ingo Molnar
  2009-03-04  3:37                                                   ` Nick Piggin
  0 siblings, 1 reply; 50+ messages in thread
From: Ingo Molnar @ 2009-03-03  9:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, H. Peter Anvin, Arjan van de Ven, Andi Kleen,
	David Miller, sqazi, linux-kernel, tglx


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> > On Mon, 2 Mar 2009, Nick Piggin wrote:
> > > I would expect any high performance CPU these days to combine entries
> > > in the store queue, even for normal store instructions (especially for
> > > linear memcpy patterns). Isn't this likely to be the case?
> >
> > None of this really matters.
> 
> Well that's just what I was replying to. Of course 
> nontemporal/uncached stores can't avoid cc operations either, 
> but somebody was hoping that they would avoid the 
> write-allocate / RMW behaviour. I just replied because I think 
> that modern CPUs can combine stores in their store queues to 
> get the same result for cacheable stores.
> 
> Of course it doesn't make it free especially if it is a cc 
> protocol that has to go on the interconnect anyway. But 
> avoiding the RAM read is a good thing anyway.

Hm, why do you assume that there is a RAM read? A sufficiently 
advanced x86 CPU will have good string moves with full cacheline 
transfers - removing partial cachelines and removing the need 
for the physical read.

The cacheline still has to be flushed/queried/transferred across 
the cc domain according to the cc protocol in use, to make sure 
there's no stale cached data elsewhere, but that is not a RAM 
read and in the common case (when the address is not present in 
any cache) it can be quite cheap.

The only cost is the dirty cacheline that is left around that 
increases the flush-out pressure on the cache. (the CPU might 
still be smart about this detail too so in practice a lot of 
write-allocates might not even cause that much trouble.)

	Ingo


* Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
  2009-03-03  9:02                                                 ` Ingo Molnar
@ 2009-03-04  3:37                                                   ` Nick Piggin
  0 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2009-03-04  3:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, H. Peter Anvin, Arjan van de Ven, Andi Kleen,
	David Miller, sqazi, linux-kernel, tglx

On Tuesday 03 March 2009 20:02:52 Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> > > On Mon, 2 Mar 2009, Nick Piggin wrote:
> > > > I would expect any high performance CPU these days to combine entries
> > > > in the store queue, even for normal store instructions (especially
> > > > for linear memcpy patterns). Isn't this likely to be the case?
> > >
> > > None of this really matters.
> >
> > Well that's just what I was replying to. Of course
> > nontemporal/uncached stores can't avoid cc operations either,
> > but somebody was hoping that they would avoid the
> > write-allocate / RMW behaviour. I just replied because I think
> > that modern CPUs can combine stores in their store queues to
> > get the same result for cacheable stores.
> >
> > Of course it doesn't make it free especially if it is a cc
> > protocol that has to go on the interconnect anyway. But
> > avoiding the RAM read is a good thing anyway.
>
> Hm, why do you assume that there is a RAM read?

I don't ;) Re-read back a few posts. I thought that nontemporal stores
would not necessarily have an advantage in avoiding the write-allocate
behaviour, because I thought CPUs should combine stores in their store
buffer.

Doing some simple tests here shows that nontemporal stores take about
0.7 times as long as a rep stosq, if the destination is much larger
than cache. So the CPU isn't quite as clever as I assumed.

I can't find any references to back up my assumption, but I thought I
heard it somewhere. It might have been in relation to some powerpc CPUs
not requiring their cacheline clear instruction because they combine
store buffer entries. But I could be way off.


> A sufficiently
> advanced x86 CPU will have good string moves with full cacheline
> transfers - removing partial cachelines and removing the need
> for the physical read.

I thought this should be the case even with a plain sequence of normal
stores. But that's taking about 1.4 times as long as rep stos, so again
maybe I overestimated. I don't know.



end of thread, other threads:[~2009-03-03 16:38 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-24  2:03 Performance regression in write() syscall Salman Qazi
2009-02-24  4:10 ` Nick Piggin
2009-02-24  4:28   ` Linus Torvalds
2009-02-24  9:02     ` Nick Piggin
2009-02-24 15:52       ` Linus Torvalds
2009-02-24 16:24         ` Andi Kleen
2009-02-24 16:51         ` Ingo Molnar
2009-02-25  3:23         ` Nick Piggin
2009-02-25  7:25           ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Ingo Molnar
2009-02-25  8:09             ` Nick Piggin
2009-02-25  8:29               ` Ingo Molnar
2009-02-25  8:59                 ` Nick Piggin
2009-02-25 12:01                   ` Ingo Molnar
2009-02-25 16:04             ` Linus Torvalds
2009-02-25 16:29               ` Ingo Molnar
2009-02-27 12:05               ` Nick Piggin
2009-02-28  8:29                 ` Ingo Molnar
2009-02-28 11:49                   ` Nick Piggin
2009-02-28 12:58                     ` Ingo Molnar
2009-02-28 17:16                       ` Linus Torvalds
2009-02-28 17:24                         ` Arjan van de Ven
2009-02-28 17:42                           ` Linus Torvalds
2009-02-28 17:53                             ` Arjan van de Ven
2009-02-28 18:05                             ` Andi Kleen
2009-02-28 18:27                             ` Ingo Molnar
2009-02-28 18:39                               ` Arjan van de Ven
2009-03-02 10:39                                 ` [PATCH] x86, mm: dont use non-temporal stores in pagecache accesses Ingo Molnar
2009-02-28 18:52                               ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Linus Torvalds
2009-03-01 14:19                                 ` Nick Piggin
2009-03-01  0:06                             ` David Miller
2009-03-01  0:40                               ` Andi Kleen
2009-03-01  0:28                                 ` H. Peter Anvin
2009-03-01  0:38                                   ` Arjan van de Ven
2009-03-01  1:48                                     ` Andi Kleen
2009-03-01  1:38                                       ` Arjan van de Ven
2009-03-01  1:40                                         ` H. Peter Anvin
2009-03-01 14:06                                           ` Nick Piggin
2009-03-02  4:46                                             ` H. Peter Anvin
2009-03-02  6:18                                               ` Nick Piggin
2009-03-02 21:16                                             ` Linus Torvalds
2009-03-02 21:25                                               ` Ingo Molnar
2009-03-03  4:30                                                 ` Nick Piggin
2009-03-03  4:20                                               ` Nick Piggin
2009-03-03  9:02                                                 ` Ingo Molnar
2009-03-04  3:37                                                   ` Nick Piggin
2009-03-01  2:07                                         ` Andi Kleen
2009-02-24  5:43   ` Performance regression in write() syscall Salman Qazi
2009-02-24 10:09 ` Andi Kleen
2009-02-24 16:13   ` Ingo Molnar
2009-02-24 16:51     ` Andi Kleen
