* x86 memcpy performance
@ 2011-08-12 17:59 melwyn lobo
  2011-08-12 18:33 ` Andi Kleen
  2011-08-12 19:52 ` Ingo Molnar
  0 siblings, 2 replies; 40+ messages in thread
From: melwyn lobo @ 2011-08-12 17:59 UTC (permalink / raw)
  To: linux-kernel

Hi All,
Our video recorder application uses memcpy for every frame, about 2 KB of
data per frame, on an Intel® Atom™ Z5xx processor.
With the default 2.6.35 kernel we got 19.6 fps. But the kernel's memcpy
implementation seems suboptimal: when we replaced it with an optimized one
(using SSSE3; the exact patches are currently being finalized) we obtained
22 fps, a gain of 12.2%.
C0 residency also dropped from 75% to 67%, so there are power benefits too.
My questions:
1. Is the kernel memcpy profiled for optimal performance?
2. Does the default kernel configuration for i386 include the best
memcpy implementation (AMD 3DNow!, __builtin_memcpy, etc.)?

Any suggestions or prior experience with this are welcome.

Thanks,
M.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-12 17:59 x86 memcpy performance melwyn lobo
@ 2011-08-12 18:33 ` Andi Kleen
  2011-08-12 19:52 ` Ingo Molnar
  1 sibling, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2011-08-12 18:33 UTC (permalink / raw)
  To: melwyn lobo; +Cc: linux-kernel

melwyn lobo <linux.melwyn@gmail.com> writes:

> Hi All,
> Our Video recorder application uses memcpy for every frame. About 2KB
> data every frame on Intel® Atom™ Z5xx processor.
> With default 2.6.35 kernel we got 19.6 fps. But it seems kernel
> implemented memcpy is suboptimal, because when we replaced
> with an optimized one (using ssse3, exact patches are currently being
> finalized) we obtained 22fps, a gain of 12.2 %.

SSE3 in the kernel memcpy would be incredibly expensive:
it would need a full FPU state save for every call, and preemption
disabled.
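
(For reference, a minimal sketch of the guard pattern such an in-kernel
SSE copy needs - the FPU helpers are the real in-kernel API, but the
function itself is only an illustration, not code from this thread:)

#include <linux/string.h>
#include <linux/hardirq.h>
#include <asm/i387.h>

void *sse_copy_sketch(void *to, const void *from, size_t len)
{
	if (in_interrupt())	/* no FPU use from IRQ context */
		return memcpy(to, from, len);

	kernel_fpu_begin();	/* saves FPU state, disables preemption */
	/* ... XMM register copy loop would go here ... */
	kernel_fpu_end();	/* restores FPU state, re-enables preemption */

	return to;
}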

I haven't seen your patches, but until you get all that
right (and add a lot more overhead to most copies) you
currently have a good chance of corrupting user FPU state.

> C0 residency also reduced from 75% to 67%. This means power benefits too.
> My questions:
> 1. Is kernel memcpy profiled for optimal performance.

It depends on the CPU.

There have been some improvements for Atom on newer kernels
I believe. 

But then the kernel memcpy is usually optimized for relatively
small copies (<= 4K), because very few kernel workloads do more.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-12 17:59 x86 memcpy performance melwyn lobo
  2011-08-12 18:33 ` Andi Kleen
@ 2011-08-12 19:52 ` Ingo Molnar
  2011-08-14  9:59   ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-08-12 19:52 UTC (permalink / raw)
  To: melwyn lobo
  Cc: linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra


* melwyn lobo <linux.melwyn@gmail.com> wrote:

> Hi All,
> Our Video recorder application uses memcpy for every frame. About 2KB
> data every frame on Intel® Atom™ Z5xx processor.
> With default 2.6.35 kernel we got 19.6 fps. But it seems kernel
> implemented memcpy is suboptimal, because when we replaced
> with an optimized one (using ssse3, exact patches are currently being
> finalized) we obtained 22fps, a gain of 12.2 %.
> C0 residency also reduced from 75% to 67%. This means power benefits too.
> My questions:
> 1. Is kernel memcpy profiled for optimal performance.
> 2. Does the default kernel configuration for i386 include the best
> memcpy implementation (AMD 3DNOW, __builtin_memcpy .... etc)
> 
> Any suggestions, prior experience on this is welcome.

Sounds very interesting - it would be nice to see 'perf record' + 
'perf report' profiles done on that workload, before and after your 
patches.

The thing is, we obviously want to achieve that 12.2% fps gain, and 
while we probably do not want to switch the kernel's memcpy to 
SSE right now (the save/restore costs are significant), we could 
certainly try to optimize the specific codepath that your video 
recording path is hitting.

If it's some bulk memcpy in a key video driver then we could offer a 
bulk-optimized x86 memcpy variant which could be called from that 
driver - and that could use SSE3 as well.

So yes, if the speedup is real then I'm sure we can achieve it - but 
exact profiles and measurements would have to be shown.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-12 19:52 ` Ingo Molnar
@ 2011-08-14  9:59   ` Borislav Petkov
  2011-08-14 11:13     ` Denys Vlasenko
  2011-08-16  2:34     ` Valdis.Kletnieks
  0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-14  9:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra, borislav.petkov

[-- Attachment #1: Type: text/plain, Size: 12636 bytes --]

On Fri, Aug 12, 2011 at 09:52:20PM +0200, Ingo Molnar wrote:
> Sounds very interesting - it would be nice to see 'perf record' +
> 'perf report' profiles done on that workload, before and after your
> patches.

FWIW, I've been playing with SSE memcpy version for the kernel recently
too, here's what I have so far:

First of all, I did a trace of all the memcpy buffer sizes used while
building a kernel, see attached kernel_build.sizes.

On the one hand, there is a large number of small chunks copied (1.1M
of 1.2M calls total); on the other, a relatively small number of larger
copies (256 - 2048 bytes), about 100K in total, which nevertheless account
for the larger share of the data copied: 138MB of 175MB total. So, if the
copied buffer is big enough, the context save/restore cost might be
something we're willing to pay.

I first implemented the SSE memcpy in userspace to measure the
speedup vs. the memcpy_64 we have right now:

Benchmarking with 10000 iterations, average results:
size    XM              MM              speedup
119     540.58          449.491         0.8314969419
189     296.318         263.507         0.8892692985
206     297.949         271.399         0.9108923485
224     255.565         235.38          0.9210161798
221     299.383         276.628         0.9239941159
245     299.806         279.432         0.9320430545
369     314.774         316.89          1.006721324
425     327.536         330.475         1.00897153
439     330.847         334.532         1.01113687
458     333.159         340.124         1.020904708
503     334.44          352.166         1.053003229
767     375.612         429.949         1.144661625
870     358.888         312.572         0.8709465025
882     394.297         454.977         1.153893229
925     403.82          472.56          1.170222413
1009    407.147         490.171         1.203915735
1525    512.059         660.133         1.289174911
1737    556.85          725.552         1.302958536
1778    533.839         711.59          1.332965994
1864    558.06          745.317         1.335549882
2039    585.915         813.806         1.388949687
3068    766.462         1105.56         1.442422252
3471    883.983         1239.99         1.40272883
3570    895.822         1266.74         1.414057295
3748    906.832         1302.4          1.436212771
4086    957.649         1486.93         1.552686041
6130    1238.45         1996.42         1.612023046
6961    1413.11         2201.55         1.557939181
7162    1385.5          2216.49         1.59977178
7499    1440.87         2330.12         1.617158856
8182    1610.74         2720.45         1.688950194
12273   2307.86         4042.88         1.751787902
13924   2431.8          4224.48         1.737184756
14335   2469.4          4218.82         1.708440514
15018   2675.67         1904.07         0.711622886
16374   2989.75         5296.26         1.771470902
24564   4262.15         7696.86         1.805863077
27852   4362.53         3347.72         0.7673805572
28672   5122.8          7113.14         1.388524413
30033   4874.62         8740.04         1.792967931
32768   6014.78         7564.2          1.257603505
49142   14464.2         21114.2         1.459757233
55702   16055           23496.8         1.463523623
57339   16725.7         24553.8         1.46803388
60073   17451.5         24407.3         1.398579162


Each size is tested with randomly generated misalignment to exercise the implementation.

I've implemented the SSE memcpy similarly to arch/x86/lib/mmx_32.c and did
some kernel build traces:

with SSE memcpy
===============

Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

    3301761.517649 task-clock                #   24.001 CPUs utilized            ( +-  1.48% )
           520,658 context-switches          #    0.000 M/sec                    ( +-  0.25% )
            63,845 CPU-migrations            #    0.000 M/sec                    ( +-  0.58% )
        26,070,835 page-faults               #    0.008 M/sec                    ( +-  0.00% )
 1,812,482,599,021 cycles                    #    0.549 GHz                      ( +-  0.85% ) [64.55%]
   551,783,051,492 stalled-cycles-frontend   #   30.44% frontend cycles idle     ( +-  0.98% ) [65.64%]
   444,996,901,060 stalled-cycles-backend    #   24.55% backend  cycles idle     ( +-  1.15% ) [67.16%]
 1,488,917,931,766 instructions              #    0.82  insns per cycle
                                             #    0.37  stalled cycles per insn  ( +-  0.91% ) [69.25%]
   340,575,978,517 branches                  #  103.150 M/sec                    ( +-  0.99% ) [68.29%]
    21,519,667,206 branch-misses             #    6.32% of all branches          ( +-  1.09% ) [65.11%]

     137.567155255 seconds time elapsed                                          ( +-  1.48% )


plain 3.0
=========

 Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

    3504754.425527 task-clock                #   24.001 CPUs utilized            ( +-  1.31% )
           518,139 context-switches          #    0.000 M/sec                    ( +-  0.32% )
            61,790 CPU-migrations            #    0.000 M/sec                    ( +-  0.73% )
        26,056,947 page-faults               #    0.007 M/sec                    ( +-  0.00% )
 1,826,757,751,616 cycles                    #    0.521 GHz                      ( +-  0.66% ) [63.86%]
   557,800,617,954 stalled-cycles-frontend   #   30.54% frontend cycles idle     ( +-  0.79% ) [64.65%]
   443,950,768,357 stalled-cycles-backend    #   24.30% backend  cycles idle     ( +-  0.60% ) [67.07%]
 1,469,707,613,500 instructions              #    0.80  insns per cycle
                                             #    0.38  stalled cycles per insn  ( +-  0.68% ) [69.98%]
   335,560,565,070 branches                  #   95.744 M/sec                    ( +-  0.67% ) [69.09%]
    21,365,279,176 branch-misses             #    6.37% of all branches          ( +-  0.65% ) [65.36%]

     146.025263276 seconds time elapsed                                          ( +-  1.31% )


So, although a kernel build is probably not the proper workload for an
SSE memcpy routine, I'm seeing a 9-second build time improvement, i.e.
something around 6%. We're executing a few more instructions, but I'd say
the amount of data moved per instruction is higher due to the 16-byte
XMM moves.

Here's the SSE memcpy version I have so far. I haven't wired in the
proper CPU feature detection yet because we want to run more benchmarks,
like netperf and such, to see whether we get positive results there too.

The SYSTEM_RUNNING check takes care of early-boot situations where we
can't handle FPU exceptions but still use memcpy. There are aligned and
misaligned variants which should handle any buffers and sizes, although
I've set the SSE memcpy threshold at a minimum buffer size of 512 bytes
to amortize the context save/restore somewhat.

Comments are much appreciated! :-)

--
From 385519e844f3466f500774c2c37afe44691ef8d2 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Aug 2011 18:43:08 +0200
Subject: [PATCH] SSE3 memcpy in C

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/include/asm/string_64.h |   14 ++++-
 arch/x86/lib/Makefile            |    2 +-
 arch/x86/lib/sse_memcpy_64.c     |  133 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/lib/sse_memcpy_64.c

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..7bd51bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
 
 #define __HAVE_ARCH_MEMCPY 1
 #ifndef CONFIG_KMEMCHECK
+extern void *__memcpy(void *to, const void *from, size_t len);
+extern void *__sse_memcpy(void *to, const void *from, size_t len);
 #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
-extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len)					\
+({								\
+	size_t __len = (len);					\
+	void *__ret;						\
+	if (__len >= 512)					\
+		__ret = __sse_memcpy((dst), (src), __len);	\
+	else							\
+		__ret = __memcpy((dst), (src), __len);		\
+	__ret;							\
+})
 #else
-extern void *__memcpy(void *to, const void *from, size_t len);
 #define memcpy(dst, src, len)					\
 ({								\
 	size_t __len = (len);					\
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f2479f1..5f90709 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -36,7 +36,7 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
 endif
         lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
 else
-        obj-y += iomap_copy_64.o
+        obj-y += iomap_copy_64.o sse_memcpy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
         lib-y += thunk_64.o clear_page_64.o copy_page_64.o
         lib-y += memmove_64.o memset_64.o
diff --git a/arch/x86/lib/sse_memcpy_64.c b/arch/x86/lib/sse_memcpy_64.c
new file mode 100644
index 0000000..b53fc31
--- /dev/null
+++ b/arch/x86/lib/sse_memcpy_64.c
@@ -0,0 +1,133 @@
+#include <linux/module.h>
+
+#include <asm/i387.h>
+#include <asm/string_64.h>
+
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+	unsigned long src = (unsigned long)from;
+	unsigned long dst = (unsigned long)to;
+	void *p = to;
+	int i;
+
+	if (in_interrupt())
+		return __memcpy(to, from, len);
+
+	if (system_state != SYSTEM_RUNNING)
+		return __memcpy(to, from, len);
+
+	kernel_fpu_begin();
+
+	/* check alignment */
+	if ((src ^ dst) & 0xf)
+		goto unaligned;
+
+	if (src & 0xf) {
+		u8 chunk = 0x10 - (src & 0xf);
+
+		/* copy chunk up to the next 16-byte boundary */
+		__memcpy(to, from, chunk);
+		len -= chunk;
+		to += chunk;
+		from += chunk;
+	}
+
+	/*
+	 * copy in 256 Byte portions
+	 */
+	for (i = 0; i < (len & ~0xff); i += 256) {
+		asm volatile(
+		"movaps 0x0(%0),  %%xmm0\n\t"
+		"movaps 0x10(%0), %%xmm1\n\t"
+		"movaps 0x20(%0), %%xmm2\n\t"
+		"movaps 0x30(%0), %%xmm3\n\t"
+		"movaps 0x40(%0), %%xmm4\n\t"
+		"movaps 0x50(%0), %%xmm5\n\t"
+		"movaps 0x60(%0), %%xmm6\n\t"
+		"movaps 0x70(%0), %%xmm7\n\t"
+		"movaps 0x80(%0), %%xmm8\n\t"
+		"movaps 0x90(%0), %%xmm9\n\t"
+		"movaps 0xa0(%0), %%xmm10\n\t"
+		"movaps 0xb0(%0), %%xmm11\n\t"
+		"movaps 0xc0(%0), %%xmm12\n\t"
+		"movaps 0xd0(%0), %%xmm13\n\t"
+		"movaps 0xe0(%0), %%xmm14\n\t"
+		"movaps 0xf0(%0), %%xmm15\n\t"
+
+		"movaps %%xmm0,  0x0(%1)\n\t"
+		"movaps %%xmm1,  0x10(%1)\n\t"
+		"movaps %%xmm2,  0x20(%1)\n\t"
+		"movaps %%xmm3,  0x30(%1)\n\t"
+		"movaps %%xmm4,  0x40(%1)\n\t"
+		"movaps %%xmm5,  0x50(%1)\n\t"
+		"movaps %%xmm6,  0x60(%1)\n\t"
+		"movaps %%xmm7,  0x70(%1)\n\t"
+		"movaps %%xmm8,  0x80(%1)\n\t"
+		"movaps %%xmm9,  0x90(%1)\n\t"
+		"movaps %%xmm10, 0xa0(%1)\n\t"
+		"movaps %%xmm11, 0xb0(%1)\n\t"
+		"movaps %%xmm12, 0xc0(%1)\n\t"
+		"movaps %%xmm13, 0xd0(%1)\n\t"
+		"movaps %%xmm14, 0xe0(%1)\n\t"
+		"movaps %%xmm15, 0xf0(%1)\n\t"
+		: : "r" (from), "r" (to) : "memory");
+
+		from += 256;
+		to += 256;
+	}
+
+	goto trailer;
+
+unaligned:
+	/*
+	 * copy in 256 Byte portions unaligned
+	 */
+	for (i = 0; i < (len & ~0xff); i += 256) {
+		asm volatile(
+		"movups 0x0(%0),  %%xmm0\n\t"
+		"movups 0x10(%0), %%xmm1\n\t"
+		"movups 0x20(%0), %%xmm2\n\t"
+		"movups 0x30(%0), %%xmm3\n\t"
+		"movups 0x40(%0), %%xmm4\n\t"
+		"movups 0x50(%0), %%xmm5\n\t"
+		"movups 0x60(%0), %%xmm6\n\t"
+		"movups 0x70(%0), %%xmm7\n\t"
+		"movups 0x80(%0), %%xmm8\n\t"
+		"movups 0x90(%0), %%xmm9\n\t"
+		"movups 0xa0(%0), %%xmm10\n\t"
+		"movups 0xb0(%0), %%xmm11\n\t"
+		"movups 0xc0(%0), %%xmm12\n\t"
+		"movups 0xd0(%0), %%xmm13\n\t"
+		"movups 0xe0(%0), %%xmm14\n\t"
+		"movups 0xf0(%0), %%xmm15\n\t"
+
+		"movups %%xmm0,  0x0(%1)\n\t"
+		"movups %%xmm1,  0x10(%1)\n\t"
+		"movups %%xmm2,  0x20(%1)\n\t"
+		"movups %%xmm3,  0x30(%1)\n\t"
+		"movups %%xmm4,  0x40(%1)\n\t"
+		"movups %%xmm5,  0x50(%1)\n\t"
+		"movups %%xmm6,  0x60(%1)\n\t"
+		"movups %%xmm7,  0x70(%1)\n\t"
+		"movups %%xmm8,  0x80(%1)\n\t"
+		"movups %%xmm9,  0x90(%1)\n\t"
+		"movups %%xmm10, 0xa0(%1)\n\t"
+		"movups %%xmm11, 0xb0(%1)\n\t"
+		"movups %%xmm12, 0xc0(%1)\n\t"
+		"movups %%xmm13, 0xd0(%1)\n\t"
+		"movups %%xmm14, 0xe0(%1)\n\t"
+		"movups %%xmm15, 0xf0(%1)\n\t"
+		: : "r" (from), "r" (to) : "memory");
+
+		from += 256;
+		to += 256;
+	}
+
+trailer:
+	__memcpy(to, from, len & 0xff);
+
+	kernel_fpu_end();
+
+	return p;
+}
+EXPORT_SYMBOL_GPL(__sse_memcpy);
-- 
1.7.6.134.gcf13f6


-- 
Regards/Gruss,
    Boris.

[-- Attachment #2: kernel_build.sizes --]
[-- Type: text/plain, Size: 925 bytes --]

Bytes	Count
=====	=====
0	5447
1	3850
2	16255
3	11113
4	68870
5	4256
6	30433
7	19188
8	50490
9	5999
10	78275
11	5628
12	6870
13	7371
14	4742
15	4911
16	143835
17	14096
18	1573
19	13603
20	424321
21	741
22	584
23	450
24	472
25	685
26	367
27	365
28	333
29	301
30	300
31	269
32	489
33	272
34	266
35	220
36	239
37	209
38	249
39	235
40	207
41	181
42	150
43	98
44	194
45	66
46	62
47	52
48	67226
49	138
50	171
51	26
52	20
53	12
54	15
55	4
56	13
57	8
58	6
59	6
60	115
61	10
62	5
63	12
64	67353
65	6
66	2363
67	9
68	11
69	6
70	5
71	6
72	10
73	4
74	9
75	8
76	4
77	6
78	3
79	4
80	3
81	4
82	4
83	4
84	4
85	8
86	6
87	2
88	3
89	2
90	2
91	1
92	9
93	1
94	2
96	2
97	2
98	3
100	2
102	1
104	1
105	1
106	1
107	2
109	1
110	1
111	1
112	1
113	2
115	2
117	1
118	1
119	1
120	14
127	1
128	1
130	1
131	2
134	2
137	1
144	100092
149	1
151	1
153	1
158	1
185	1
217	4
224	3
225	3
227	3
244	1
254	5
255	13
256	21708
512	21746
848	12907
1920	36536
2048	21708

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-14  9:59   ` Borislav Petkov
@ 2011-08-14 11:13     ` Denys Vlasenko
  2011-08-14 12:40       ` Borislav Petkov
  2011-08-16  2:34     ` Valdis.Kletnieks
  1 sibling, 1 reply; 40+ messages in thread
From: Denys Vlasenko @ 2011-08-14 11:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

On Sunday 14 August 2011 11:59, Borislav Petkov wrote:
> Here's the SSE memcpy version I got so far, I haven't wired in the
> proper CPU feature detection yet because we want to run more benchmarks
> like netperf and stuff to see whether we see any positive results there.
> 
> The SYSTEM_RUNNING check is to take care of early boot situations where
> we can't handle FPU exceptions but we use memcpy. There's an aligned and
> misaligned variant which should handle any buffers and sizes although
> I've set the SSE memcpy threshold at 512 Bytes buffersize the least to
> cover context save/restore somewhat.
> 
> Comments are much appreciated! :-)
> 
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
>  
>  #define __HAVE_ARCH_MEMCPY 1
>  #ifndef CONFIG_KMEMCHECK
> +extern void *__memcpy(void *to, const void *from, size_t len);
> +extern void *__sse_memcpy(void *to, const void *from, size_t len);
>  #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
> -extern void *memcpy(void *to, const void *from, size_t len);
> +#define memcpy(dst, src, len)					\
> +({								\
> +	size_t __len = (len);					\
> +	void *__ret;						\
> +	if (__len >= 512)					\
> +		__ret = __sse_memcpy((dst), (src), __len);	\
> +	else							\
> +		__ret = __memcpy((dst), (src), __len);		\
> +	__ret;							\
> +})

Please, no. Do not inline every memcpy invocation.
This is pure bloat (considering how many memcpy calls there are)
and it doesn't even gain anything in speed, since there will be
a function call either way.
Put the __len >= 512 check inside your memcpy instead.

You may do the check inline if you know that __len is constant:
if (__builtin_constant_p(__len) && __len >= 512) ...
because in that case gcc will evaluate it at compile time.
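
A sketch of the call site with that check folded in (illustrative only;
__sse_memcpy/__memcpy are the names from the patch above, and the
non-constant case is assumed to do its own size check out of line):

#define memcpy(dst, src, len)					\
({								\
	size_t __len = (len);					\
	void *__ret;						\
	if (__builtin_constant_p(__len) && __len >= 512)	\
		__ret = __sse_memcpy((dst), (src), __len);	\
	else							\
		__ret = __memcpy((dst), (src), __len);		\
	__ret;							\
})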

-- 
vda

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-14 11:13     ` Denys Vlasenko
@ 2011-08-14 12:40       ` Borislav Petkov
  2011-08-15 13:27         ` melwyn lobo
  2011-08-15 13:44         ` Denys Vlasenko
  0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-14 12:40 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

On Sun, Aug 14, 2011 at 01:13:56PM +0200, Denys Vlasenko wrote:
> On Sunday 14 August 2011 11:59, Borislav Petkov wrote:
> > Here's the SSE memcpy version I got so far, I haven't wired in the
> > proper CPU feature detection yet because we want to run more benchmarks
> > like netperf and stuff to see whether we see any positive results there.
> > 
> > The SYSTEM_RUNNING check is to take care of early boot situations where
> > we can't handle FPU exceptions but we use memcpy. There's an aligned and
> > misaligned variant which should handle any buffers and sizes although
> > I've set the SSE memcpy threshold at 512 Bytes buffersize the least to
> > cover context save/restore somewhat.
> > 
> > Comments are much appreciated! :-)
> > 
> > --- a/arch/x86/include/asm/string_64.h
> > +++ b/arch/x86/include/asm/string_64.h
> > @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
> >  
> >  #define __HAVE_ARCH_MEMCPY 1
> >  #ifndef CONFIG_KMEMCHECK
> > +extern void *__memcpy(void *to, const void *from, size_t len);
> > +extern void *__sse_memcpy(void *to, const void *from, size_t len);
> >  #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
> > -extern void *memcpy(void *to, const void *from, size_t len);
> > +#define memcpy(dst, src, len)					\
> > +({								\
> > +	size_t __len = (len);					\
> > +	void *__ret;						\
> > +	if (__len >= 512)					\
> > +		__ret = __sse_memcpy((dst), (src), __len);	\
> > +	else							\
> > +		__ret = __memcpy((dst), (src), __len);		\
> > +	__ret;							\
> > +})
> 
> Please, no. Do not inline every memcpy invocation.
> This is pure bloat (considering how many memcpy calls there are)
> and it doesn't even gain anything in speed, since there will be
> a function call either way.
> Put the __len >= 512 check inside your memcpy instead.

In the __len < 512 case, this would actually cause two function calls:
one to __sse_memcpy and then one to __memcpy.

> You may do the check if you know that __len is constant:
> if (__builtin_constant_p(__len) && __len >= 512) ...
> because in this case gcc will evaluate it at compile-time.

That could justify the bloat at least partially.

Actually, I had a version which put the sse_memcpy code into memcpy_64.S,
which would save us both the function call and the bloat. I might
return to that one if it turns out that SSE memcpy makes sense for the
kernel.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-14 12:40       ` Borislav Petkov
@ 2011-08-15 13:27         ` melwyn lobo
  2011-08-15 13:44         ` Denys Vlasenko
  1 sibling, 0 replies; 40+ messages in thread
From: melwyn lobo @ 2011-08-15 13:27 UTC (permalink / raw)
  To: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra, borislav.petkov

Hi,
I was on vacation for the last two days. Thanks for the good insights
into the issue.
Ingo, unfortunately the data we have is on a soon-to-be-released
platform and is strictly confidential at this stage.

Boris, thanks for the patch. Looking at it:
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+       unsigned long src = (unsigned long)from;
+       unsigned long dst = (unsigned long)to;
+       void *p = to;
+       int i;
+
+       if (in_interrupt())
+               return __memcpy(to, from, len)
So what is the reason we cannot use sse_memcpy in interrupt context?
(FPU registers not saved?)
My question is still not answered. There are 3 versions of memcpy in the kernel:

***********************************arch/x86/include/asm/string_32.h******************************
#ifndef CONFIG_KMEMCHECK

#if (__GNUC__ >= 4)
#define memcpy(t, f, n) __builtin_memcpy(t, f, n)
#else
#define memcpy(t, f, n)                         \
        (__builtin_constant_p((n))              \
         ? __constant_memcpy((t), (f), (n))     \
         : __memcpy((t), (f), (n)))
#endif
#else
/*
 * kmemcheck becomes very happy if we use the REP instructions unconditionally,
 * because it means that we know both memory operands in advance.
 */
#define memcpy(t, f, n) __memcpy((t), (f), (n))
#endif

****************************************************************************************.
I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy()), as it is
valid only for AMD and not for the Atom Z5xx series.
That leaves __memcpy, __constant_memcpy and __builtin_memcpy.
I have a hunch we were using __builtin_memcpy by default, because my
GCC version is >= 4 and CONFIG_KMEMCHECK is not defined.
Can someone confirm which of these three is used with i386_defconfig?
And, again with i386_defconfig, which workloads provide the best results
with the default implementation?

thanks,
M.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-14 12:40       ` Borislav Petkov
  2011-08-15 13:27         ` melwyn lobo
@ 2011-08-15 13:44         ` Denys Vlasenko
  1 sibling, 0 replies; 40+ messages in thread
From: Denys Vlasenko @ 2011-08-15 13:44 UTC (permalink / raw)
  To: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra, borislav.petkov

On Sun, Aug 14, 2011 at 2:40 PM, Borislav Petkov <bp@alien8.de> wrote:
>> > +   if (__len >= 512)                                       \
>> > +           __ret = __sse_memcpy((dst), (src), __len);      \
>> > +   else                                                    \
>> > +           __ret = __memcpy((dst), (src), __len);          \
>> > +   __ret;                                                  \
>> > +})
>>
>> Please, no. Do not inline every memcpy invocation.
>> This is pure bloat (comsidering how many memcpy calls there are)
>> and it doesn't even win anything in speed, since there will be
>> a fucntion call either way.
>> Put the __len >= 512 check inside your memcpy instead.
>
> In the __len < 512 case, this would actually cause two function calls,
> actually: once the __sse_memcpy and then the __memcpy one.

You didn't notice the "else".

>> You may do the check if you know that __len is constant:
>> if (__builtin_constant_p(__len) && __len >= 512) ...
>> because in this case gcc will evaluate it at compile-time.
>
> That could justify the bloat at least partially.

There will be no bloat in this case.
-- 
vda

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-14  9:59   ` Borislav Petkov
  2011-08-14 11:13     ` Denys Vlasenko
@ 2011-08-16  2:34     ` Valdis.Kletnieks
  2011-08-16 12:16       ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: Valdis.Kletnieks @ 2011-08-16  2:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

[-- Attachment #1: Type: text/plain, Size: 1109 bytes --]

On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:

> Benchmarking with 10000 iterations, average results:
> size    XM              MM              speedup
> 119     540.58          449.491         0.8314969419

> 12273   2307.86         4042.88         1.751787902
> 13924   2431.8          4224.48         1.737184756
> 14335   2469.4          4218.82         1.708440514
> 15018   2675.67         1904.07         0.711622886
> 16374   2989.75         5296.26         1.771470902
> 24564   4262.15         7696.86         1.805863077
> 27852   4362.53         3347.72         0.7673805572
> 28672   5122.8          7113.14         1.388524413
> 30033   4874.62         8740.04         1.792967931

The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
really good about this until we understand what happened in those two cases.

Also, anytime I see "10000 iterations", I ask myself whether the benchmark
setup took proper note of hot/cold cache issues. That *may* explain the two
oddball results we see above - but not knowing more about how it was
benched, it's hard to say.
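
(For what it's worth, one way a rig can control for that is to explicitly
flush both buffers between iterations so every run measures the cold-cache
case - a hedged userspace sketch, not code from the benchmark under
discussion:)

#include <stddef.h>

/* Flush a buffer from all cache levels; assumes x86 and 64-byte lines. */
static void flush_buf(const void *buf, size_t len)
{
	const char *p = buf;
	size_t i;

	for (i = 0; i < len; i += 64)
		asm volatile("clflush (%0)" :: "r" (p + i) : "memory");
	asm volatile("mfence" ::: "memory");
}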


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-16  2:34     ` Valdis.Kletnieks
@ 2011-08-16 12:16       ` Borislav Petkov
  2011-09-01 15:15         ` Maarten Lankhorst
  0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-16 12:16 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Borislav Petkov, Ingo Molnar, melwyn lobo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2448 bytes --]

On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
> 
> > Benchmarking with 10000 iterations, average results:
> > size    XM              MM              speedup
> > 119     540.58          449.491         0.8314969419
> 
> > 12273   2307.86         4042.88         1.751787902
> > 13924   2431.8          4224.48         1.737184756
> > 14335   2469.4          4218.82         1.708440514
> > 15018   2675.67         1904.07         0.711622886
> > 16374   2989.75         5296.26         1.771470902
> > 24564   4262.15         7696.86         1.805863077
> > 27852   4362.53         3347.72         0.7673805572
> > 28672   5122.8          7113.14         1.388524413
> > 30033   4874.62         8740.04         1.792967931
> 
> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
> really good about this till we understand what happened for those two cases.

Yep.

> Also, anytime I see "10000 iterations", I ask myself if the benchmark
> rigging took proper note of hot/cold cache issues. That *may* explain
> the two oddball results we see above - but not knowing more about how
> it was benched, it's hard to say.

Yeah, the more scrutiny this gets the better. So I've cleaned up my
setup and have attached it.

xm_mem.c does the benchmarking and in bench_memcpy() there's the
sse_memcpy call which is the SSE memcpy implementation using inline asm.
It looks like gcc produces pretty crappy code here because if I replace
the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
same function but in pure asm - I get much better numbers, sometimes
even over 2x. It all depends on the alignment of the buffers though.
Also, those numbers don't include the context saving/restoring which the
kernel does for us.

7491    1509.89         2346.94         1.554378381
8170    2166.81         2857.78         1.318890326
12277   2659.03         4179.31         1.571744176
13907   2571.24         4125.7          1.604558427
14319   2638.74         5799.67         2.19789466	<----
14993   2752.42         4413.85         1.603625603
16371   3479.11         5562.65         1.59887055

So please take a look and let me know what you think.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

[-- Attachment #2: sse_memcpy.tar.bz2 --]
[-- Type: application/octet-stream, Size: 3508 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-16 12:16       ` Borislav Petkov
@ 2011-09-01 15:15         ` Maarten Lankhorst
  2011-09-01 16:18           ` Linus Torvalds
  2011-12-05 12:54           ` melwyn lobo
  0 siblings, 2 replies; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-01 15:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Valdis.Kletnieks, Borislav Petkov, Ingo Molnar, melwyn lobo,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 3418 bytes --]

Hey,

2011/8/16 Borislav Petkov <bp@amd64.org>:
> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>
>> > Benchmarking with 10000 iterations, average results:
>> > size    XM              MM              speedup
>> > 119     540.58          449.491         0.8314969419
>>
>> > 12273   2307.86         4042.88         1.751787902
>> > 13924   2431.8          4224.48         1.737184756
>> > 14335   2469.4          4218.82         1.708440514
>> > 15018   2675.67         1904.07         0.711622886
>> > 16374   2989.75         5296.26         1.771470902
>> > 24564   4262.15         7696.86         1.805863077
>> > 27852   4362.53         3347.72         0.7673805572
>> > 28672   5122.8          7113.14         1.388524413
>> > 30033   4874.62         8740.04         1.792967931
>>
>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>> really good about this till we understand what happened for those two cases.
>
> Yep.
>
>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>> rigging took proper note of hot/cold cache issues. That *may* explain
>> the two oddball results we see above - but not knowing more about how
>> it was benched, it's hard to say.
>
> Yeah, the more scrutiny this gets the better. So I've cleaned up my
> setup and have attached it.
>
> xm_mem.c does the benchmarking and in bench_memcpy() there's the
> sse_memcpy call which is the SSE memcpy implementation using inline asm.
> It looks like gcc produces pretty crappy code here because if I replace
> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
> same function but in pure asm - I get much better numbers, sometimes
> even over 2x. It all depends on the alignment of the buffers though.
> Also, those numbers don't include the context saving/restoring which the
> kernel does for us.
>
> 7491    1509.89         2346.94         1.554378381
> 8170    2166.81         2857.78         1.318890326
> 12277   2659.03         4179.31         1.571744176
> 13907   2571.24         4125.7          1.604558427
> 14319   2638.74         5799.67         2.19789466      <----
> 14993   2752.42         4413.85         1.603625603
> 16371   3479.11         5562.65         1.59887055

This work intrigued me; in some cases the kernel memcpy was a lot faster than the SSE memcpy,
and I finally figured out why. I also extended the test to an optimized AVX memcpy,
but I think the kernel memcpy will always win in the aligned case.

Those numbers you posted don't seem right. It depends a lot on the alignment;
for example, if both buffers are aligned to 64 bytes relative to each other,
the kernel memcpy beats the AVX memcpy on my machine.

I replaced the malloc calls with memalign(65536, size + 256) so I could play
around with the alignments a little. This explains why, for some sizes, the
kernel memcpy was faster than the SSE memcpy in the test results you had.
When (src & 63) == (dst & 63), the kernel memcpy seems to always win; otherwise
the AVX memcpy might.
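
(Roughly, the sweep looks like this - a sketch using the same memalign()
sizing, not the attached testcase itself; the timing of the memcpy call is
left out:)

#include <malloc.h>
#include <stdlib.h>
#include <string.h>

/* Walk source/destination offsets within a 64-byte line over buffers that
 * are themselves heavily aligned, so only the chosen misalignment varies. */
static void sweep_alignments(size_t size)
{
	char *src = memalign(65536, size + 256);
	char *dst = memalign(65536, size + 256);
	int soff, doff;

	for (soff = 0; soff < 64; soff += 4)
		for (doff = 0; doff < 64; doff += 4)
			memcpy(dst + doff, src + soff, size);	/* time this call */

	free(src);
	free(dst);
}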

If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned relative to each other.

Cheers,
Maarten

---
Attached: my modified version of the sse memcpy you posted.

I changed it a bit and used AVX, but some of the other changes might
benefit your SSE memcpy too.

[-- Attachment #2: ym_memcpy.txt --]
[-- Type: text/plain, Size: 2668 bytes --]

/*
 * ym_memcpy - AVX version of memcpy
 *
 * Input:
 *  rdi destination
 *  rsi source
 *  rdx count
 *
 * Output:
 * rax original destination
 */
.globl ym_memcpy
.type ym_memcpy, @function

ym_memcpy:
	mov %rdi, %rax

	/* Target align */
	movzbq %dil, %rcx
	negb %cl
	andb $0x1f, %cl
	subq %rcx, %rdx
	rep movsb

	movq %rdx, %rcx
	andq $0x1ff, %rdx
	shrq $9, %rcx
	jz .trailer

	movb %sil, %r8b
	andb $0x1f, %r8b
	test %r8b, %r8b
	jz .repeat_a

	.align 32
.repeat_ua:
	vmovups 0x0(%rsi), %ymm0
	vmovups 0x20(%rsi), %ymm1
	vmovups 0x40(%rsi), %ymm2
	vmovups 0x60(%rsi), %ymm3
	vmovups 0x80(%rsi), %ymm4
	vmovups 0xa0(%rsi), %ymm5
	vmovups 0xc0(%rsi), %ymm6
	vmovups 0xe0(%rsi), %ymm7
	vmovups 0x100(%rsi), %ymm8
	vmovups 0x120(%rsi), %ymm9
	vmovups 0x140(%rsi), %ymm10
	vmovups 0x160(%rsi), %ymm11
	vmovups 0x180(%rsi), %ymm12
	vmovups 0x1a0(%rsi), %ymm13
	vmovups 0x1c0(%rsi), %ymm14
	vmovups 0x1e0(%rsi), %ymm15

	vmovaps %ymm0, 0x0(%rdi)
	vmovaps %ymm1, 0x20(%rdi)
	vmovaps %ymm2, 0x40(%rdi)
	vmovaps %ymm3, 0x60(%rdi)
	vmovaps %ymm4, 0x80(%rdi)
	vmovaps %ymm5, 0xa0(%rdi)
	vmovaps %ymm6, 0xc0(%rdi)
	vmovaps %ymm7, 0xe0(%rdi)
	vmovaps %ymm8, 0x100(%rdi)
	vmovaps %ymm9, 0x120(%rdi)
	vmovaps %ymm10, 0x140(%rdi)
	vmovaps %ymm11, 0x160(%rdi)
	vmovaps %ymm12, 0x180(%rdi)
	vmovaps %ymm13, 0x1a0(%rdi)
	vmovaps %ymm14, 0x1c0(%rdi)
	vmovaps %ymm15, 0x1e0(%rdi)

	/* advance pointers */
	addq $0x200, %rsi
	addq $0x200, %rdi
	subq $1, %rcx
	jnz .repeat_ua
	jz .trailer

	.align 32
.repeat_a:
	prefetchnta 0x80(%rsi)
	prefetchnta 0x100(%rsi)
	prefetchnta 0x180(%rsi)
	vmovaps 0x0(%rsi), %ymm0
	vmovaps 0x20(%rsi), %ymm1
	vmovaps 0x40(%rsi), %ymm2
	vmovaps 0x60(%rsi), %ymm3
	vmovaps 0x80(%rsi), %ymm4
	vmovaps 0xa0(%rsi), %ymm5
	vmovaps 0xc0(%rsi), %ymm6
	vmovaps 0xe0(%rsi), %ymm7
	vmovaps 0x100(%rsi), %ymm8
	vmovaps 0x120(%rsi), %ymm9
	vmovaps 0x140(%rsi), %ymm10
	vmovaps 0x160(%rsi), %ymm11
	vmovaps 0x180(%rsi), %ymm12
	vmovaps 0x1a0(%rsi), %ymm13
	vmovaps 0x1c0(%rsi), %ymm14
	vmovaps 0x1e0(%rsi), %ymm15

	vmovaps %ymm0, 0x0(%rdi)
	vmovaps %ymm1, 0x20(%rdi)
	vmovaps %ymm2, 0x40(%rdi)
	vmovaps %ymm3, 0x60(%rdi)
	vmovaps %ymm4, 0x80(%rdi)
	vmovaps %ymm5, 0xa0(%rdi)
	vmovaps %ymm6, 0xc0(%rdi)
	vmovaps %ymm7, 0xe0(%rdi)
	vmovaps %ymm8, 0x100(%rdi)
	vmovaps %ymm9, 0x120(%rdi)
	vmovaps %ymm10, 0x140(%rdi)
	vmovaps %ymm11, 0x160(%rdi)
	vmovaps %ymm12, 0x180(%rdi)
	vmovaps %ymm13, 0x1a0(%rdi)
	vmovaps %ymm14, 0x1c0(%rdi)
	vmovaps %ymm15, 0x1e0(%rdi)

	/* advance pointers */
	addq $0x200, %rsi
	addq $0x200, %rdi
	subq $1, %rcx
	jnz .repeat_a

	.align 32
.trailer:
	movq %rdx, %rcx
	shrq $3, %rcx
	rep; movsq
	movq %rdx, %rcx
	andq $0x7, %rcx
	rep; movsb
	retq

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-01 15:15         ` Maarten Lankhorst
@ 2011-09-01 16:18           ` Linus Torvalds
  2011-09-08  8:35             ` Borislav Petkov
  2011-12-05 12:54           ` melwyn lobo
  1 sibling, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2011-09-01 16:18 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: Borislav Petkov, Valdis.Kletnieks, Borislav Petkov, Ingo Molnar,
	melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra

On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
<m.b.lankhorst@gmail.com> wrote:
>
> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> and I finally figured out why. I also extended the test to an optimized avx memcpy,
> but I think the kernel memcpy will always win in the aligned case.

"rep movs" is generally optimized in microcode on most modern Intel
CPU's for some easyish cases, and it will outperform just about
anything.

Atom is a notable exception, but if you expect performance on any
general loads from Atom, you need to get your head examined. Atom is a
disaster for anything but tuned loops.

The "easyish cases" depend on microarchitecture. They are improving,
so long-term "rep movs" is the best way regardless, but for most
current ones it's something like "source aligned to 8 bytes *and*
source and destination are equal "mod 64"".

And that's true in a lot of common situations. It's true for the page
copy, for example, and it's often true for big user "read()/write()"
calls (but "often" may not be "often enough" - high-performance
userland should strive to align read/write buffers to 64 bytes, for
example).
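
(As an illustration only - not from this mail - a 64-byte-aligned
allocation for such a userland I/O buffer could look like:)

#include <stdlib.h>

/* Allocate a read()/write() buffer aligned to a 64-byte cache line so the
 * copy is more likely to hit the fast "rep movs" case described above. */
static void *alloc_io_buffer(size_t size)
{
	void *buf;

	if (posix_memalign(&buf, 64, size))
		return NULL;
	return buf;
}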

Many other cases of "memcpy()" are the fairly small, constant-sized
ones, where the optimal strategy tends to be "move words by hand".
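
(For example - an illustrative struct, not from this mail - a constant-sized
copy like this is expanded by the compiler into a few register moves, with
no call at all:)

#include <string.h>

struct sample {
	unsigned long w[4];
};

static inline void copy_sample(struct sample *d, const struct sample *s)
{
	memcpy(d, s, sizeof(*d));	/* typically four 8-byte moves */
}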

                      Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-01 16:18           ` Linus Torvalds
@ 2011-09-08  8:35             ` Borislav Petkov
  2011-09-08 10:58               ` Maarten Lankhorst
  0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-09-08  8:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks,
	Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra

On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
> <m.b.lankhorst@gmail.com> wrote:
> >
> > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> > and I finally figured out why. I also extended the test to an optimized avx memcpy,
> > but I think the kernel memcpy will always win in the aligned case.
> 
> "rep movs" is generally optimized in microcode on most modern Intel
> CPU's for some easyish cases, and it will outperform just about
> anything.
> 
> Atom is a notable exception, but if you expect performance on any
> general loads from Atom, you need to get your head examined. Atom is a
> disaster for anything but tuned loops.
> 
> The "easyish cases" depend on microarchitecture. They are improving,
> so long-term "rep movs" is the best way regardless, but for most
> current ones it's something like "source aligned to 8 bytes *and*
> source and destination are equal "mod 64"".
> 
> And that's true in a lot of common situations. It's true for the page
> copy, for example, and it's often true for big user "read()/write()"
> calls (but "often" may not be "often enough" - high-performance
> userland should strive to align read/write buffers to 64 bytes, for
> example).
> 
> Many other cases of "memcpy()" are the fairly small, constant-sized
> ones, where the optimal strategy tends to be "move words by hand".

Yeah,

this probably makes enabling SSE memcpy in the kernel a task
with diminishing returns. There are also the additional costs of
saving/restoring FPU context in the kernel, which eat into any SSE
speedup.

And then there's the additional I$ pressure because "rep movs" is
much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
smallest (two-byte) instructions I could use - in the AVX case they can
get up to 4 Bytes of length with the VEX prefix and the additional SIB,
size override, etc. fields.

Oh, and then there's copy_*_user, which also does fault handling;
replacing that with an SSE version of memcpy could get quite hairy quite
fast.

Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
when I get the time, to see whether it still makes sense at all.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-08  8:35             ` Borislav Petkov
@ 2011-09-08 10:58               ` Maarten Lankhorst
  2011-09-09  8:14                 ` Borislav Petkov
  0 siblings, 1 reply; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-08 10:58 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Borislav Petkov,
	Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 3330 bytes --]

On 09/08/2011 10:35 AM, Borislav Petkov wrote:
> On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
>> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
>> <m.b.lankhorst@gmail.com> wrote:
>>> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
>>> and I finally figured out why. I also extended the test to an optimized avx memcpy,
>>> but I think the kernel memcpy will always win in the aligned case.
>> "rep movs" is generally optimized in microcode on most modern Intel
>> CPU's for some easyish cases, and it will outperform just about
>> anything.
>>
>> Atom is a notable exception, but if you expect performance on any
>> general loads from Atom, you need to get your head examined. Atom is a
>> disaster for anything but tuned loops.
>>
>> The "easyish cases" depend on microarchitecture. They are improving,
>> so long-term "rep movs" is the best way regardless, but for most
>> current ones it's something like "source aligned to 8 bytes *and*
>> source and destination are equal "mod 64"".
>>
>> And that's true in a lot of common situations. It's true for the page
>> copy, for example, and it's often true for big user "read()/write()"
>> calls (but "often" may not be "often enough" - high-performance
>> userland should strive to align read/write buffers to 64 bytes, for
>> example).
>>
>> Many other cases of "memcpy()" are the fairly small, constant-sized
>> ones, where the optimal strategy tends to be "move words by hand".
> Yeah,
>
> this probably makes enabling SSE memcpy in the kernel a task
> with diminishing returns. There are also the additional costs of
> saving/restoring FPU context in the kernel which eat off from any SSE
> speedup.
>
> And then there's the additional I$ pressure because "rep movs" is
> much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
> smallest (two-byte) instructions I could use - in the AVX case they can
> get up to 4 Bytes of length with the VEX prefix and the additional SIB,
> size override, etc. fields.
>
> Oh, and then there's copy_*_user which also does fault handling and
> replacing that with a SSE version of memcpy could get quite hairy quite
> fast.
>
> Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
> when I get the time to see whether it still makes sense, at all.
>
I have changed your SSE memcpy test to use fixed source/destination
offsets instead of random misalignment; from that you can see that you
don't really get a speedup at all. It seems to be more a case of 'kernel
memcpy is significantly slower with some alignments' than 'AVX memcpy is
just that much faster'.

For example, a 3754-byte copy with source misalignment 4 and target
misalignment 20 takes 1185 units with the AVX memcpy, but 1480 units
with the kernel memcpy.

The modified testcase is attached. I did some optimizations in the AVX
memcpy, but I fear I may be missing something: when I tried to put it in
the kernel, it complained about SATA errors I had never seen before, so I
immediately went for the power button to prevent more damage. Fortunately
it only corrupted some kernel object files, and btrfs threw checksum
errors. :)

All in all I think testing in userspace is safer; you might want to run it on an
idle CPU with schedtool, with a high FIFO priority, and set the cpufreq governor
to performance.

~Maarten

[-- Attachment #2: memcpy.tar.gz --]
[-- Type: application/x-gzip, Size: 4352 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-08 10:58               ` Maarten Lankhorst
@ 2011-09-09  8:14                 ` Borislav Petkov
  2011-09-09 10:12                   ` Maarten Lankhorst
  2011-09-09 14:39                   ` Linus Torvalds
  0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-09-09  8:14 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: Linus Torvalds, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar,
	melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Peter Zijlstra

On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
> I have changed your sse memcpy to test various alignments with
> source/destination offsets instead of random, from that you can
> see that you don't really get a speedup at all. It seems to be more
> a case of 'kernel memcpy is significantly slower with some alignments',
> than 'avx memcpy is just that much faster'.
> 
> For example 3754 with src misalignment 4 and target misalignment 20
> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy

Right, so the idea is to check whether, with the bigger buffer sizes
(and misaligned ones, although that should not often be the case in
the kernel), the SSE version would outperform a "rep movs" whose ucode
optimizations don't kick in.

With your version modified back to SSE memcpy (don't have an AVX box
right now) I get on an AMD F10h:

...
16384(12/40)    4756.24         7867.74         1.654192552
16384(40/12)    5067.81         6068.71         1.197500008
16384(12/44)    4341.3          8474.96         1.952172387
16384(44/12)    4277.13         7107.64         1.661777347
16384(12/48)    4989.16         7964.54         1.596369011
16384(48/12)    4644.94         6499.5          1.399264281
...

which look like pretty nice numbers to me. I can't say whether there
ever is a 16K buffer we copy in the kernel, but if there were... <16K
buffers also show up to a 1.5x speedup, so I'd say it's a uarch thing.
As I said, it would be best to put it in the kernel and run a bunch of
benchmarks...

> The modified testcase is attached, I did some optimizations in avx
> memcpy, but I fear I may be missing something, when I tried to put it
> in the kernel, it complained about sata errors I never had before,
> so I immediately went for the power button to prevent more errors,
> fortunately it only corrupted some kernel object files, and btrfs
> threw checksum errors. :)

Well, your version should do something similar to what _mmx_memcpy does:
save FPU state and not execute in IRQ context.

> All in all I think testing in userspace is safer, you might want to
> run it on an idle cpu with schedtool, with a high fifo priority, and
> set cpufreq governor to performance.

No, you need a generic system with default settings - otherwise it is
blatant benchmark lying :-)

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-09  8:14                 ` Borislav Petkov
@ 2011-09-09 10:12                   ` Maarten Lankhorst
  2011-09-09 11:23                     ` Maarten Lankhorst
  2011-09-09 14:39                   ` Linus Torvalds
  1 sibling, 1 reply; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-09 10:12 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Borislav Petkov,
	Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra

Hey,

On 09/09/2011 10:14 AM, Borislav Petkov wrote:
> On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
>> I have changed your sse memcpy to test various alignments with
>> source/destination offsets instead of random, from that you can
>> see that you don't really get a speedup at all. It seems to be more
>> a case of 'kernel memcpy is significantly slower with some alignments',
>> than 'avx memcpy is just that much faster'.
>>
>> For example 3754 with src misalignment 4 and target misalignment 20
>> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy
> Right, so the idea is to check whether with the bigger buffer sizes
> (and misaligned, although this should not be that often the case in
> the kernel) the SSE version would outperform a "rep movs" with ucode
> optimizations not kicking in.
>
> With your version modified back to SSE memcpy (don't have an AVX box
> right now) I get on an AMD F10h:
>
> ...
> 16384(12/40)    4756.24         7867.74         1.654192552
> 16384(40/12)    5067.81         6068.71         1.197500008
> 16384(12/44)    4341.3          8474.96         1.952172387
> 16384(44/12)    4277.13         7107.64         1.661777347
> 16384(12/48)    4989.16         7964.54         1.596369011
> 16384(48/12)    4644.94         6499.5          1.399264281
> ...
>
> which looks like pretty nice numbers to me. I can't say whether there
> ever is 16K buffer we copy in the kernel but if there were... But <16K
> buffers also show up to 1.5x speedup. So I'd say it's a uarch thing.
> As I said, best it would be to put it in the kernel and run a bunch of
> benchmarks...
I think for bigger memcpys it might make sense to demand stricter
alignment. What are your numbers for (0/0)? In my case the kernel
memcpy seems to always be faster there. In fact, src&63 == dst&63
generally seems to be faster with the kernel memcpy.

Patching my tree to WARN_ON_ONCE when this condition isn't true, I get the following warnings:

WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()

The most persistent one appears to be btrfs' *_extent_buffer helpers;
they trigger the most warnings on my system. Apart from that, there's
not much to gain on my system, since the alignment is already close to
optimal.

My ext4 /home doesn't throw warnings, so I'd gain the most by figuring
out whether I could improve btrfs/extent_io.c in some way.
The patch that triggers those warnings is below; change it to WARN_ON
if you want to see which one happens most often for you.

I was pleasantly surprised though.

>> The modified testcase is attached, I did some optimizations in avx
>> memcpy, but I fear I may be missing something, when I tried to put it
>> in the kernel, it complained about sata errors I never had before,
>> so I immediately went for the power button to prevent more errors,
>> fortunately it only corrupted some kernel object files, and btrfs
>> threw checksum errors. :)
> Well, your version should do something similar to what _mmx_memcpy does:
> save FPU state and not execute in IRQ context.
>
>> All in all I think testing in userspace is safer, you might want to
>> run it on an idle cpu with schedtool, with a high fifo priority, and
>> set cpufreq governor to performance.
> No, you need a generic system with default settings - otherwise it is
> blatant benchmark lying :-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..77180bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -30,6 +30,14 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
 #ifndef CONFIG_KMEMCHECK
 #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
 extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len)					\
+({								\
+	size_t __len = (len);					\
+	const void *__src = (src);				\
+	void *__dst = (dst);					\
+	WARN_ON_ONCE(__len > 1024 && (((long)__src & 63) != ((long)__dst & 63))); \
+	memcpy(__dst, __src, __len);				\
+})
 #else
 extern void *__memcpy(void *to, const void *from, size_t len);
 #define memcpy(dst, src, len)					\



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-09 10:12                   ` Maarten Lankhorst
@ 2011-09-09 11:23                     ` Maarten Lankhorst
  2011-09-09 13:42                       ` Borislav Petkov
  0 siblings, 1 reply; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-09 11:23 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Borislav Petkov,
	Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra

Hey just a followup on btrfs,

On 09/09/2011 12:12 PM, Maarten Lankhorst wrote:
> Hey,
>
> On 09/09/2011 10:14 AM, Borislav Petkov wrote:
>> On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
>>> I have changed your sse memcpy to test various alignments with
>>> source/destination offsets instead of random, from that you can
>>> see that you don't really get a speedup at all. It seems to be more
>>> a case of 'kernel memcpy is significantly slower with some alignments',
>>> than 'avx memcpy is just that much faster'.
>>>
>>> For example 3754 with src misalignment 4 and target misalignment 20
>>> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy
>> Right, so the idea is to check whether with the bigger buffer sizes
>> (and misaligned, although this should not be that often the case in
>> the kernel) the SSE version would outperform a "rep movs" with ucode
>> optimizations not kicking in.
>>
>> With your version modified back to SSE memcpy (don't have an AVX box
>> right now) I get on an AMD F10h:
>>
>> ...
>> 16384(12/40)    4756.24         7867.74         1.654192552
>> 16384(40/12)    5067.81         6068.71         1.197500008
>> 16384(12/44)    4341.3          8474.96         1.952172387
>> 16384(44/12)    4277.13         7107.64         1.661777347
>> 16384(12/48)    4989.16         7964.54         1.596369011
>> 16384(48/12)    4644.94         6499.5          1.399264281
>> ...
>>
>> which looks like pretty nice numbers to me. I can't say whether there
>> ever is 16K buffer we copy in the kernel but if there were... But <16K
>> buffers also show up to 1.5x speedup. So I'd say it's a uarch thing.
>> As I said, best it would be to put it in the kernel and run a bunch of
>> benchmarks...
> I think for bigger memcpy's it might make sense to demand stricter
> alignment. What are your numbers for (0/0) ? In my case it seems
> that kernel memcpy is always faster for that. In fact, it seems
> src&63 == dst&63 is generally faster with kernel memcpy.
>
> Patching my tree to WARN_ON_ONCE for when this condition isn't true, I get the following warnings:
>
> WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
> WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
> WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
> WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
> WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
> WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
> WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
> WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
> WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
> WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
> WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
> WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
> WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
> WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
> WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
> WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
> WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
> WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()
>
> The most persistent ones appear to be btrfs' *_extent_buffer helpers;
> they get the most warnings on my system. Apart from that, on my
> system there's not much to gain, since the alignment is already
> close to optimal.
>
> My ext4 /home doesn't throw warnings, so I'd gain the most
> by figuring out if I could improve btrfs/extent_io.c in some way.
> The patch for triggering those warnings is below; change it to WARN_ON
> if you want to see which one happens the most for you.
>
> I was pleasantly surprised though.
The btrfs one that happens far more often than all the others is read_extent_buffer,
but most of those copies are page-aligned on the destination. This means that for me,
avx memcpy might be 10% slower or 10% faster, depending on the specific source
alignment, so avx memcpy wouldn't help much.

This specific one happened far more than any of the other memcpy usages, and
when I ignore the check for page-aligned destinations, most of the warnings are gone.
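
Something like this is what I mean by ignoring the page-aligned case,
i.e. only warn when the destination is not page aligned (an untested
tweak of the condition in the patch from my previous mail):

	WARN_ON_ONCE(__len > 1024 &&
		     ((long)__dst & (PAGE_SIZE - 1)) != 0 &&
		     (((long)__src & 63) != ((long)__dst & 63)));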

In short: I don't think I can get a speedup by using avx memcpy in-kernel.

YMMV; if it does speed up for you, I'd love to see concrete numbers. And not only worst
case, but for the common aligned cases too. Or some concrete numbers showing that
misalignment happens a lot for you.

~Maarten

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-09 11:23                     ` Maarten Lankhorst
@ 2011-09-09 13:42                       ` Borislav Petkov
  0 siblings, 0 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-09-09 13:42 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: Linus Torvalds, Valdis.Kletnieks, Ingo Molnar, melwyn lobo,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2343 bytes --]

On Fri, Sep 09, 2011 at 01:23:09PM +0200, Maarten Lankhorst wrote:
> This specific one happened far more than any of the other memcpy usages, and
> when I ignore the check for page-aligned destinations, most of the warnings are gone.
> 
> In short: I don't think I can get a speedup by using avx memcpy in-kernel.
> 
> YMMV; if it does speed up for you, I'd love to see concrete numbers. And not only worst
> case, but for the common aligned cases too. Or some concrete numbers showing that
> misalignment happens a lot for you.

Actually,

assuming alignment matters, I'd need to redo the trace_printk run I did
initially on buffer sizes:

http://marc.info/?l=linux-kernel&m=131331602309340 (kernel_build.sizes attached)

to get a more sensible grasp on the alignment of kernel buffers along
with their sizes and to see whether we're doing a lot of unaligned large
buffer copies in the kernel. I seriously doubt that, though; we should
be doing everything pagewise anyway, so...

Concerning numbers, I ran your version again and sorted the output by
speedup. The highest scores are:

30037(12/44)	5566.4		12797.2		2.299011642
28672(12/44)	5512.97		12588.7		2.283467991
30037(28/60)	5610.34		12732.7		2.269502799
27852(12/44)	5398.36		12242.4		2.267803859
30037(4/36)	5585.02		12598.6		2.25578257
28672(28/60)	5499.11		12317.5		2.239914033
27852(28/60)	5349.78		11918.9		2.227919527
27852(20/52)	5335.92		11750.7		2.202186795
24576(12/44)	4991.37		10987.2		2.201247446

and this is pretty cool. Here are the (0/0) cases:

8192(0/0)       2627.82         3038.43         1.156255766
12288(0/0)      3116.62         3675.98         1.179475031
13926(0/0)      3330.04         4077.08         1.224334839
14336(0/0)      3377.95         4067.24         1.204055286
15018(0/0)      3465.3          4215.3          1.216430725
16384(0/0)      3623.33         4442.38         1.226050715
24576(0/0)      4629.53         6021.81         1.300737559
27852(0/0)      5026.69         6619.26         1.316823133
28672(0/0)      5157.73         6831.39         1.324495749
30037(0/0)      5322.01         6978.36         1.3112261

It is not 2x anymore but still.

Anyway, looking at the buffer sizes, they're rather ridiculous and even
if we get them in some workload, they won't repeat n times per second to
be relevant. So we'll see...

Thanks.

-- 
Regards/Gruss,
Boris.

[-- Attachment #2: kernel_build.sizes --]
[-- Type: text/plain, Size: 925 bytes --]

Bytes	Count
=====	=====
0	5447
1	3850
2	16255
3	11113
4	68870
5	4256
6	30433
7	19188
8	50490
9	5999
10	78275
11	5628
12	6870
13	7371
14	4742
15	4911
16	143835
17	14096
18	1573
19	13603
20	424321
21	741
22	584
23	450
24	472
25	685
26	367
27	365
28	333
29	301
30	300
31	269
32	489
33	272
34	266
35	220
36	239
37	209
38	249
39	235
40	207
41	181
42	150
43	98
44	194
45	66
46	62
47	52
48	67226
49	138
50	171
51	26
52	20
53	12
54	15
55	4
56	13
57	8
58	6
59	6
60	115
61	10
62	5
63	12
64	67353
65	6
66	2363
67	9
68	11
69	6
70	5
71	6
72	10
73	4
74	9
75	8
76	4
77	6
78	3
79	4
80	3
81	4
82	4
83	4
84	4
85	8
86	6
87	2
88	3
89	2
90	2
91	1
92	9
93	1
94	2
96	2
97	2
98	3
100	2
102	1
104	1
105	1
106	1
107	2
109	1
110	1
111	1
112	1
113	2
115	2
117	1
118	1
119	1
120	14
127	1
128	1
130	1
131	2
134	2
137	1
144	100092
149	1
151	1
153	1
158	1
185	1
217	4
224	3
225	3
227	3
244	1
254	5
255	13
256	21708
512	21746
848	12907
1920	36536
2048	21708

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-09  8:14                 ` Borislav Petkov
  2011-09-09 10:12                   ` Maarten Lankhorst
@ 2011-09-09 14:39                   ` Linus Torvalds
  2011-09-09 15:35                     ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2011-09-09 14:39 UTC (permalink / raw)
  To: Borislav Petkov, Maarten Lankhorst, Linus Torvalds,
	Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra

On Fri, Sep 9, 2011 at 1:14 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> which looks like pretty nice numbers to me. I can't say whether there
> ever is 16K buffer we copy in the kernel but if there were...

Kernel memcpy's are basically almost always smaller than a page size,
because that tends to be the fundamental allocation size.

Yes, there are exceptions that copy into big vmalloc'ed buffers, but
they don't tend to matter. Things like module loading etc.

                     Linus

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-09 14:39                   ` Linus Torvalds
@ 2011-09-09 15:35                     ` Borislav Petkov
  2011-12-05 12:20                       ` melwyn lobo
  0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-09-09 15:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks,
	Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra

On Fri, Sep 09, 2011 at 07:39:18AM -0700, Linus Torvalds wrote:
> Kernel memcpy's are basically almost always smaller than a page size,
> because that tends to be the fundamental allocation size.

Yeah, this is what my trace of a kernel build showed too:

Bytes   Count
=====   =====

...

224     3
225     3
227     3
244     1
254     5
255     13
256     21708
512     21746
848     12907
1920    36536
2048    21708

OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for
example when shuffling network buffers to/from userspace. Converting
those to SSE memcpy might not be as easy as memcpy itself, though.

> Yes, there are exceptions that copy into big vmalloc'ed buffers, but
> they don't tend to matter. Things like module loading etc.

Too small a number of repetitions to matter, yes.

-- 
Regards/Gruss,
Boris.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-09 15:35                     ` Borislav Petkov
@ 2011-12-05 12:20                       ` melwyn lobo
  0 siblings, 0 replies; 40+ messages in thread
From: melwyn lobo @ 2011-12-05 12:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Maarten Lankhorst, Borislav Petkov,
	Valdis.Kletnieks, Ingo Molnar, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Peter Zijlstra

The driver has a loop of memcpy calls whose source and destination
addresses are based on a runtime-computed value, which confuses the
compiler about the alignment.
So instead of generating a neat 32-bit memcpy, gcc generates "rep movsb".
Example code snippet:
src = (char *)kmap(bo->pages[idx]);
src += offset;
memcpy(des, src, len);
By replacing memcpy with ssse3 only for copies larger than 1K bytes (for
my driver the typical length is 2k of metadata from SRAM to DDR) I think
the overhead of the FPU save and restore can be forgiven.
Will SSSE3 work for unaligned pointers as well? If it doesn't, I have
been lucky for the past 6 months :)
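
Roughly what I have in mind in the driver, as a sketch (the SSSE3
routine and its name are placeholders, and whether the begin/end
overhead is acceptable is exactly what I would need to measure):

#include <asm/i387.h>	/* kernel_fpu_begin/end, irq_fpu_usable */

static void copy_metadata(void *dst, const void *src, size_t len)
{
	/* only take the SSSE3 path for large copies, so the FPU
	 * save/restore cost is amortized over the copy itself */
	if (len > 1024 && irq_fpu_usable()) {
		kernel_fpu_begin();
		ssse3_memcpy(dst, src, len);	/* placeholder routine */
		kernel_fpu_end();
	} else {
		memcpy(dst, src, len);
	}
}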


On Fri, Sep 9, 2011 at 9:05 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Sep 09, 2011 at 07:39:18AM -0700, Linus Torvalds wrote:
>> Kernel memcpy's are basically almost always smaller than a page size,
>> because that tends to be the fundamental allocation size.
>
> Yeah, this is what my trace of a kernel build showed too:
>
> Bytes   Count
> =====   =====
>
> ...
>
> 224     3
> 225     3
> 227     3
> 244     1
> 254     5
> 255     13
> 256     21708
> 512     21746
> 848     12907
> 1920    36536
> 2048    21708
>
> OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for
> example when shuffling network buffers to/from userspace. Converting
> those to SSE memcpy might not be as easy as memcpy itself, though.
>
>> Yes, there are exceptions that copy into big vmalloc'ed buffers, but
>> they don't tend to matter. Things like module loading etc.
>
> Too small a number of repetitions to matter, yes.
>
> --
> Regards/Gruss,
> Boris.
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-09-01 15:15         ` Maarten Lankhorst
  2011-09-01 16:18           ` Linus Torvalds
@ 2011-12-05 12:54           ` melwyn lobo
  2011-12-05 14:36             ` Alan Cox
  1 sibling, 1 reply; 40+ messages in thread
From: melwyn lobo @ 2011-12-05 12:54 UTC (permalink / raw)
  To: Maarten Lankhorst
  Cc: Borislav Petkov, Valdis.Kletnieks, Borislav Petkov, Ingo Molnar,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra

Will AVX work on Intel ATOM? I guess not. Then is now not the
time for having architecture-dependent definitions for basic
CPU-intensive tasks?


On Thu, Sep 1, 2011 at 8:45 PM, Maarten Lankhorst
<m.b.lankhorst@gmail.com> wrote:
> Hey,
>
> 2011/8/16 Borislav Petkov <bp@amd64.org>:
>> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
>>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>>
>>> > Benchmarking with 10000 iterations, average results:
>>> > size    XM              MM              speedup
>>> > 119     540.58          449.491         0.8314969419
>>>
>>> > 12273   2307.86         4042.88         1.751787902
>>> > 13924   2431.8          4224.48         1.737184756
>>> > 14335   2469.4          4218.82         1.708440514
>>> > 15018 2675.67         1904.07         0.711622886
>>> > 16374   2989.75         5296.26         1.771470902
>>> > 24564   4262.15         7696.86         1.805863077
>>> > 27852   4362.53         3347.72         0.7673805572
>>> > 28672   5122.8          7113.14         1.388524413
>>> > 30033   4874.62         8740.04         1.792967931
>>>
>>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>>> really good about this till we understand what happened for those two cases.
>>
>> Yep.
>>
>>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>>> rigging took proper note of hot/cold cache issues. That *may* explain
>>> the two oddball results we see above - but not knowing more about how
>>> it was benched, it's hard to say.
>>
>> Yeah, the more scrutiny this gets the better. So I've cleaned up my
>> setup and have attached it.
>>
>> xm_mem.c does the benchmarking and in bench_memcpy() there's the
>> sse_memcpy call which is the SSE memcpy implementation using inline asm.
>> It looks like gcc produces pretty crappy code here because if I replace
>> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
>> same function but in pure asm - I get much better numbers, sometimes
>> even over 2x. It all depends on the alignment of the buffers though.
>> Also, those numbers don't include the context saving/restoring which the
>> kernel does for us.
>>
>> 7491    1509.89         2346.94         1.554378381
>> 8170    2166.81         2857.78         1.318890326
>> 12277   2659.03         4179.31         1.571744176
>> 13907   2571.24         4125.7          1.604558427
>> 14319   2638.74         5799.67         2.19789466      <----
>> 14993   2752.42         4413.85         1.603625603
>> 16371   3479.11         5562.65         1.59887055
>
> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> and I finally figured out why. I also extended the test to an optimized avx memcpy,
> but I think the kernel memcpy will always win in the aligned case.
>
> Those numbers you posted aren't right it seems. It depends a lot on the alignment,
> for example if both are aligned to 64 relative to each other,
> kernel memcpy will win from avx memcpy on my machine.
>
> I replaced the malloc calls with memalign(65536, size + 256) so I could toy
> around with the alignments a little. This explains why for some sizes, kernel
> memcpy was faster than sse memcpy in the test results you had.
> When (src & 63 == dst & 63), it seems that kernel memcpy always wins, otherwise
> avx memcpy might.
>
> If you want to speed up memcpy, I think your best bet is to find out why it's
> so much slower when src and dst aren't 64-byte aligned compared to each other.
>
> Cheers,
> Maarten
>
> ---
> Attached: my modified version of the sse memcpy you posted.
>
> I changed it a bit, and used avx, but some of the other changes might
> be better for your sse memcpy too.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-12-05 12:54           ` melwyn lobo
@ 2011-12-05 14:36             ` Alan Cox
  0 siblings, 0 replies; 40+ messages in thread
From: Alan Cox @ 2011-12-05 14:36 UTC (permalink / raw)
  To: melwyn lobo
  Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks,
	Borislav Petkov, Ingo Molnar, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Peter Zijlstra

> Will AVX work on Intel ATOM? I guess not. Then is now not the
> time for having architecture-dependent definitions for basic
> CPU-intensive tasks?

It's pretty much a necessity if you want to fine tune some of this.

> > If you want to speed up memcpy, I think your best bet is to find out why it's
> > so much slower when src and dst aren't 64-byte aligned compared to each other.

rep mov on most x86 processors is an extremely optimised path. The 64
byte alignment behaviour is to be expected given the processor cache line
size.

Alan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-16  7:19 ` melwyn lobo
@ 2011-08-16  7:43   ` Borislav Petkov
  0 siblings, 0 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-16  7:43 UTC (permalink / raw)
  To: melwyn lobo
  Cc: Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

On Tue, Aug 16, 2011 at 12:49:28PM +0530, melwyn lobo wrote:
> We would rather use the 32 bit patch. Have you already got a 32 bit
> patch?

Nope, only 64-bit for now, sorry.

> How can I use sse3 for 32 bit?

Well, OTTOMH, you have only 8 xmm regs in 32-bit instead of 16, which
should halve the performance of the 64-bit version in a perfect world.
However, we don't know how the performance of a 32-bit SSE memcpy
version behaves vs the gcc builtin one - that would require benchmarking
too.

But other than that, I don't see a problem with having a 32-bit version.

> I don't think you have submitted the 64 bit patch to mainline.
> Is there still work ongoing on this?

Yeah, we are currently benchmarking it to see whether it actually makes
sense to even have SSE memcpy in the kernel.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 14:55 Borislav Petkov
  2011-08-15 14:59 ` Andy Lutomirski
@ 2011-08-16  7:19 ` melwyn lobo
  2011-08-16  7:43   ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: melwyn lobo @ 2011-08-16  7:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

> Yes, on 32-bit you're using the compiler-supplied version
> __builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
> and above. Reportedly, using __builtin_memcpy generates better code.
>
> Btw, my version of SSE memcpy is 64-bit only.
>
> --
> Regards/Gruss,
> Boris.
>
>

We would rather use the 32 bit patch. Have you already got a 32 bit
patch? How can I use sse3 for 32 bit?
I don't think you have submitted the 64 bit patch to mainline.
Is there still work ongoing on this?

Regards,
Melwyn

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 20:05               ` Borislav Petkov
@ 2011-08-15 20:08                 ` Andrew Lutomirski
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 20:08 UTC (permalink / raw)
  To: Borislav Petkov, Andrew Lutomirski, melwyn lobo, Denys Vlasenko,
	Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner,
	Linus Torvalds, Peter Zijlstra, borislav.petkov

On Mon, Aug 15, 2011 at 4:05 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote:
>> > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
>> > shows reasonable speedup there, we might need to make those work too.
>>
>> I'm a little surprised that SSE beats fast string operations, but I
>> guess benchmarking always wins.
>
> If by fast string operations you mean X86_FEATURE_ERMS, then that's
> Intel-only and that actually would need to be benchmarked separately.
> Currently, I see speedup for large(r) buffers only vs rep; movsq. But I
> dunno about rep; movsb's enhanced rep string tricks Intel does.

I meant X86_FEATURE_REP_GOOD.  (That may also be Intel-only, but it
sounds like rep;movsq might move whole cachelines on cpus at least a
few generations back.)  I don't know if any ERMS cpus exist yet.
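
If we ever want to pick a copy variant per CPU, I'd imagine something
in the direction of the sketch below, though the real mechanism would
presumably be alternatives patching; memcpy_erms here is purely
hypothetical:

/* illustrative only: select a copy routine once at boot */
static void *(*memcpy_variant)(void *, const void *, size_t) = __memcpy;

static void __init select_memcpy(void)
{
	if (boot_cpu_has(X86_FEATURE_ERMS))
		memcpy_variant = memcpy_erms;	/* hypothetical rep movsb copy */
}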

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 19:11             ` Andrew Lutomirski
@ 2011-08-15 20:05               ` Borislav Petkov
  2011-08-15 20:08                 ` Andrew Lutomirski
  0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 20:05 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote:
> > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> > shows reasonable speedup there, we might need to make those work too.
> 
> I'm a little surprised that SSE beats fast string operations, but I
> guess benchmarking always wins.

If by fast string operations you mean X86_FEATURE_ERMS, then that's
Intel-only and that actually would need to be benchmarked separately.
Currently, I see speedup for large(r) buffers only vs rep; movsq. But I
dunno about rep; movsb's enhanced rep string tricks Intel does.

> Yes.  But we don't nest that much, and the save/restore isn't all that
> expensive.  And we don't have to save/restore unless kernel entries
> nest and both entries try to use kernel_fpu_begin at the same time.

Yep.

> This whole project may take awhile.  The code in there is a
> poorly-documented mess, even after Hans' cleanups.  (It's a lot worse
> without them, though.)

Oh yeah, this code could use lotsa scrubbing :)

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 18:49           ` Borislav Petkov
@ 2011-08-15 19:11             ` Andrew Lutomirski
  2011-08-15 20:05               ` Borislav Petkov
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 19:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On Mon, Aug 15, 2011 at 2:49 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>>> Or, if we want to use SSE stuff in the kernel, we might think of
>>> allocating its own FPU context(s) and handle those...
>>
>> I'm thinking of having a stack of FPU states to parallel irq stacks
>> and IST stacks.
>
> ... I'm guessing with the same nesting as hardirqs? Making FPU
> instructions usable in irq contexts too.
>
>> It gets a little hairy when code inside kernel_fpu_begin traps for a
>> non-irq non-IST reason, though.
>
> How does that happen? You're in the kernel with preemption disabled and
> TS cleared, what would cause the #NM? I think that if you need to switch
> context, you simply "push" the current FPU context, allocate a new one
> and clts as part of the FPU context switching, no?

Not #NM, but page faults can happen too (even just accessing vmalloc space).

>
>> Fortunately, those are rare and all of the EX_TABLE users could mark
>> xmm regs as clobbered (except for copy_from_user...).
>
> Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> shows reasonable speedup there, we might need to make those work too.

I'm a little surprised that SSE beats fast string operations, but I
guess benchmarking always wins.

>
>> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
>> extra FPU state can be per-cpu and not per-task.
>
> Yep.
>
>> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>>
>> The major speedup will come from saving state in kernel_fpu_begin but
>> not restoring it until the code in entry_??.S restores registers.
>
> But you'd need to save each kernel FPU state when nesting, no?
>

Yes.  But we don't nest that much, and the save/restore isn't all that
expensive.  And we don't have to save/restore unless kernel entries
nest and both entries try to use kernel_fpu_begin at the same time.

This whole project may take awhile.  The code in there is a
poorly-documented mess, even after Hans' cleanups.  (It's a lot worse
without them, though.)

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 18:35             ` Andrew Lutomirski
@ 2011-08-15 18:52               ` H. Peter Anvin
  0 siblings, 0 replies; 40+ messages in thread
From: H. Peter Anvin @ 2011-08-15 18:52 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On 08/15/2011 11:35 AM, Andrew Lutomirski wrote:
> 
> Are there any architecture-neutral users of this thing?

Look at the RAID-6 code, for example.  It makes the various
architecture-specific codes look more similar.

	-hpa

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 17:04         ` Andrew Lutomirski
@ 2011-08-15 18:49           ` Borislav Petkov
  2011-08-15 19:11             ` Andrew Lutomirski
  0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 18:49 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra, borislav.petkov

On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
>> This would obviate the need to muck with contexts but that could get
>> expensive wrt stack operations. The advantage is that I'm not dealing
>> with the whole FPU state but only with 16 XMM regs. I should probably
>> dust off that version again and retest.
>
> I bet it won't be a significant win.  On Sandy Bridge, clts/stts takes
> 80 ns and a full state save+restore is only ~60 ns.
> Without infrastructure changes, I don't think you can avoid the clts
> and stts.

Yeah, probably.

> You might be able to get away with turning off IRQs, reading CR0 to
> check TS, pushing XMM regs, and being very certain that you don't
> accidentally generate any VEX-coded instructions.

That's ok - I'm using movaps/movups. But, the problem is that I still
need to save FPU state if the task I'm interrupting has been using FPU
instructions. So, I can't get away without saving the context in which
case I don't need to save the XMM regs anyway.

>> Or, if we want to use SSE stuff in the kernel, we might think of
>> allocating its own FPU context(s) and handle those...
>
> I'm thinking of having a stack of FPU states to parallel irq stacks
> and IST stacks.

... I'm guessing with the same nesting as hardirqs? Making FPU
instructions usable in irq contexts too.

> It gets a little hairy when code inside kernel_fpu_begin traps for a
> non-irq non-IST reason, though.

How does that happen? You're in the kernel with preemption disabled and
TS cleared, what would cause the #NM? I think that if you need to switch
context, you simply "push" the current FPU context, allocate a new one
and clts as part of the FPU context switching, no?

> Fortunately, those are rare and all of the EX_TABLE users could mark
> xmm regs as clobbered (except for copy_from_user...).

Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
shows reasonable speedup there, we might need to make those work too.

> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
> extra FPU state can be per-cpu and not per-task.

Yep.

> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>
> The major speedup will come from saving state in kernel_fpu_begin but
> not restoring it until the code in entry_??.S restores registers.

But you'd need to save each kernel FPU state when nesting, no?

>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>
>> Well, do we want to use floating point instructions in the kernel?
>
> The only use I could find is in staging.

Exactly my point - I think we should do it only when it's really worth
the trouble.

-- 
Regards/Gruss,
Boris.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 18:26           ` H. Peter Anvin
@ 2011-08-15 18:35             ` Andrew Lutomirski
  2011-08-15 18:52               ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 18:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On Mon, Aug 15, 2011 at 2:26 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/15/2011 09:58 AM, Andrew Lutomirski wrote:
>> On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>>>
>>>> (*)  kernel_fpu_begin is a bad name.  It's only safe to use integer
>>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>>> 387 equivalent) could contain garbage.
>>>>
>>>
>>> Uh... no, it just means you have to initialize the settings.  It's a
>>> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
>>
>> I prefer get_xstate / put_xstate, but this could rapidly devolve into
>> bikeshedding. :)
>>
>
> a) Quite.
>
> b) xstate is not architecture-neutral.

Are there any architecture-neutral users of this thing?  If I were
writing generic code, I would expect:

kernel_fpu_begin();
foo *= 1.5;
kernel_fpu_end();

to work, but I would not expect:

kernel_fpu_begin();
use_xmm_registers();
kernel_fpu_end();

to make any sense.

Since the former does not actually work, I would hope that there is no
non-x86-specific user.

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 16:58         ` Andrew Lutomirski
@ 2011-08-15 18:26           ` H. Peter Anvin
  2011-08-15 18:35             ` Andrew Lutomirski
  0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2011-08-15 18:26 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On 08/15/2011 09:58 AM, Andrew Lutomirski wrote:
> On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>>
>>> (*)  kernel_fpu_begin is a bad name.  It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>>
>>
>> Uh... no, it just means you have to initialize the settings.  It's a
>> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
> 
> I prefer get_xstate / put_xstate, but this could rapidly devolve into
> bikeshedding. :)
> 

a) Quite.

b) xstate is not architecture-neutral.

	-hpa


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 16:12       ` Borislav Petkov
@ 2011-08-15 17:04         ` Andrew Lutomirski
  2011-08-15 18:49           ` Borislav Petkov
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 17:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On Mon, Aug 15, 2011 at 12:12 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote:
>>> But still, irq_fpu_usable() still checks !in_interrupt() which means
>>> that we don't want to run SSE instructions in IRQ context. OTOH, we
>>> still are fine when running with CR0.TS. So what happens when we get an
>>> #NM as a result of executing an FPU instruction in an IRQ handler? We
>>> will have to do init_fpu() on the current task if the last hasn't used
>>> math yet and do the slab allocation of the FPU context area (I'm looking
>>> at math_state_restore, btw).
>>
>> IIRC kernel_fpu_begin does clts, so #NM won't happen.  But if we're in
>> an interrupt and TS=1, when we know that we're not in a
>> kernel_fpu_begin section, so it's safe to start one (and do clts).
>
> Doh, yes, I see it now. This way we save the math state of the current
> process if needed and "disable" #NM exceptions until kernel_fpu_end() by
> clearing CR0.TS, sure. Thanks.
>
>> IMO this code is not very good, and I plan to fix it sooner or later.
>
> Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
> You could probably reuse some bits from there. The patchset should be in
> tip/x86/xsave.
>
>> I want kernel_fpu_begin (or its equivalent*) to be very fast and
>> usable from any context whatsoever.  Mucking with TS is slower than a
>> complete save and restore of YMM state.
>
> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
> This would obviate the need to muck with contexts but that could get
> expensive wrt stack operations. The advantage is that I'm not dealing
> with the whole FPU state but only with 16 XMM regs. I should probably
> dust off that version again and retest.

I bet it won't be a significant win.  On Sandy Bridge, clts/stts takes
80 ns and a full state save+restore is only ~60 ns.  Without
infrastructure changes, I don't think you can avoid the clts and stts.

You might be able to get away with turning off IRQs, reading CR0 to
check TS, pushing XMM regs, and being very certain that you don't
accidentally generate any VEX-coded instructions.

>
> Or, if we want to use SSE stuff in the kernel, we might think of
> allocating its own FPU context(s) and handle those...

I'm thinking of having a stack of FPU states to parallel irq stacks
and IST stacks.  It gets a little hairy when code inside
kernel_fpu_begin traps for a non-irq non-IST reason, though.
Fortunately, those are rare and all of the EX_TABLE users could mark
xmm regs as clobbered (except for copy_from_user...).  Keeping
kernel_fpu_begin non-preemptable makes it less bad because the extra
FPU state can be per-cpu and not per-task.

This is extra fun on 32 bit, which IIRC doesn't have IST stacks.

The major speedup will come from saving state in kernel_fpu_begin but
not restoring it until the code in entry_??.S restores registers.

>
>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
>
> Well, do we want to use floating point instructions in the kernel?

The only use I could find is in staging.

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 16:12       ` H. Peter Anvin
@ 2011-08-15 16:58         ` Andrew Lutomirski
  2011-08-15 18:26           ` H. Peter Anvin
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 16:58 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>
>> (*)  kernel_fpu_begin is a bad name.  It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
>>
>
> Uh... no, it just means you have to initialize the settings.  It's a
> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.

I prefer get_xstate / put_xstate, but this could rapidly devolve into
bikeshedding. :)

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 15:36     ` Andrew Lutomirski
  2011-08-15 16:12       ` Borislav Petkov
@ 2011-08-15 16:12       ` H. Peter Anvin
  2011-08-15 16:58         ` Andrew Lutomirski
  1 sibling, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2011-08-15 16:12 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
> 
> (*)  kernel_fpu_begin is a bad name.  It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.
> 

Uh... no, it just means you have to initialize the settings.  It's a
perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.

	-hpa



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 15:36     ` Andrew Lutomirski
@ 2011-08-15 16:12       ` Borislav Petkov
  2011-08-15 17:04         ` Andrew Lutomirski
  2011-08-15 16:12       ` H. Peter Anvin
  1 sibling, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 16:12 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra, borislav.petkov

On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote:
>> But still, irq_fpu_usable() still checks !in_interrupt() which means
>> that we don't want to run SSE instructions in IRQ context. OTOH, we
>> still are fine when running with CR0.TS. So what happens when we get an
>> #NM as a result of executing an FPU instruction in an IRQ handler? We
>> will have to do init_fpu() on the current task if the last hasn't used
>> math yet and do the slab allocation of the FPU context area (I'm looking
>> at math_state_restore, btw).
>
> IIRC kernel_fpu_begin does clts, so #NM won't happen.  But if we're in
> an interrupt and TS=1, then we know that we're not in a
> kernel_fpu_begin section, so it's safe to start one (and do clts).

Doh, yes, I see it now. This way we save the math state of the current
process if needed and "disable" #NM exceptions until kernel_fpu_end() by
clearing CR0.TS, sure. Thanks.

> IMO this code is not very good, and I plan to fix it sooner or later.

Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
You could probably reuse some bits from there. The patchset should be in
tip/x86/xsave.

> I want kernel_fpu_begin (or its equivalent*) to be very fast and
> usable from any context whatsoever.  Mucking with TS is slower than a
> complete save and restore of YMM state.

Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
This would obviate the need to muck with contexts but that could get
expensive wrt stack operations. The advantage is that I'm not dealing
with the whole FPU state but only with 16 XMM regs. I should probably
dust off that version again and retest.

Or, if we want to use SSE stuff in the kernel, we might think of
allocating its own FPU context(s) and handle those...

> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.

Well, do we want to use floating point instructions in the kernel?

Thanks.

-- 
Regards/Gruss,
Boris.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 15:29   ` Borislav Petkov
@ 2011-08-15 15:36     ` Andrew Lutomirski
  2011-08-15 16:12       ` Borislav Petkov
  2011-08-15 16:12       ` H. Peter Anvin
  0 siblings, 2 replies; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 15:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On Mon, Aug 15, 2011 at 11:29 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote:
>>>> So what is the reason we cannot use sse_memcpy in interrupt context.
>>>> (fpu registers not saved ? )
>>>
>>> Because, AFAICT, when we handle an #NM exception while running
>>> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
>>> area, which in turn, can sleep. Then, we might get another IRQ while
>>> sleeping and we should be deadlocked.
>>>
>>> But let me stress on the "AFAICT" above, someone who actually knows the
>>> FPU code should correct me if I'm missing something.
>>
>> I don't think you ever get #NM as a result of kernel_fpu_begin, but you
>> can certainly have problems when kernel_fpu_begin nests by accident.
>> There's irq_fpu_usable() for this.
>>
>> (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)
>
> Oh I didn't know about irq_fpu_usable(), thanks.
>
> But still, irq_fpu_usable() still checks !in_interrupt() which means
> that we don't want to run SSE instructions in IRQ context. OTOH, we
> still are fine when running with CR0.TS. So what happens when we get an
> #NM as a result of executing an FPU instruction in an IRQ handler? We
> will have to do init_fpu() on the current task if the last hasn't used
> math yet and do the slab allocation of the FPU context area (I'm looking
> at math_state_restore, btw).

IIRC kernel_fpu_begin does clts, so #NM won't happen.  But if we're in
an interrupt and TS=1, then we know that we're not in a
kernel_fpu_begin section, so it's safe to start one (and do clts).
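
I.e. roughly this usage pattern (just a sketch of the call site, not
of the implementation):

	if (irq_fpu_usable()) {
		kernel_fpu_begin();	/* does clts, so no #NM inside */
		/* ... SSE copy here ... */
		kernel_fpu_end();
	} else {
		memcpy(dst, src, len);	/* plain fallback */
	}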

IMO this code is not very good, and I plan to fix it sooner or later.
I want kernel_fpu_begin (or its equivalent*) to be very fast and
usable from any context whatsoever.  Mucking with TS is slower than a
complete save and restore of YMM state.

(*)  kernel_fpu_begin is a bad name.  It's only safe to use integer
instructions inside a kernel_fpu_begin section because MXCSR (and the
387 equivalent) could contain garbage.

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 14:59 ` Andy Lutomirski
@ 2011-08-15 15:29   ` Borislav Petkov
  2011-08-15 15:36     ` Andrew Lutomirski
  0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 15:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra, borislav.petkov

On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote:
>>> So what is the reason we cannot use sse_memcpy in interrupt context.
>>> (fpu registers not saved ? )
>>
>> Because, AFAICT, when we handle an #NM exception while running
>> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
>> area, which in turn, can sleep. Then, we might get another IRQ while
>> sleeping and we should be deadlocked.
>>
>> But let me stress on the "AFAICT" above, someone who actually knows the
>> FPU code should correct me if I'm missing something.
>
> I don't think you ever get #NM as a result of kernel_fpu_begin, but you
> can certainly have problems when kernel_fpu_begin nests by accident.
> There's irq_fpu_usable() for this.
>
> (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)

Oh I didn't know about irq_fpu_usable(), thanks.

But still, irq_fpu_usable() still checks !in_interrupt() which means
that we don't want to run SSE instructions in IRQ context. OTOH, we
still are fine when running with CR0.TS. So what happens when we get an
#NM as a result of executing an FPU instruction in an IRQ handler? We
will have to do init_fpu() on the current task if the last hasn't used
math yet and do the slab allocation of the FPU context area (I'm looking
at math_state_restore, btw).

Thanks.

-- 
Regards/Gruss,
Boris.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
  2011-08-15 14:55 Borislav Petkov
@ 2011-08-15 14:59 ` Andy Lutomirski
  2011-08-15 15:29   ` Borislav Petkov
  2011-08-16  7:19 ` melwyn lobo
  1 sibling, 1 reply; 40+ messages in thread
From: Andy Lutomirski @ 2011-08-15 14:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
	H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
	borislav.petkov

On 08/15/2011 10:55 AM, Borislav Petkov wrote:
> On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
>> Hi,
>> Was on a vacation for last two days. Thanks for the good insights into
>> the issue.
>> Ingo, unfortunately the data we have is on a soon to be released
>> platform and strictly confidential at this stage.
>>
>> Boris, thanks for the patch. On seeing your patch:
>> +void *__sse_memcpy(void *to, const void *from, size_t len)
>> +{
>> +       unsigned long src = (unsigned long)from;
>> +       unsigned long dst = (unsigned long)to;
>> +       void *p = to;
>> +       int i;
>> +
>> +       if (in_interrupt())
>> +               return __memcpy(to, from, len)
>> So what is the reason we cannot use sse_memcpy in interrupt context.
>> (fpu registers not saved ? )
>
> Because, AFAICT, when we handle an #NM exception while running
> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
> area, which in turn, can sleep. Then, we might get another IRQ while
> sleeping and we should be deadlocked.
>
> But let me stress on the "AFAICT" above, someone who actually knows the
> FPU code should correct me if I'm missing something.

I don't think you ever get #NM as a result of kernel_fpu_begin, but you 
can certainly have problems when kernel_fpu_begin nests by accident. 
There's irq_fpu_usable() for this.

(irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: x86 memcpy performance
@ 2011-08-15 14:55 Borislav Petkov
  2011-08-15 14:59 ` Andy Lutomirski
  2011-08-16  7:19 ` melwyn lobo
  0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 14:55 UTC (permalink / raw)
  To: melwyn lobo
  Cc: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra, borislav.petkov

On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
> Hi,
> Was on a vacation for last two days. Thanks for the good insights into
> the issue.
> Ingo, unfortunately the data we have is on a soon to be released
> platform and strictly confidential at this stage.
>
> Boris, thanks for the patch. On seeing your patch:
> +void *__sse_memcpy(void *to, const void *from, size_t len)
> +{
> +       unsigned long src = (unsigned long)from;
> +       unsigned long dst = (unsigned long)to;
> +       void *p = to;
> +       int i;
> +
> +       if (in_interrupt())
> +               return __memcpy(to, from, len)
> So what is the reason we cannot use sse_memcpy in interrupt context.
> (fpu registers not saved ? )

Because, AFAICT, when we handle an #NM exception while running
sse_memcpy in an IRQ handler, we might need to allocate FPU save state
area, which in turn, can sleep. Then, we might get another IRQ while
sleeping and we should be deadlocked.

But let me stress on the "AFAICT" above, someone who actually knows the
FPU code should correct me if I'm missing something.

> My question is still not answered. There are 3 versions of memcpy in
> kernel:
>
> ***********************************arch/x86/include/asm/string_32.h******************************
> 179 #ifndef CONFIG_KMEMCHECK
> 180
> 181 #if (__GNUC__ >= 4)
> 182 #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
> 183 #else
> 184 #define memcpy(t, f, n)                         \
> 185         (__builtin_constant_p((n))              \
> 186          ? __constant_memcpy((t), (f), (n))     \
> 187          : __memcpy((t), (f), (n)))
> 188 #endif
> 189 #else
> 190 /*
> 191  * kmemcheck becomes very happy if we use the REP instructions
> unconditionally,
> 192  * because it means that we know both memory operands in advance.
> 193  */
> 194 #define memcpy(t, f, n) __memcpy((t), (f), (n))
> 195 #endif
> 196
> 197
> ****************************************************************************************.
> I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy() ) as this
> is valid only for AMD and not for Atom Z5xx series.
> This means __memcpy, __constant_memcpy, __builtin_memcpy .
> I have a hunch by default we were using  __builtin_memcpy.
> This is because I see my GCC version >=4 and CONFIG_KMEMCHECK
> not defined. Can someone confirm of these 3 which is used, with
> i386_defconfig. Again with i386_defconfig which workloads provide the
> best results with the default implementation.

Yes, on 32-bit you're using the compiler-supplied version
__builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
and above. Reportedly, using __builtin_memcpy generates better code.

Btw, my version of SSE memcpy is 64-bit only.

-- 
Regards/Gruss,
Boris.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-12-05 14:35 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-12 17:59 x86 memcpy performance melwyn lobo
2011-08-12 18:33 ` Andi Kleen
2011-08-12 19:52 ` Ingo Molnar
2011-08-14  9:59   ` Borislav Petkov
2011-08-14 11:13     ` Denys Vlasenko
2011-08-14 12:40       ` Borislav Petkov
2011-08-15 13:27         ` melwyn lobo
2011-08-15 13:44         ` Denys Vlasenko
2011-08-16  2:34     ` Valdis.Kletnieks
2011-08-16 12:16       ` Borislav Petkov
2011-09-01 15:15         ` Maarten Lankhorst
2011-09-01 16:18           ` Linus Torvalds
2011-09-08  8:35             ` Borislav Petkov
2011-09-08 10:58               ` Maarten Lankhorst
2011-09-09  8:14                 ` Borislav Petkov
2011-09-09 10:12                   ` Maarten Lankhorst
2011-09-09 11:23                     ` Maarten Lankhorst
2011-09-09 13:42                       ` Borislav Petkov
2011-09-09 14:39                   ` Linus Torvalds
2011-09-09 15:35                     ` Borislav Petkov
2011-12-05 12:20                       ` melwyn lobo
2011-12-05 12:54           ` melwyn lobo
2011-12-05 14:36             ` Alan Cox
2011-08-15 14:55 Borislav Petkov
2011-08-15 14:59 ` Andy Lutomirski
2011-08-15 15:29   ` Borislav Petkov
2011-08-15 15:36     ` Andrew Lutomirski
2011-08-15 16:12       ` Borislav Petkov
2011-08-15 17:04         ` Andrew Lutomirski
2011-08-15 18:49           ` Borislav Petkov
2011-08-15 19:11             ` Andrew Lutomirski
2011-08-15 20:05               ` Borislav Petkov
2011-08-15 20:08                 ` Andrew Lutomirski
2011-08-15 16:12       ` H. Peter Anvin
2011-08-15 16:58         ` Andrew Lutomirski
2011-08-15 18:26           ` H. Peter Anvin
2011-08-15 18:35             ` Andrew Lutomirski
2011-08-15 18:52               ` H. Peter Anvin
2011-08-16  7:19 ` melwyn lobo
2011-08-16  7:43   ` Borislav Petkov
