* x86 memcpy performance
@ 2011-08-12 17:59 melwyn lobo
2011-08-12 18:33 ` Andi Kleen
2011-08-12 19:52 ` Ingo Molnar
0 siblings, 2 replies; 40+ messages in thread
From: melwyn lobo @ 2011-08-12 17:59 UTC (permalink / raw)
To: linux-kernel
Hi All,
Our video recorder application uses memcpy for every frame, about 2KB of
data per frame, on an Intel® Atom™ Z5xx processor.
With the default 2.6.35 kernel we got 19.6 fps. But the kernel's memcpy
implementation seems suboptimal: when we replaced it with an optimized
one (using SSSE3; exact patches are currently being finalized) we
obtained 22 fps, a gain of 12.2%.
C0 residency also dropped from 75% to 67%, which means power benefits too.
My questions:
1. Is the kernel memcpy profiled for optimal performance?
2. Does the default kernel configuration for i386 include the best
memcpy implementation (AMD 3DNOW, __builtin_memcpy, etc.)?
Any suggestions, prior experience on this is welcome.
Thanks,
M.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-12 17:59 x86 memcpy performance melwyn lobo
@ 2011-08-12 18:33 ` Andi Kleen
2011-08-12 19:52 ` Ingo Molnar
1 sibling, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2011-08-12 18:33 UTC (permalink / raw)
To: melwyn lobo; +Cc: linux-kernel
melwyn lobo <linux.melwyn@gmail.com> writes:
> Hi All,
> Our video recorder application uses memcpy for every frame, about 2KB of
> data per frame, on an Intel® Atom™ Z5xx processor.
> With the default 2.6.35 kernel we got 19.6 fps. But the kernel's memcpy
> implementation seems suboptimal: when we replaced it with an optimized
> one (using SSSE3; exact patches are currently being finalized) we
> obtained 22 fps, a gain of 12.2%.
SSE3 in the kernel memcpy would be incredibly expensive;
it would need a full FPU state save for every call, with preemption
disabled.
I haven't seen your patches, but until you get all that
right (and add a lot more overhead to most copies) you
currently have a good chance of corrupting user FPU state.
> C0 residency also dropped from 75% to 67%, which means power benefits too.
> My questions:
> 1. Is the kernel memcpy profiled for optimal performance?
It depends on the CPU.
There have been some improvements for Atom in newer kernels,
I believe.
But then, the kernel memcpy is usually optimized for relatively
small copies (<= 4K), because very few kernel loads do more.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: x86 memcpy performance
2011-08-12 17:59 x86 memcpy performance melwyn lobo
2011-08-12 18:33 ` Andi Kleen
@ 2011-08-12 19:52 ` Ingo Molnar
2011-08-14 9:59 ` Borislav Petkov
1 sibling, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2011-08-12 19:52 UTC (permalink / raw)
To: melwyn lobo
Cc: linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra
* melwyn lobo <linux.melwyn@gmail.com> wrote:
> Hi All,
> Our video recorder application uses memcpy for every frame, about 2KB of
> data per frame, on an Intel® Atom™ Z5xx processor.
> With the default 2.6.35 kernel we got 19.6 fps. But the kernel's memcpy
> implementation seems suboptimal: when we replaced it with an optimized
> one (using SSSE3; exact patches are currently being finalized) we
> obtained 22 fps, a gain of 12.2%.
> C0 residency also dropped from 75% to 67%, which means power benefits too.
> My questions:
> 1. Is the kernel memcpy profiled for optimal performance?
> 2. Does the default kernel configuration for i386 include the best
> memcpy implementation (AMD 3DNOW, __builtin_memcpy, etc.)?
>
> Any suggestions, prior experience on this is welcome.
Sounds very interesting - it would be nice to see 'perf record' +
'perf report' profiles done on that workload, before and after your
patches.
The thing is, we obviously want to achieve that 12.2% fps gain, and
while we probably do not want to switch the kernel's memcpy to
SSE right now (the save/restore costs are significant), we could
certainly try to optimize the specific codepath that your video
playback path is hitting.
If it's some bulk memcpy in a key video driver then we could offer a
bulk-optimized x86 memcpy variant which could be called from that
driver - and that could use SSE3 as well.
So yes, if the speedup is real then i'm sure we can achieve that
speedup - but exact profiles and measurements would have to be shown.
Thanks,
Ingo
* Re: x86 memcpy performance
2011-08-12 19:52 ` Ingo Molnar
@ 2011-08-14 9:59 ` Borislav Petkov
2011-08-14 11:13 ` Denys Vlasenko
2011-08-16 2:34 ` Valdis.Kletnieks
0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-14 9:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner,
Linus Torvalds, Peter Zijlstra, borislav.petkov
[-- Attachment #1: Type: text/plain, Size: 12636 bytes --]
On Fri, Aug 12, 2011 at 09:52:20PM +0200, Ingo Molnar wrote:
> Sounds very interesting - it would be nice to see 'perf record' +
> 'perf report' profiles done on that workload, before and after your
> patches.
FWIW, I've been playing with an SSE memcpy version for the kernel
recently too; here's what I have so far.
First of all, I traced all the memcpy buffer sizes used while
building a kernel; see the attached kernel_build.sizes.
On the one hand, there is a large number of small chunks copied (1.1M
of 1.2M calls total); on the other, a relatively small number of
larger copies (256 - 2048 bytes), about 100K in total, which nevertheless
account for the larger cumulative amount of data copied: 138MB
of 175MB total. So, if the buffer copied is big enough, the context
save/restore cost might be something we're willing to pay.
I first implemented the SSE memcpy in userspace to measure the
speedup vs. the memcpy_64 we have right now:
Benchmarking with 10000 iterations, average results:
size XM MM speedup
119 540.58 449.491 0.8314969419
189 296.318 263.507 0.8892692985
206 297.949 271.399 0.9108923485
224 255.565 235.38 0.9210161798
221 299.383 276.628 0.9239941159
245 299.806 279.432 0.9320430545
369 314.774 316.89 1.006721324
425 327.536 330.475 1.00897153
439 330.847 334.532 1.01113687
458 333.159 340.124 1.020904708
503 334.44 352.166 1.053003229
767 375.612 429.949 1.144661625
870 358.888 312.572 0.8709465025
882 394.297 454.977 1.153893229
925 403.82 472.56 1.170222413
1009 407.147 490.171 1.203915735
1525 512.059 660.133 1.289174911
1737 556.85 725.552 1.302958536
1778 533.839 711.59 1.332965994
1864 558.06 745.317 1.335549882
2039 585.915 813.806 1.388949687
3068 766.462 1105.56 1.442422252
3471 883.983 1239.99 1.40272883
3570 895.822 1266.74 1.414057295
3748 906.832 1302.4 1.436212771
4086 957.649 1486.93 1.552686041
6130 1238.45 1996.42 1.612023046
6961 1413.11 2201.55 1.557939181
7162 1385.5 2216.49 1.59977178
7499 1440.87 2330.12 1.617158856
8182 1610.74 2720.45 1.688950194
12273 2307.86 4042.88 1.751787902
13924 2431.8 4224.48 1.737184756
14335 2469.4 4218.82 1.708440514
15018 2675.67 1904.07 0.711622886
16374 2989.75 5296.26 1.771470902
24564 4262.15 7696.86 1.805863077
27852 4362.53 3347.72 0.7673805572
28672 5122.8 7113.14 1.388524413
30033 4874.62 8740.04 1.792967931
32768 6014.78 7564.2 1.257603505
49142 14464.2 21114.2 1.459757233
55702 16055 23496.8 1.463523623
57339 16725.7 24553.8 1.46803388
60073 17451.5 24407.3 1.398579162
Sizes were run with randomly generated misalignment to test the implementation.
I've implemented the SSE memcpy similar to arch/x86/lib/mmx_32.c and did
some kernel build traces:
with SSE memcpy
===============
Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):
3301761.517649 task-clock # 24.001 CPUs utilized ( +- 1.48% )
520,658 context-switches # 0.000 M/sec ( +- 0.25% )
63,845 CPU-migrations # 0.000 M/sec ( +- 0.58% )
26,070,835 page-faults # 0.008 M/sec ( +- 0.00% )
1,812,482,599,021 cycles # 0.549 GHz ( +- 0.85% ) [64.55%]
551,783,051,492 stalled-cycles-frontend # 30.44% frontend cycles idle ( +- 0.98% ) [65.64%]
444,996,901,060 stalled-cycles-backend # 24.55% backend cycles idle ( +- 1.15% ) [67.16%]
1,488,917,931,766 instructions # 0.82 insns per cycle
# 0.37 stalled cycles per insn ( +- 0.91% ) [69.25%]
340,575,978,517 branches # 103.150 M/sec ( +- 0.99% ) [68.29%]
21,519,667,206 branch-misses # 6.32% of all branches ( +- 1.09% ) [65.11%]
137.567155255 seconds time elapsed ( +- 1.48% )
plain 3.0
=========
Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):
3504754.425527 task-clock # 24.001 CPUs utilized ( +- 1.31% )
518,139 context-switches # 0.000 M/sec ( +- 0.32% )
61,790 CPU-migrations # 0.000 M/sec ( +- 0.73% )
26,056,947 page-faults # 0.007 M/sec ( +- 0.00% )
1,826,757,751,616 cycles # 0.521 GHz ( +- 0.66% ) [63.86%]
557,800,617,954 stalled-cycles-frontend # 30.54% frontend cycles idle ( +- 0.79% ) [64.65%]
443,950,768,357 stalled-cycles-backend # 24.30% backend cycles idle ( +- 0.60% ) [67.07%]
1,469,707,613,500 instructions # 0.80 insns per cycle
# 0.38 stalled cycles per insn ( +- 0.68% ) [69.98%]
335,560,565,070 branches # 95.744 M/sec ( +- 0.67% ) [69.09%]
21,365,279,176 branch-misses # 6.37% of all branches ( +- 0.65% ) [65.36%]
146.025263276 seconds time elapsed ( +- 1.31% )
So, although a kernel build is probably not the proper workload for an
SSE memcpy routine, I'm seeing an ~8.5-second build time improvement,
i.e. something around 6%. We're executing somewhat more instructions,
but I'd say the amount of data moved per instruction is higher due to
the 128-bit moves.
Here's the SSE memcpy version I have so far. I haven't wired in the
proper CPU feature detection yet because we want to run more benchmarks
like netperf and such to see whether we get any positive results there.
The SYSTEM_RUNNING check is to take care of early-boot situations where
we can't handle FPU exceptions but still use memcpy. There's an aligned
and a misaligned variant which should handle any buffers and sizes,
although I've set the SSE memcpy threshold at a minimum buffer size of
512 bytes to amortize the context save/restore cost somewhat.
Comments are much appreciated! :-)
--
From 385519e844f3466f500774c2c37afe44691ef8d2 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 11 Aug 2011 18:43:08 +0200
Subject: [PATCH] SSE3 memcpy in C
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
arch/x86/include/asm/string_64.h | 14 ++++-
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/sse_memcpy_64.c | 133 ++++++++++++++++++++++++++++++++++++++
3 files changed, 146 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/lib/sse_memcpy_64.c
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..7bd51bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
#define __HAVE_ARCH_MEMCPY 1
#ifndef CONFIG_KMEMCHECK
+extern void *__memcpy(void *to, const void *from, size_t len);
+extern void *__sse_memcpy(void *to, const void *from, size_t len);
#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
-extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len) \
+({ \
+ size_t __len = (len); \
+ void *__ret; \
+ if (__len >= 512) \
+ __ret = __sse_memcpy((dst), (src), __len); \
+ else \
+ __ret = __memcpy((dst), (src), __len); \
+ __ret; \
+})
#else
-extern void *__memcpy(void *to, const void *from, size_t len);
#define memcpy(dst, src, len) \
({ \
size_t __len = (len); \
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f2479f1..5f90709 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -36,7 +36,7 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
endif
lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
else
- obj-y += iomap_copy_64.o
+ obj-y += iomap_copy_64.o sse_memcpy_64.o
lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
lib-y += thunk_64.o clear_page_64.o copy_page_64.o
lib-y += memmove_64.o memset_64.o
diff --git a/arch/x86/lib/sse_memcpy_64.c b/arch/x86/lib/sse_memcpy_64.c
new file mode 100644
index 0000000..b53fc31
--- /dev/null
+++ b/arch/x86/lib/sse_memcpy_64.c
@@ -0,0 +1,133 @@
+#include <linux/module.h>
+
+#include <asm/i387.h>
+#include <asm/string_64.h>
+
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+ unsigned long src = (unsigned long)from;
+ unsigned long dst = (unsigned long)to;
+ void *p = to;
+ int i;
+
+ if (in_interrupt())
+ return __memcpy(to, from, len);
+
+ if (system_state != SYSTEM_RUNNING)
+ return __memcpy(to, from, len);
+
+ kernel_fpu_begin();
+
+ /* check alignment */
+ if ((src ^ dst) & 0xf)
+ goto unaligned;
+
+ if (src & 0xf) {
+ u8 chunk = 0x10 - (src & 0xf);
+
+ /* copy chunk until next 16-byte */
+ __memcpy(to, from, chunk);
+ len -= chunk;
+ to += chunk;
+ from += chunk;
+ }
+
+ /*
+ * copy in 256 Byte portions
+ */
+ for (i = 0; i < (len & ~0xff); i += 256) {
+ asm volatile(
+ "movaps 0x0(%0), %%xmm0\n\t"
+ "movaps 0x10(%0), %%xmm1\n\t"
+ "movaps 0x20(%0), %%xmm2\n\t"
+ "movaps 0x30(%0), %%xmm3\n\t"
+ "movaps 0x40(%0), %%xmm4\n\t"
+ "movaps 0x50(%0), %%xmm5\n\t"
+ "movaps 0x60(%0), %%xmm6\n\t"
+ "movaps 0x70(%0), %%xmm7\n\t"
+ "movaps 0x80(%0), %%xmm8\n\t"
+ "movaps 0x90(%0), %%xmm9\n\t"
+ "movaps 0xa0(%0), %%xmm10\n\t"
+ "movaps 0xb0(%0), %%xmm11\n\t"
+ "movaps 0xc0(%0), %%xmm12\n\t"
+ "movaps 0xd0(%0), %%xmm13\n\t"
+ "movaps 0xe0(%0), %%xmm14\n\t"
+ "movaps 0xf0(%0), %%xmm15\n\t"
+
+ "movaps %%xmm0, 0x0(%1)\n\t"
+ "movaps %%xmm1, 0x10(%1)\n\t"
+ "movaps %%xmm2, 0x20(%1)\n\t"
+ "movaps %%xmm3, 0x30(%1)\n\t"
+ "movaps %%xmm4, 0x40(%1)\n\t"
+ "movaps %%xmm5, 0x50(%1)\n\t"
+ "movaps %%xmm6, 0x60(%1)\n\t"
+ "movaps %%xmm7, 0x70(%1)\n\t"
+ "movaps %%xmm8, 0x80(%1)\n\t"
+ "movaps %%xmm9, 0x90(%1)\n\t"
+ "movaps %%xmm10, 0xa0(%1)\n\t"
+ "movaps %%xmm11, 0xb0(%1)\n\t"
+ "movaps %%xmm12, 0xc0(%1)\n\t"
+ "movaps %%xmm13, 0xd0(%1)\n\t"
+ "movaps %%xmm14, 0xe0(%1)\n\t"
+ "movaps %%xmm15, 0xf0(%1)\n\t"
+ : : "r" (from), "r" (to) : "memory");
+
+ from += 256;
+ to += 256;
+ }
+
+ goto trailer;
+
+unaligned:
+ /*
+ * copy in 256 Byte portions unaligned
+ */
+ for (i = 0; i < (len & ~0xff); i += 256) {
+ asm volatile(
+ "movups 0x0(%0), %%xmm0\n\t"
+ "movups 0x10(%0), %%xmm1\n\t"
+ "movups 0x20(%0), %%xmm2\n\t"
+ "movups 0x30(%0), %%xmm3\n\t"
+ "movups 0x40(%0), %%xmm4\n\t"
+ "movups 0x50(%0), %%xmm5\n\t"
+ "movups 0x60(%0), %%xmm6\n\t"
+ "movups 0x70(%0), %%xmm7\n\t"
+ "movups 0x80(%0), %%xmm8\n\t"
+ "movups 0x90(%0), %%xmm9\n\t"
+ "movups 0xa0(%0), %%xmm10\n\t"
+ "movups 0xb0(%0), %%xmm11\n\t"
+ "movups 0xc0(%0), %%xmm12\n\t"
+ "movups 0xd0(%0), %%xmm13\n\t"
+ "movups 0xe0(%0), %%xmm14\n\t"
+ "movups 0xf0(%0), %%xmm15\n\t"
+
+ "movups %%xmm0, 0x0(%1)\n\t"
+ "movups %%xmm1, 0x10(%1)\n\t"
+ "movups %%xmm2, 0x20(%1)\n\t"
+ "movups %%xmm3, 0x30(%1)\n\t"
+ "movups %%xmm4, 0x40(%1)\n\t"
+ "movups %%xmm5, 0x50(%1)\n\t"
+ "movups %%xmm6, 0x60(%1)\n\t"
+ "movups %%xmm7, 0x70(%1)\n\t"
+ "movups %%xmm8, 0x80(%1)\n\t"
+ "movups %%xmm9, 0x90(%1)\n\t"
+ "movups %%xmm10, 0xa0(%1)\n\t"
+ "movups %%xmm11, 0xb0(%1)\n\t"
+ "movups %%xmm12, 0xc0(%1)\n\t"
+ "movups %%xmm13, 0xd0(%1)\n\t"
+ "movups %%xmm14, 0xe0(%1)\n\t"
+ "movups %%xmm15, 0xf0(%1)\n\t"
+ : : "r" (from), "r" (to) : "memory");
+
+ from += 256;
+ to += 256;
+ }
+
+trailer:
+ __memcpy(to, from, len & 0xff);
+
+ kernel_fpu_end();
+
+ return p;
+}
+EXPORT_SYMBOL_GPL(__sse_memcpy);
--
1.7.6.134.gcf13f6
--
Regards/Gruss,
Boris.
[-- Attachment #2: kernel_build.sizes --]
[-- Type: text/plain, Size: 925 bytes --]
Bytes Count
===== =====
0 5447
1 3850
2 16255
3 11113
4 68870
5 4256
6 30433
7 19188
8 50490
9 5999
10 78275
11 5628
12 6870
13 7371
14 4742
15 4911
16 143835
17 14096
18 1573
19 13603
20 424321
21 741
22 584
23 450
24 472
25 685
26 367
27 365
28 333
29 301
30 300
31 269
32 489
33 272
34 266
35 220
36 239
37 209
38 249
39 235
40 207
41 181
42 150
43 98
44 194
45 66
46 62
47 52
48 67226
49 138
50 171
51 26
52 20
53 12
54 15
55 4
56 13
57 8
58 6
59 6
60 115
61 10
62 5
63 12
64 67353
65 6
66 2363
67 9
68 11
69 6
70 5
71 6
72 10
73 4
74 9
75 8
76 4
77 6
78 3
79 4
80 3
81 4
82 4
83 4
84 4
85 8
86 6
87 2
88 3
89 2
90 2
91 1
92 9
93 1
94 2
96 2
97 2
98 3
100 2
102 1
104 1
105 1
106 1
107 2
109 1
110 1
111 1
112 1
113 2
115 2
117 1
118 1
119 1
120 14
127 1
128 1
130 1
131 2
134 2
137 1
144 100092
149 1
151 1
153 1
158 1
185 1
217 4
224 3
225 3
227 3
244 1
254 5
255 13
256 21708
512 21746
848 12907
1920 36536
2048 21708
* Re: x86 memcpy performance
2011-08-14 9:59 ` Borislav Petkov
@ 2011-08-14 11:13 ` Denys Vlasenko
2011-08-14 12:40 ` Borislav Petkov
2011-08-16 2:34 ` Valdis.Kletnieks
1 sibling, 1 reply; 40+ messages in thread
From: Denys Vlasenko @ 2011-08-14 11:13 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov
On Sunday 14 August 2011 11:59, Borislav Petkov wrote:
> Here's the SSE memcpy version I got so far, I haven't wired in the
> proper CPU feature detection yet because we want to run more benchmarks
> like netperf and stuff to see whether we see any positive results there.
>
> The SYSTEM_RUNNING check is to take care of early boot situations where
> we can't handle FPU exceptions but we use memcpy. There's an aligned and
> misaligned variant which should handle any buffers and sizes although
> I've set the SSE memcpy threshold at 512 Bytes buffersize the least to
> cover context save/restore somewhat.
>
> Comments are much appreciated! :-)
>
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
>
> #define __HAVE_ARCH_MEMCPY 1
> #ifndef CONFIG_KMEMCHECK
> +extern void *__memcpy(void *to, const void *from, size_t len);
> +extern void *__sse_memcpy(void *to, const void *from, size_t len);
> #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
> -extern void *memcpy(void *to, const void *from, size_t len);
> +#define memcpy(dst, src, len) \
> +({ \
> + size_t __len = (len); \
> + void *__ret; \
> + if (__len >= 512) \
> + __ret = __sse_memcpy((dst), (src), __len); \
> + else \
> + __ret = __memcpy((dst), (src), __len); \
> + __ret; \
> +})
Please, no. Do not inline every memcpy invocation.
This is pure bloat (considering how many memcpy calls there are)
and it doesn't even win anything in speed, since there will be
a function call either way.
Put the __len >= 512 check inside your memcpy instead.
You may do the check inline if you know that __len is constant:
if (__builtin_constant_p(__len) && __len >= 512) ...
because in this case gcc will evaluate it at compile time.
--
vda
* Re: x86 memcpy performance
2011-08-14 11:13 ` Denys Vlasenko
@ 2011-08-14 12:40 ` Borislav Petkov
2011-08-15 13:27 ` melwyn lobo
2011-08-15 13:44 ` Denys Vlasenko
0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-14 12:40 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov
On Sun, Aug 14, 2011 at 01:13:56PM +0200, Denys Vlasenko wrote:
> On Sunday 14 August 2011 11:59, Borislav Petkov wrote:
> > Here's the SSE memcpy version I got so far, I haven't wired in the
> > proper CPU feature detection yet because we want to run more benchmarks
> > like netperf and stuff to see whether we see any positive results there.
> >
> > The SYSTEM_RUNNING check is to take care of early boot situations where
> > we can't handle FPU exceptions but we use memcpy. There's an aligned and
> > misaligned variant which should handle any buffers and sizes although
> > I've set the SSE memcpy threshold at 512 Bytes buffersize the least to
> > cover context save/restore somewhat.
> >
> > Comments are much appreciated! :-)
> >
> > --- a/arch/x86/include/asm/string_64.h
> > +++ b/arch/x86/include/asm/string_64.h
> > @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
> >
> > #define __HAVE_ARCH_MEMCPY 1
> > #ifndef CONFIG_KMEMCHECK
> > +extern void *__memcpy(void *to, const void *from, size_t len);
> > +extern void *__sse_memcpy(void *to, const void *from, size_t len);
> > #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
> > -extern void *memcpy(void *to, const void *from, size_t len);
> > +#define memcpy(dst, src, len) \
> > +({ \
> > + size_t __len = (len); \
> > + void *__ret; \
> > + if (__len >= 512) \
> > + __ret = __sse_memcpy((dst), (src), __len); \
> > + else \
> > + __ret = __memcpy((dst), (src), __len); \
> > + __ret; \
> > +})
>
> Please, no. Do not inline every memcpy invocation.
> This is pure bloat (considering how many memcpy calls there are)
> and it doesn't even win anything in speed, since there will be
> a function call either way.
> Put the __len >= 512 check inside your memcpy instead.
In the __len < 512 case, this would actually cause two function calls:
first the __sse_memcpy one and then the __memcpy one.
> You may do the check if you know that __len is constant:
> if (__builtin_constant_p(__len) && __len >= 512) ...
> because in this case gcc will evaluate it at compile-time.
That could justify the bloat at least partially.
Actually, I had a version which sticks sse_memcpy code into memcpy_64.S
and that would save us both the function call and the bloat. I might
return to that one if it turns out that SSE memcpy makes sense for the
kernel.
Thanks.
--
Regards/Gruss,
Boris.
* Re: x86 memcpy performance
2011-08-14 12:40 ` Borislav Petkov
@ 2011-08-15 13:27 ` melwyn lobo
2011-08-15 13:44 ` Denys Vlasenko
1 sibling, 0 replies; 40+ messages in thread
From: melwyn lobo @ 2011-08-15 13:27 UTC (permalink / raw)
To: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra, borislav.petkov
Hi,
Was on a vacation for last two days. Thanks for the good insights into
the issue.
Ingo, unfortunately the data we have is on a soon to be released
platform and strictly confidential at this stage.
Boris, thanks for the patch. On seeing your patch:
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+ unsigned long src = (unsigned long)from;
+ unsigned long dst = (unsigned long)to;
+ void *p = to;
+ int i;
+
+ if (in_interrupt())
+ return __memcpy(to, from, len);
So what is the reason we cannot use sse_memcpy in interrupt context?
(FPU registers not saved?)
My question is still not answered. There are 3 versions of memcpy in the kernel:
***********************************arch/x86/include/asm/string_32.h******************************
179 #ifndef CONFIG_KMEMCHECK
180
181 #if (__GNUC__ >= 4)
182 #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
183 #else
184 #define memcpy(t, f, n) \
185 (__builtin_constant_p((n)) \
186 ? __constant_memcpy((t), (f), (n)) \
187 : __memcpy((t), (f), (n)))
188 #endif
189 #else
190 /*
191 * kmemcheck becomes very happy if we use the REP instructions
unconditionally,
192 * because it means that we know both memory operands in advance.
193 */
194 #define memcpy(t, f, n) __memcpy((t), (f), (n))
195 #endif
196
197
****************************************************************************************.
I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy()) as this
is valid only for AMD and not for the Atom Z5xx series.
That leaves __memcpy, __constant_memcpy, and __builtin_memcpy.
I have a hunch we were using __builtin_memcpy by default, because my
GCC version is >= 4 and CONFIG_KMEMCHECK is not defined.
Can someone confirm which of these three is used with i386_defconfig?
And again, with i386_defconfig, which workloads give the best results
with the default implementation?
thanks,
M.
* Re: x86 memcpy performance
2011-08-14 12:40 ` Borislav Petkov
2011-08-15 13:27 ` melwyn lobo
@ 2011-08-15 13:44 ` Denys Vlasenko
1 sibling, 0 replies; 40+ messages in thread
From: Denys Vlasenko @ 2011-08-15 13:44 UTC (permalink / raw)
To: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra, borislav.petkov
On Sun, Aug 14, 2011 at 2:40 PM, Borislav Petkov <bp@alien8.de> wrote:
>> > + if (__len >= 512) \
>> > + __ret = __sse_memcpy((dst), (src), __len); \
>> > + else \
>> > + __ret = __memcpy((dst), (src), __len); \
>> > + __ret; \
>> > +})
>>
>> Please, no. Do not inline every memcpy invocation.
>> This is pure bloat (considering how many memcpy calls there are)
>> and it doesn't even win anything in speed, since there will be
>> a function call either way.
>> Put the __len >= 512 check inside your memcpy instead.
>
> In the __len < 512 case, this would actually cause two function calls:
> first the __sse_memcpy one and then the __memcpy one.
You didn't notice the "else".
>> You may do the check if you know that __len is constant:
>> if (__builtin_constant_p(__len) && __len >= 512) ...
>> because in this case gcc will evaluate it at compile-time.
>
> That could justify the bloat at least partially.
There will be no bloat in this case.
--
vda
* Re: x86 memcpy performance
2011-08-14 9:59 ` Borislav Petkov
2011-08-14 11:13 ` Denys Vlasenko
@ 2011-08-16 2:34 ` Valdis.Kletnieks
2011-08-16 12:16 ` Borislav Petkov
1 sibling, 1 reply; 40+ messages in thread
From: Valdis.Kletnieks @ 2011-08-16 2:34 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov
[-- Attachment #1: Type: text/plain, Size: 1109 bytes --]
On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
> Benchmarking with 10000 iterations, average results:
> size XM MM speedup
> 119 540.58 449.491 0.8314969419
> 12273 2307.86 4042.88 1.751787902
> 13924 2431.8 4224.48 1.737184756
> 14335 2469.4 4218.82 1.708440514
> 15018 2675.67 1904.07 0.711622886
> 16374 2989.75 5296.26 1.771470902
> 24564 4262.15 7696.86 1.805863077
> 27852 4362.53 3347.72 0.7673805572
> 28672 5122.8 7113.14 1.388524413
> 30033 4874.62 8740.04 1.792967931
The numbers for 15018 and 27852 are *way* odd for the MM case. I don't
feel really good about this until we understand what happened in those
two cases.
Also, anytime I see "10000 iterations", I ask myself whether the
benchmark rig took proper note of hot/cold-cache issues. That *may*
explain the two oddball results we see above, but not knowing more about
how it was benched, it's hard to say.
[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]
* Re: x86 memcpy performance
2011-08-16 2:34 ` Valdis.Kletnieks
@ 2011-08-16 12:16 ` Borislav Petkov
2011-09-01 15:15 ` Maarten Lankhorst
0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-16 12:16 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Borislav Petkov, Ingo Molnar, melwyn lobo, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra
[-- Attachment #1: Type: text/plain, Size: 2448 bytes --]
On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>
> > Benchmarking with 10000 iterations, average results:
> > size XM MM speedup
> > 119 540.58 449.491 0.8314969419
>
> > 12273 2307.86 4042.88 1.751787902
> > 13924 2431.8 4224.48 1.737184756
> > 14335 2469.4 4218.82 1.708440514
> > 15018 2675.67 1904.07 0.711622886
> > 16374 2989.75 5296.26 1.771470902
> > 24564 4262.15 7696.86 1.805863077
> > 27852 4362.53 3347.72 0.7673805572
> > 28672 5122.8 7113.14 1.388524413
> > 30033 4874.62 8740.04 1.792967931
>
> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
> really good about this till we understand what happened for those two cases.
Yep.
> Also, anytime I see "10000 iterations", I ask myself if the benchmark
> rigging took proper note of hot/cold cache issues. That *may* explain
> the two oddball results we see above - but not knowing more about how
> it was benched, it's hard to say.
Yeah, the more scrutiny this gets the better. So I've cleaned up my
setup and have attached it.
xm_mem.c does the benchmarking and in bench_memcpy() there's the
sse_memcpy call which is the SSE memcpy implementation using inline asm.
It looks like gcc produces pretty crappy code here because if I replace
the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
same function but in pure asm - I get much better numbers, sometimes
even over 2x. It all depends on the alignment of the buffers though.
Also, those numbers don't include the context saving/restoring which the
kernel does for us.
7491 1509.89 2346.94 1.554378381
8170 2166.81 2857.78 1.318890326
12277 2659.03 4179.31 1.571744176
13907 2571.24 4125.7 1.604558427
14319 2638.74 5799.67 2.19789466 <----
14993 2752.42 4413.85 1.603625603
16371 3479.11 5562.65 1.59887055
So please take a look and let me know what you think.
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
[-- Attachment #2: sse_memcpy.tar.bz2 --]
[-- Type: application/octet-stream, Size: 3508 bytes --]
* Re: x86 memcpy performance
2011-08-16 12:16 ` Borislav Petkov
@ 2011-09-01 15:15 ` Maarten Lankhorst
2011-09-01 16:18 ` Linus Torvalds
2011-12-05 12:54 ` melwyn lobo
0 siblings, 2 replies; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-01 15:15 UTC (permalink / raw)
To: Borislav Petkov
Cc: Valdis.Kletnieks, Borislav Petkov, Ingo Molnar, melwyn lobo,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra
[-- Attachment #1: Type: text/plain, Size: 3418 bytes --]
Hey,
2011/8/16 Borislav Petkov <bp@amd64.org>:
> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>
>> > Benchmarking with 10000 iterations, average results:
>> > size XM MM speedup
>> > 119 540.58 449.491 0.8314969419
>>
>> > 12273 2307.86 4042.88 1.751787902
>> > 13924 2431.8 4224.48 1.737184756
>> > 14335 2469.4 4218.82 1.708440514
>> > 15018 2675.67 1904.07 0.711622886
>> > 16374 2989.75 5296.26 1.771470902
>> > 24564 4262.15 7696.86 1.805863077
>> > 27852 4362.53 3347.72 0.7673805572
>> > 28672 5122.8 7113.14 1.388524413
>> > 30033 4874.62 8740.04 1.792967931
>>
>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>> really good about this till we understand what happened for those two cases.
>
> Yep.
>
>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>> rigging took proper note of hot/cold cache issues. That *may* explain
>> the two oddball results we see above - but not knowing more about how
>> it was benched, it's hard to say.
>
> Yeah, the more scrutiny this gets the better. So I've cleaned up my
> setup and have attached it.
>
> xm_mem.c does the benchmarking and in bench_memcpy() there's the
> sse_memcpy call which is the SSE memcpy implementation using inline asm.
> It looks like gcc produces pretty crappy code here because if I replace
> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
> same function but in pure asm - I get much better numbers, sometimes
> even over 2x. It all depends on the alignment of the buffers though.
> Also, those numbers don't include the context saving/restoring which the
> kernel does for us.
>
> 7491 1509.89 2346.94 1.554378381
> 8170 2166.81 2857.78 1.318890326
> 12277 2659.03 4179.31 1.571744176
> 13907 2571.24 4125.7 1.604558427
> 14319 2638.74 5799.67 2.19789466 <----
> 14993 2752.42 4413.85 1.603625603
> 16371 3479.11 5562.65 1.59887055
This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
and I finally figured out why. I also extended the test to an optimized avx memcpy,
but I think the kernel memcpy will always win in the aligned case.
Those numbers you posted aren't quite right, it seems; they depend a lot on the
alignment. For example, if both buffers are aligned to 64 bytes relative to each
other, kernel memcpy beats avx memcpy on my machine.
I replaced the malloc calls with memalign(65536, size + 256) so I could toy
around with the alignments a little. This explains why, for some sizes, kernel
memcpy was faster than sse memcpy in the test results you had.
When (src & 63) == (dst & 63), it seems that kernel memcpy always wins;
otherwise avx memcpy might.
If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned compared to each other.
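To make that condition easy to reproduce, here is a minimal userspace sketch of the buffer setup (the helper names are mine; 65536 and 256 mirror the memalign call above):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a 64 KiB-aligned buffer with some headroom, then misalign the
 * returned pointer by 'off' bytes (0..255) to steer the alignment.
 * Note: the base pointer is not kept, so this sketch leaks it; a real
 * test harness would remember it so the buffer can be free()d. */
static void *buf_with_offset(size_t size, size_t off)
{
    void *base = NULL;
    if (posix_memalign(&base, 65536, size + 256))
        return NULL;
    return (char *)base + off;
}

/* The condition under which kernel memcpy ("rep movs") tends to win:
 * source and destination share the same offset within a 64-byte line. */
static int same_line_phase(const void *src, const void *dst)
{
    return (((uintptr_t)src ^ (uintptr_t)dst) & 63) == 0;
}
```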
Cheers,
Maarten
---
Attached: my modified version of the sse memcpy you posted.
I changed it a bit, and used avx, but some of the other changes might
be better for your sse memcpy too.
[-- Attachment #2: ym_memcpy.txt --]
[-- Type: text/plain, Size: 2668 bytes --]
/*
* ym_memcpy - AVX version of memcpy
*
* Input:
* rdi destination
* rsi source
* rdx count
*
* Output:
* rax original destination
*/
.globl ym_memcpy
.type ym_memcpy, @function
ym_memcpy:
mov %rdi, %rax
/* Target align */
movzbq %dil, %rcx
negb %cl
andb $0x1f, %cl
subq %rcx, %rdx
rep movsb
movq %rdx, %rcx
andq $0x1ff, %rdx
shrq $9, %rcx
jz .trailer
movb %sil, %r8b
andb $0x1f, %r8b
test %r8b, %r8b
jz .repeat_a
.align 32
.repeat_ua:
vmovups 0x0(%rsi), %ymm0
vmovups 0x20(%rsi), %ymm1
vmovups 0x40(%rsi), %ymm2
vmovups 0x60(%rsi), %ymm3
vmovups 0x80(%rsi), %ymm4
vmovups 0xa0(%rsi), %ymm5
vmovups 0xc0(%rsi), %ymm6
vmovups 0xe0(%rsi), %ymm7
vmovups 0x100(%rsi), %ymm8
vmovups 0x120(%rsi), %ymm9
vmovups 0x140(%rsi), %ymm10
vmovups 0x160(%rsi), %ymm11
vmovups 0x180(%rsi), %ymm12
vmovups 0x1a0(%rsi), %ymm13
vmovups 0x1c0(%rsi), %ymm14
vmovups 0x1e0(%rsi), %ymm15
vmovaps %ymm0, 0x0(%rdi)
vmovaps %ymm1, 0x20(%rdi)
vmovaps %ymm2, 0x40(%rdi)
vmovaps %ymm3, 0x60(%rdi)
vmovaps %ymm4, 0x80(%rdi)
vmovaps %ymm5, 0xa0(%rdi)
vmovaps %ymm6, 0xc0(%rdi)
vmovaps %ymm7, 0xe0(%rdi)
vmovaps %ymm8, 0x100(%rdi)
vmovaps %ymm9, 0x120(%rdi)
vmovaps %ymm10, 0x140(%rdi)
vmovaps %ymm11, 0x160(%rdi)
vmovaps %ymm12, 0x180(%rdi)
vmovaps %ymm13, 0x1a0(%rdi)
vmovaps %ymm14, 0x1c0(%rdi)
vmovaps %ymm15, 0x1e0(%rdi)
/* advance pointers */
addq $0x200, %rsi
addq $0x200, %rdi
subq $1, %rcx
jnz .repeat_ua
jz .trailer
.align 32
.repeat_a:
prefetchnta 0x80(%rsi)
prefetchnta 0x100(%rsi)
prefetchnta 0x180(%rsi)
vmovaps 0x0(%rsi), %ymm0
vmovaps 0x20(%rsi), %ymm1
vmovaps 0x40(%rsi), %ymm2
vmovaps 0x60(%rsi), %ymm3
vmovaps 0x80(%rsi), %ymm4
vmovaps 0xa0(%rsi), %ymm5
vmovaps 0xc0(%rsi), %ymm6
vmovaps 0xe0(%rsi), %ymm7
vmovaps 0x100(%rsi), %ymm8
vmovaps 0x120(%rsi), %ymm9
vmovaps 0x140(%rsi), %ymm10
vmovaps 0x160(%rsi), %ymm11
vmovaps 0x180(%rsi), %ymm12
vmovaps 0x1a0(%rsi), %ymm13
vmovaps 0x1c0(%rsi), %ymm14
vmovaps 0x1e0(%rsi), %ymm15
vmovaps %ymm0, 0x0(%rdi)
vmovaps %ymm1, 0x20(%rdi)
vmovaps %ymm2, 0x40(%rdi)
vmovaps %ymm3, 0x60(%rdi)
vmovaps %ymm4, 0x80(%rdi)
vmovaps %ymm5, 0xa0(%rdi)
vmovaps %ymm6, 0xc0(%rdi)
vmovaps %ymm7, 0xe0(%rdi)
vmovaps %ymm8, 0x100(%rdi)
vmovaps %ymm9, 0x120(%rdi)
vmovaps %ymm10, 0x140(%rdi)
vmovaps %ymm11, 0x160(%rdi)
vmovaps %ymm12, 0x180(%rdi)
vmovaps %ymm13, 0x1a0(%rdi)
vmovaps %ymm14, 0x1c0(%rdi)
vmovaps %ymm15, 0x1e0(%rdi)
/* advance pointers */
addq $0x200, %rsi
addq $0x200, %rdi
subq $1, %rcx
jnz .repeat_a
.align 32
.trailer:
movq %rdx, %rcx
shrq $3, %rcx
rep; movsq
movq %rdx, %rcx
andq $0x7, %rcx
rep; movsb
retq
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-09-01 15:15 ` Maarten Lankhorst
@ 2011-09-01 16:18 ` Linus Torvalds
2011-09-08 8:35 ` Borislav Petkov
2011-12-05 12:54 ` melwyn lobo
1 sibling, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2011-09-01 16:18 UTC (permalink / raw)
To: Maarten Lankhorst
Cc: Borislav Petkov, Valdis.Kletnieks, Borislav Petkov, Ingo Molnar,
melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner,
Peter Zijlstra
On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
<m.b.lankhorst@gmail.com> wrote:
>
> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> and I finally figured out why. I also extended the test to an optimized avx memcpy,
> but I think the kernel memcpy will always win in the aligned case.
"rep movs" is generally optimized in microcode on most modern Intel
CPU's for some easyish cases, and it will outperform just about
anything.
Atom is a notable exception, but if you expect performance on any
general loads from Atom, you need to get your head examined. Atom is a
disaster for anything but tuned loops.
The "easyish cases" depend on microarchitecture. They are improving,
so long-term "rep movs" is the best way regardless, but for most
current ones it's something like "source aligned to 8 bytes *and*
source and destination are equal "mod 64"".
And that's true in a lot of common situations. It's true for the page
copy, for example, and it's often true for big user "read()/write()"
calls (but "often" may not be "often enough" - high-performance
userland should strive to align read/write buffers to 64 bytes, for
example).
Many other cases of "memcpy()" are the fairly small, constant-sized
ones, where the optimal strategy tends to be "move words by hand".
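As an illustration of that last point (my sketch, not kernel code): for a small, constant size the compiler already lowers memcpy to plain word moves, which is why an explicit version is just loads and stores.

```c
#include <stdint.h>
#include <string.h>

/* A fixed 16-byte copy as two 8-byte loads and two 8-byte stores;
 * modern compilers turn each memcpy call here into a single mov. */
static inline void copy16(void *dst, const void *src)
{
    uint64_t lo, hi;
    memcpy(&lo, src, 8);
    memcpy(&hi, (const char *)src + 8, 8);
    memcpy(dst, &lo, 8);
    memcpy((char *)dst + 8, &hi, 8);
}
```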
Linus
* Re: x86 memcpy performance
2011-09-01 16:18 ` Linus Torvalds
@ 2011-09-08 8:35 ` Borislav Petkov
2011-09-08 10:58 ` Maarten Lankhorst
0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-09-08 8:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks,
Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Peter Zijlstra
On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
> <m.b.lankhorst@gmail.com> wrote:
> >
> > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> > and I finally figured out why. I also extended the test to an optimized avx memcpy,
> > but I think the kernel memcpy will always win in the aligned case.
>
> "rep movs" is generally optimized in microcode on most modern Intel
> CPU's for some easyish cases, and it will outperform just about
> anything.
>
> Atom is a notable exception, but if you expect performance on any
> general loads from Atom, you need to get your head examined. Atom is a
> disaster for anything but tuned loops.
>
> The "easyish cases" depend on microarchitecture. They are improving,
> so long-term "rep movs" is the best way regardless, but for most
> current ones it's something like "source aligned to 8 bytes *and*
> source and destination are equal "mod 64"".
>
> And that's true in a lot of common situations. It's true for the page
> copy, for example, and it's often true for big user "read()/write()"
> calls (but "often" may not be "often enough" - high-performance
> userland should strive to align read/write buffers to 64 bytes, for
> example).
>
> Many other cases of "memcpy()" are the fairly small, constant-sized
> ones, where the optimal strategy tends to be "move words by hand".
Yeah,
this probably makes enabling SSE memcpy in the kernel a task
with diminishing returns. There are also the additional costs of
saving/restoring FPU context in the kernel, which eat into any SSE
speedup.
And then there's the additional I$ pressure, because "rep movs" is
much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
smallest (two-byte) instructions I could use - in the AVX case they can
get up to 4 bytes of length with the VEX prefix and the additional SIB,
size-override, etc. fields.
Oh, and then there's copy_*_user, which also does fault handling, and
replacing that with an SSE version of memcpy could get quite hairy quite
fast.
Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
when I get the time, to see whether it still makes sense at all.
Thanks.
--
Regards/Gruss,
Boris.
* Re: x86 memcpy performance
2011-09-08 8:35 ` Borislav Petkov
@ 2011-09-08 10:58 ` Maarten Lankhorst
2011-09-09 8:14 ` Borislav Petkov
0 siblings, 1 reply; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-08 10:58 UTC (permalink / raw)
To: Borislav Petkov, Linus Torvalds, Borislav Petkov,
Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Peter Zijlstra
[-- Attachment #1: Type: text/plain, Size: 3330 bytes --]
On 09/08/2011 10:35 AM, Borislav Petkov wrote:
> On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
>> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
>> <m.b.lankhorst@gmail.com> wrote:
>>> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
>>> and I finally figured out why. I also extended the test to an optimized avx memcpy,
>>> but I think the kernel memcpy will always win in the aligned case.
>> "rep movs" is generally optimized in microcode on most modern Intel
>> CPU's for some easyish cases, and it will outperform just about
>> anything.
>>
>> Atom is a notable exception, but if you expect performance on any
>> general loads from Atom, you need to get your head examined. Atom is a
>> disaster for anything but tuned loops.
>>
>> The "easyish cases" depend on microarchitecture. They are improving,
>> so long-term "rep movs" is the best way regardless, but for most
>> current ones it's something like "source aligned to 8 bytes *and*
>> source and destination are equal "mod 64"".
>>
>> And that's true in a lot of common situations. It's true for the page
>> copy, for example, and it's often true for big user "read()/write()"
>> calls (but "often" may not be "often enough" - high-performance
>> userland should strive to align read/write buffers to 64 bytes, for
>> example).
>>
>> Many other cases of "memcpy()" are the fairly small, constant-sized
>> ones, where the optimal strategy tends to be "move words by hand".
> Yeah,
>
> this probably makes enabling SSE memcpy in the kernel a task
> with diminishing returns. There are also the additional costs of
> saving/restoring FPU context in the kernel which eat off from any SSE
> speedup.
>
> And then there's the additional I$ pressure because "rep movs" is
> much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
> smallest (two-byte) instructions I could use - in the AVX case they can
> get up to 4 Bytes of length with the VEX prefix and the additional SIB,
> size override, etc. fields.
>
> Oh, and then there's copy_*_user which also does fault handling and
> replacing that with a SSE version of memcpy could get quite hairy quite
> fast.
>
> Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
> when I get the time to see whether it still makes sense, at all.
>
I have changed your sse memcpy to test various alignments with
source/destination offsets instead of random ones; from that you can
see that you don't really get a speedup at all. It seems to be more
a case of 'kernel memcpy is significantly slower with some alignments'
than 'avx memcpy is just that much faster'.
For example, 3754 bytes with src misalignment 4 and target misalignment 20
takes 1185 units with avx memcpy, but 1480 units with kernel memcpy.
The modified testcase is attached. I did some optimizations in avx memcpy,
but I fear I may be missing something: when I tried to put it in the kernel, it
complained about sata errors I had never seen before, so I immediately went for
the power button to prevent more errors. Fortunately it only corrupted some
kernel object files, and btrfs threw checksum errors. :)
All in all I think testing in userspace is safer; you might want to run it on an
idle cpu with schedtool, with a high fifo priority, and set the cpufreq
governor to performance.
~Maarten
[-- Attachment #2: memcpy.tar.gz --]
[-- Type: application/x-gzip, Size: 4352 bytes --]
* Re: x86 memcpy performance
2011-09-08 10:58 ` Maarten Lankhorst
@ 2011-09-09 8:14 ` Borislav Petkov
2011-09-09 10:12 ` Maarten Lankhorst
2011-09-09 14:39 ` Linus Torvalds
0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-09-09 8:14 UTC (permalink / raw)
To: Maarten Lankhorst
Cc: Linus Torvalds, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar,
melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner,
Peter Zijlstra
On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
> I have changed your sse memcpy to test various alignments with
> source/destination offsets instead of random, from that you can
> see that you don't really get a speedup at all. It seems to be more
> a case of 'kernel memcpy is significantly slower with some alignments',
> than 'avx memcpy is just that much faster'.
>
> For example 3754 with src misalignment 4 and target misalignment 20
> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy
Right, so the idea is to check whether with the bigger buffer sizes
(and misaligned, although this should not be that often the case in
the kernel) the SSE version would outperform a "rep movs" with ucode
optimizations not kicking in.
With your version modified back to SSE memcpy (don't have an AVX box
right now) I get on an AMD F10h:
...
16384(12/40) 4756.24 7867.74 1.654192552
16384(40/12) 5067.81 6068.71 1.197500008
16384(12/44) 4341.3 8474.96 1.952172387
16384(44/12) 4277.13 7107.64 1.661777347
16384(12/48) 4989.16 7964.54 1.596369011
16384(48/12) 4644.94 6499.5 1.399264281
...
which looks like pretty nice numbers to me. I can't say whether there
ever is a 16K buffer we copy in the kernel, but if there were... <16K
buffers also show up to a 1.5x speedup, so I'd say it's a uarch thing.
As I said, it would be best to put it in the kernel and run a bunch of
benchmarks...
> The modified testcase is attached, I did some optimizations in avx
> memcpy, but I fear I may be missing something, when I tried to put it
> in the kernel, it complained about sata errors I never had before,
> so I immediately went for the power button to prevent more errors,
> fortunately it only corrupted some kernel object files, and btrfs
> threw checksum errors. :)
Well, your version should do something similar to what _mmx_memcpy does:
save FPU state and not execute in IRQ context.
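A sketch of what that guard looks like, modeled on _mmx_memcpy() in arch/x86/lib/mmx_32.c; this is kernel-style pseudocode, not a tested patch, and the vector copy loop itself is elided:

```c
/* Hedged sketch: an SSE/AVX memcpy must bracket all vector use like
 * this, or it corrupts the interrupted task's FPU state. */
void *sse_memcpy(void *to, const void *from, size_t len)
{
	if (in_interrupt())		/* no FPU use from IRQ context */
		return memcpy(to, from, len);

	kernel_fpu_begin();		/* saves FPU state, disables preemption */
	/* ... vmovaps/vmovups copy loop as in the attached ym_memcpy ... */
	kernel_fpu_end();		/* restores FPU state */
	return to;
}
```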
> All in all I think testing in userspace is safer, you might want to
> run it on an idle cpu with schedtool, with a high fifo priority, and
> set cpufreq governor to performance.
No, you need a generic system with default settings - otherwise it is
blatant benchmark lying :-)
--
Regards/Gruss,
Boris.
* Re: x86 memcpy performance
2011-09-09 8:14 ` Borislav Petkov
@ 2011-09-09 10:12 ` Maarten Lankhorst
2011-09-09 11:23 ` Maarten Lankhorst
2011-09-09 14:39 ` Linus Torvalds
1 sibling, 1 reply; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-09 10:12 UTC (permalink / raw)
To: Borislav Petkov, Linus Torvalds, Borislav Petkov,
Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Peter Zijlstra
Hey,
On 09/09/2011 10:14 AM, Borislav Petkov wrote:
> On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
>> I have changed your sse memcpy to test various alignments with
>> source/destination offsets instead of random, from that you can
>> see that you don't really get a speedup at all. It seems to be more
>> a case of 'kernel memcpy is significantly slower with some alignments',
>> than 'avx memcpy is just that much faster'.
>>
>> For example 3754 with src misalignment 4 and target misalignment 20
>> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy
> Right, so the idea is to check whether with the bigger buffer sizes
> (and misaligned, although this should not be that often the case in
> the kernel) the SSE version would outperform a "rep movs" with ucode
> optimizations not kicking in.
>
> With your version modified back to SSE memcpy (don't have an AVX box
> right now) I get on an AMD F10h:
>
> ...
> 16384(12/40) 4756.24 7867.74 1.654192552
> 16384(40/12) 5067.81 6068.71 1.197500008
> 16384(12/44) 4341.3 8474.96 1.952172387
> 16384(44/12) 4277.13 7107.64 1.661777347
> 16384(12/48) 4989.16 7964.54 1.596369011
> 16384(48/12) 4644.94 6499.5 1.399264281
> ...
>
> which looks like pretty nice numbers to me. I can't say whether there
> ever is 16K buffer we copy in the kernel but if there were... But <16K
> buffers also show up to 1.5x speedup. So I'd say it's a uarch thing.
> As I said, best it would be to put it in the kernel and run a bunch of
> benchmarks...
I think for bigger memcpys it might make sense to demand stricter
alignment. What are your numbers for (0/0)? In my case it seems
that kernel memcpy is always faster there. In fact, it seems any copy
where src&63 == dst&63 is generally faster with kernel memcpy.
Patching my tree to WARN_ON_ONCE for when this condition isn't true, I get the following warnings:
WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()
The most persistent ones appear to be btrfs's *_extent_buffer helpers;
they get the most warnings on my system. Apart from that, on my
system there's not much to gain, since the alignment is already
close to optimal.
My ext4 /home doesn't throw warnings, so I'd gain the most
by figuring out whether I could improve btrfs/extent_io.c in some way.
The patch for triggering those warnings is below, change to WARN_ON
if you want to see which one happens the most for you.
I was pleasantly surprised though.
>> The modified testcase is attached, I did some optimizations in avx
>> memcpy, but I fear I may be missing something, when I tried to put it
>> in the kernel, it complained about sata errors I never had before,
>> so I immediately went for the power button to prevent more errors,
>> fortunately it only corrupted some kernel object files, and btrfs
>> threw checksum errors. :)
> Well, your version should do something similar to what _mmx_memcpy does:
> save FPU state and not execute in IRQ context.
>
>> All in all I think testing in userspace is safer, you might want to
>> run it on an idle cpu with schedtool, with a high fifo priority, and
>> set cpufreq governor to performance.
> No, you need a generic system with default settings - otherwise it is
> blatant benchmark lying :-)
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..77180bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -30,6 +30,14 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
#ifndef CONFIG_KMEMCHECK
#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len) \
+({ \
+ size_t __len = (len); \
+ const void *__src = (src); \
+ void *__dst = (dst); \
+ WARN_ON_ONCE(__len > 1024 && (((long)__src & 63) != ((long)__dst & 63))); \
+ memcpy(__dst, __src, __len); \
+})
#else
extern void *__memcpy(void *to, const void *from, size_t len);
#define memcpy(dst, src, len) \
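For anyone who wants to try the same instrumentation outside the kernel, a userspace analogue of the check in the patch above (the names are mine, not from the patch) counts mismatched large copies instead of WARN-ing:

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many copies over 1 KiB have mismatched 64-byte phase
 * between source and destination, mirroring the WARN_ON_ONCE above. */
static unsigned long mismatched_large_copies;

static void check_copy_alignment(const void *dst, const void *src, size_t len)
{
    if (len > 1024 && (((uintptr_t)src & 63) != ((uintptr_t)dst & 63)))
        mismatched_large_copies++;
}
```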
* Re: x86 memcpy performance
2011-09-09 10:12 ` Maarten Lankhorst
@ 2011-09-09 11:23 ` Maarten Lankhorst
2011-09-09 13:42 ` Borislav Petkov
0 siblings, 1 reply; 40+ messages in thread
From: Maarten Lankhorst @ 2011-09-09 11:23 UTC (permalink / raw)
To: Borislav Petkov, Linus Torvalds, Borislav Petkov,
Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Peter Zijlstra
Hey just a followup on btrfs,
On 09/09/2011 12:12 PM, Maarten Lankhorst wrote:
> Hey,
>
> On 09/09/2011 10:14 AM, Borislav Petkov wrote:
>> On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
>>> I have changed your sse memcpy to test various alignments with
>>> source/destination offsets instead of random, from that you can
>>> see that you don't really get a speedup at all. It seems to be more
>>> a case of 'kernel memcpy is significantly slower with some alignments',
>>> than 'avx memcpy is just that much faster'.
>>>
>>> For example 3754 with src misalignment 4 and target misalignment 20
>>> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy
>> Right, so the idea is to check whether with the bigger buffer sizes
>> (and misaligned, although this should not be that often the case in
>> the kernel) the SSE version would outperform a "rep movs" with ucode
>> optimizations not kicking in.
>>
>> With your version modified back to SSE memcpy (don't have an AVX box
>> right now) I get on an AMD F10h:
>>
>> ...
>> 16384(12/40) 4756.24 7867.74 1.654192552
>> 16384(40/12) 5067.81 6068.71 1.197500008
>> 16384(12/44) 4341.3 8474.96 1.952172387
>> 16384(44/12) 4277.13 7107.64 1.661777347
>> 16384(12/48) 4989.16 7964.54 1.596369011
>> 16384(48/12) 4644.94 6499.5 1.399264281
>> ...
>>
>> which looks like pretty nice numbers to me. I can't say whether there
>> ever is 16K buffer we copy in the kernel but if there were... But <16K
>> buffers also show up to 1.5x speedup. So I'd say it's a uarch thing.
>> As I said, best it would be to put it in the kernel and run a bunch of
>> benchmarks...
> I think for bigger memcpy's it might make sense to demand stricter
> alignment. What are your numbers for (0/0) ? In my case it seems
> that kernel memcpy is always faster for that. In fact, it seems
> src&63 == dst&63 is generally faster with kernel memcpy.
>
> Patching my tree to WARN_ON_ONCE for when this condition isn't true, I get the following warnings:
>
> WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
> WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
> WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
> WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
> WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
> WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
> WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
> WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
> WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
> WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
> WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
> WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
> WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
> WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
> WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
> WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
> WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
> WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()
>
> The most persistent one appears to be the btrfs' *_extent_buffer,
> it gets the most warnings on my system. Apart from that on my
> system there's not much to gain, since the alignment is already
> close to optimal.
>
> My ext4 /home doesn't throw warnings, so I'd gain the most
> by figuring out if I could improve btrfs/extent_io.c in some way.
> The patch for triggering those warnings is below, change to WARN_ON
> if you want to see which one happens the most for you.
>
> I was pleasantly surprised though.
The btrfs call that happens far more often than all the others is
read_extent_buffer, but most of those copies are page-aligned on the
destination. This means that for me avx memcpy might be 10% slower or 10%
faster, depending on the specific source alignment, so avx memcpy wouldn't
help much.
This specific call happened far more than any of the other memcpy usages, and
when I skip the check for page-aligned destinations, most of the warnings
are gone.
In short: I don't think I can get a speedup by using avx memcpy in-kernel.
YMMV; if it does speed things up for you, I'd love to see concrete numbers, and
not only for the worst case but for the common aligned cases too, or some
concrete numbers showing that misaligned copies happen a lot for you.
~Maarten
* Re: x86 memcpy performance
2011-09-09 11:23 ` Maarten Lankhorst
@ 2011-09-09 13:42 ` Borislav Petkov
0 siblings, 0 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-09-09 13:42 UTC (permalink / raw)
To: Maarten Lankhorst
Cc: Linus Torvalds, Valdis.Kletnieks, Ingo Molnar, melwyn lobo,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra
[-- Attachment #1: Type: text/plain, Size: 2343 bytes --]
On Fri, Sep 09, 2011 at 01:23:09PM +0200, Maarten Lankhorst wrote:
> This specific one happened far more than any of the other memcpy usages, and
> ignoring the check when destination is page aligned, most of them are gone.
>
> In short: I don't think I can get a speedup by using avx memcpy in-kernel.
>
> YMMV, if it does speed up for you, I'd love to see concrete numbers. And not only worst
> case, but for the common aligned cases too. Or some concrete numbers that misaligned
> happens a lot for you.
Actually,
assuming alignment matters, I'd need to redo the trace_printk run I did
initially on buffer sizes:
http://marc.info/?l=linux-kernel&m=131331602309340 (kernel_build.sizes attached)
to get a more sensible grasp on the alignment of kernel buffers along
with their sizes and to see whether we're doing a lot of unaligned large
buffer copies in the kernel. I seriously doubt that, though; we should
be doing everything pagewise anyway, so...
Concerning numbers, I ran your version again and sorted the output by
speedup. The highest scores are:
30037(12/44) 5566.4 12797.2 2.299011642
28672(12/44) 5512.97 12588.7 2.283467991
30037(28/60) 5610.34 12732.7 2.269502799
27852(12/44) 5398.36 12242.4 2.267803859
30037(4/36) 5585.02 12598.6 2.25578257
28672(28/60) 5499.11 12317.5 2.239914033
27852(28/60) 5349.78 11918.9 2.227919527
27852(20/52) 5335.92 11750.7 2.202186795
24576(12/44) 4991.37 10987.2 2.201247446
and this is pretty cool. Here are the (0/0) cases:
8192(0/0) 2627.82 3038.43 1.156255766
12288(0/0) 3116.62 3675.98 1.179475031
13926(0/0) 3330.04 4077.08 1.224334839
14336(0/0) 3377.95 4067.24 1.204055286
15018(0/0) 3465.3 4215.3 1.216430725
16384(0/0) 3623.33 4442.38 1.226050715
24576(0/0) 4629.53 6021.81 1.300737559
27852(0/0) 5026.69 6619.26 1.316823133
28672(0/0) 5157.73 6831.39 1.324495749
30037(0/0) 5322.01 6978.36 1.3112261
It is not 2x anymore but still.
Anyway, looking at the buffer sizes, they're rather ridiculous and even
if we get them in some workload, they won't repeat n times per second to
be relevant. So we'll see...
Thanks.
--
Regards/Gruss,
Boris.
[-- Attachment #2: kernel_build.sizes --]
[-- Type: text/plain, Size: 925 bytes --]
Bytes Count
===== =====
0 5447
1 3850
2 16255
3 11113
4 68870
5 4256
6 30433
7 19188
8 50490
9 5999
10 78275
11 5628
12 6870
13 7371
14 4742
15 4911
16 143835
17 14096
18 1573
19 13603
20 424321
21 741
22 584
23 450
24 472
25 685
26 367
27 365
28 333
29 301
30 300
31 269
32 489
33 272
34 266
35 220
36 239
37 209
38 249
39 235
40 207
41 181
42 150
43 98
44 194
45 66
46 62
47 52
48 67226
49 138
50 171
51 26
52 20
53 12
54 15
55 4
56 13
57 8
58 6
59 6
60 115
61 10
62 5
63 12
64 67353
65 6
66 2363
67 9
68 11
69 6
70 5
71 6
72 10
73 4
74 9
75 8
76 4
77 6
78 3
79 4
80 3
81 4
82 4
83 4
84 4
85 8
86 6
87 2
88 3
89 2
90 2
91 1
92 9
93 1
94 2
96 2
97 2
98 3
100 2
102 1
104 1
105 1
106 1
107 2
109 1
110 1
111 1
112 1
113 2
115 2
117 1
118 1
119 1
120 14
127 1
128 1
130 1
131 2
134 2
137 1
144 100092
149 1
151 1
153 1
158 1
185 1
217 4
224 3
225 3
227 3
244 1
254 5
255 13
256 21708
512 21746
848 12907
1920 36536
2048 21708
* Re: x86 memcpy performance
2011-09-09 8:14 ` Borislav Petkov
2011-09-09 10:12 ` Maarten Lankhorst
@ 2011-09-09 14:39 ` Linus Torvalds
2011-09-09 15:35 ` Borislav Petkov
1 sibling, 1 reply; 40+ messages in thread
From: Linus Torvalds @ 2011-09-09 14:39 UTC (permalink / raw)
To: Borislav Petkov, Maarten Lankhorst, Linus Torvalds,
Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra
On Fri, Sep 9, 2011 at 1:14 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> which looks like pretty nice numbers to me. I can't say whether there
> ever is 16K buffer we copy in the kernel but if there were...
Kernel memcpy's are basically almost always smaller than a page size,
because that tends to be the fundamental allocation size.
Yes, there are exceptions that copy into big vmalloc'ed buffers, but
they don't tend to matter. Things like module loading etc.
Linus
* Re: x86 memcpy performance
2011-09-09 14:39 ` Linus Torvalds
@ 2011-09-09 15:35 ` Borislav Petkov
2011-12-05 12:20 ` melwyn lobo
0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-09-09 15:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks,
Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Peter Zijlstra
On Fri, Sep 09, 2011 at 07:39:18AM -0700, Linus Torvalds wrote:
> Kernel memcpy's are basically almost always smaller than a page size,
> because that tends to be the fundamental allocation size.
Yeah, this is what my trace of a kernel build showed too:
Bytes Count
===== =====
...
224 3
225 3
227 3
244 1
254 5
255 13
256 21708
512 21746
848 12907
1920 36536
2048 21708
OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for
example when shuffling network buffers to/from userspace. Converting
those to SSE memcpy might not be as easy as memcpy itself, though.
> Yes, there are exceptions that copy into big vmalloc'ed buffers, but
> they don't tend to matter. Things like module loading etc.
Too small a number of repetitions to matter, yes.
--
Regards/Gruss,
Boris.
* Re: x86 memcpy performance
2011-09-09 15:35 ` Borislav Petkov
@ 2011-12-05 12:20 ` melwyn lobo
0 siblings, 0 replies; 40+ messages in thread
From: melwyn lobo @ 2011-12-05 12:20 UTC (permalink / raw)
To: Borislav Petkov
Cc: Linus Torvalds, Maarten Lankhorst, Borislav Petkov,
Valdis.Kletnieks, Ingo Molnar, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Peter Zijlstra
The driver has a loop of memcpy the source and destination addresses
based on a runtime computed value and confuses the compiler on the
alignement.
So instead of generating neat 32 bit memcpy, gcc generates "rep movsb"
Example code snippet:
src = (char *)kmap(bo->pages[idx]);
src += offset;
memcpy(des, src, len);
By using SSSE3 only for memcpys longer than 1K bytes (for
my driver, typical lengths are 2K of metadata from SRAM to DDR), I think
the overhead of the FPU save and restore can be forgiven.
Will SSSE3 work for unaligned pointers as well? If it doesn't, I have
been lucky for the past 6 months :)
On Fri, Sep 9, 2011 at 9:05 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Sep 09, 2011 at 07:39:18AM -0700, Linus Torvalds wrote:
>> Kernel memcpy's are basically almost always smaller than a page size,
>> because that tends to be the fundamental allocation size.
>
> Yeah, this is what my trace of a kernel build showed too:
>
> Bytes Count
> ===== =====
>
> ...
>
> 224 3
> 225 3
> 227 3
> 244 1
> 254 5
> 255 13
> 256 21708
> 512 21746
> 848 12907
> 1920 36536
> 2048 21708
>
> OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for
> example when shuffling network buffers to/from userspace. Converting
> those to SSE memcpy might not be as easy as memcpy itself, though.
>
>> Yes, there are exceptions that copy into big vmalloc'ed buffers, but
>> they don't tend to matter. Things like module loading etc.
>
> Too small a number of repetitions to matter, yes.
>
> --
> Regards/Gruss,
> Boris.
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-09-01 15:15 ` Maarten Lankhorst
2011-09-01 16:18 ` Linus Torvalds
@ 2011-12-05 12:54 ` melwyn lobo
2011-12-05 14:36 ` Alan Cox
1 sibling, 1 reply; 40+ messages in thread
From: melwyn lobo @ 2011-12-05 12:54 UTC (permalink / raw)
To: Maarten Lankhorst
Cc: Borislav Petkov, Valdis.Kletnieks, Borislav Petkov, Ingo Molnar,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra
Will AVX work on Intel Atom? I guess not. Then is this not the
time for having architecture-dependent definitions for basic
CPU-intensive tasks?
On Thu, Sep 1, 2011 at 8:45 PM, Maarten Lankhorst
<m.b.lankhorst@gmail.com> wrote:
> Hey,
>
> 2011/8/16 Borislav Petkov <bp@amd64.org>:
>> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote:
>>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>>
>>> > Benchmarking with 10000 iterations, average results:
>>> > size XM MM speedup
>>> > 119 540.58 449.491 0.8314969419
>>>
>>> > 12273 2307.86 4042.88 1.751787902
>>> > 13924 2431.8 4224.48 1.737184756
>>> > 14335 2469.4 4218.82 1.708440514
>>> > 15018 2675.67 1904.07 0.711622886
>>> > 16374 2989.75 5296.26 1.771470902
>>> > 24564 4262.15 7696.86 1.805863077
>>> > 27852 4362.53 3347.72 0.7673805572
>>> > 28672 5122.8 7113.14 1.388524413
>>> > 30033 4874.62 8740.04 1.792967931
>>>
>>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>>> really good about this till we understand what happened for those two cases.
>>
>> Yep.
>>
>>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>>> rigging took proper note of hot/cold cache issues. That *may* explain
>>> the two oddball results we see above - but not knowing more about how
>>> it was benched, it's hard to say.
>>
>> Yeah, the more scrutiny this gets the better. So I've cleaned up my
>> setup and have attached it.
>>
>> xm_mem.c does the benchmarking and in bench_memcpy() there's the
>> sse_memcpy call which is the SSE memcpy implementation using inline asm.
>> It looks like gcc produces pretty crappy code here because if I replace
>> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
>> same function but in pure asm - I get much better numbers, sometimes
>> even over 2x. It all depends on the alignment of the buffers though.
>> Also, those numbers don't include the context saving/restoring which the
>> kernel does for us.
>>
>> 7491 1509.89 2346.94 1.554378381
>> 8170 2166.81 2857.78 1.318890326
>> 12277 2659.03 4179.31 1.571744176
>> 13907 2571.24 4125.7 1.604558427
>> 14319 2638.74 5799.67 2.19789466 <----
>> 14993 2752.42 4413.85 1.603625603
>> 16371 3479.11 5562.65 1.59887055
>
> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> and I finally figured out why. I also extended the test to an optimized avx memcpy,
> but I think the kernel memcpy will always win in the aligned case.
>
> Those numbers you posted aren't right it seems. It depends a lot on the alignment,
> for example if both are aligned to 64 relative to each other,
> kernel memcpy will win from avx memcpy on my machine.
>
> I replaced the malloc calls with memalign(65536, size + 256) so I could toy
> around with the alignments a little. This explains why for some sizes, kernel
> memcpy was faster than sse memcpy in the test results you had.
> When (src & 63 == dst & 63), it seems that kernel memcpy always wins, otherwise
> avx memcpy might.
>
> If you want to speed up memcpy, I think your best bet is to find out why it's
> so much slower when src and dst aren't 64-byte aligned compared to each other.
>
> Cheers,
> Maarten
>
> ---
> Attached: my modified version of the sse memcpy you posted.
>
> I changed it a bit, and used avx, but some of the other changes might
> be better for your sse memcpy too.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-12-05 12:54 ` melwyn lobo
@ 2011-12-05 14:36 ` Alan Cox
0 siblings, 0 replies; 40+ messages in thread
From: Alan Cox @ 2011-12-05 14:36 UTC (permalink / raw)
To: melwyn lobo
Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks,
Borislav Petkov, Ingo Molnar, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Linus Torvalds, Peter Zijlstra
> Will AVX work on Intel ATOM. I guess not. Then is this now not the
> time for having architecture dependant definitions for basic cpu
> intensive tasks
It's pretty much a necessity if you want to fine tune some of this.
> > If you want to speed up memcpy, I think your best bet is to find out why it's
> > so much slower when src and dst aren't 64-byte aligned compared to each other.
rep mov on most x86 processors is an extremely optimised path. The 64
byte alignment behaviour is to be expected given the processor cache line
size.
Alan
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-16 7:19 ` melwyn lobo
@ 2011-08-16 7:43 ` Borislav Petkov
0 siblings, 0 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-16 7:43 UTC (permalink / raw)
To: melwyn lobo
Cc: Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov
On Tue, Aug 16, 2011 at 12:49:28PM +0530, melwyn lobo wrote:
> We would rather use the 32 bit patch. Have you already got a 32 bit
> patch.
Nope, only 64-bit for now, sorry.
> How can I use sse3 for 32 bit.
Well, OTTOMH, you have only 8 xmm regs in 32-bit instead of 16, which
should halve the performance of the 64-bit version in a perfect world.
However, we don't know how the performance of a 32-bit SSE memcpy
version behaves vs the gcc builtin one - that would require benchmarking
too.
But other than that, I don't see a problem with having a 32-bit version.
> I don't think you have submitted 64 bit patch in the mainline.
> Is there still work ongoing on this.
Yeah, we are currently benchmarking it to see whether it actually makes
sense to even have SSE memcpy in the kernel.
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 14:55 Borislav Petkov
2011-08-15 14:59 ` Andy Lutomirski
@ 2011-08-16 7:19 ` melwyn lobo
2011-08-16 7:43 ` Borislav Petkov
1 sibling, 1 reply; 40+ messages in thread
From: melwyn lobo @ 2011-08-16 7:19 UTC (permalink / raw)
To: Borislav Petkov
Cc: Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin,
Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov
> Yes, on 32-bit you're using the compiler-supplied version
> __builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
> and above. Reportedly, using __builtin_memcpy generates better code.
>
> Btw, my version of SSE memcpy is 64-bit only.
>
> --
> Regards/Gruss,
> Boris.
>
>
We would rather use the 32-bit patch. Have you already got a 32-bit
patch? How can I use SSE3 for 32-bit?
I don't think you have submitted the 64-bit patch to mainline.
Is there still work ongoing on this?
Regards,
Melwyn
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 20:05 ` Borislav Petkov
@ 2011-08-15 20:08 ` Andrew Lutomirski
0 siblings, 0 replies; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 20:08 UTC (permalink / raw)
To: Borislav Petkov, Andrew Lutomirski, melwyn lobo, Denys Vlasenko,
Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner,
Linus Torvalds, Peter Zijlstra, borislav.petkov
On Mon, Aug 15, 2011 at 4:05 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote:
>> > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
>> > shows reasonable speedup there, we might need to make those work too.
>>
>> I'm a little surprised that SSE beats fast string operations, but I
>> guess benchmarking always wins.
>
> If by fast string operations you mean X86_FEATURE_ERMS, then that's
> Intel-only and that actually would need to be benchmarked separately.
> Currently, I see speedup for large(r) buffers only vs rep; movsq. But I
> dunno about rep; movsb's enhanced rep string tricks Intel does.
I meant X86_FEATURE_REP_GOOD. (That may also be Intel-only, but it
sounds like rep;movsq might move whole cachelines on cpus at least a
few generations back.) I don't know if any ERMS cpus exist yet.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 19:11 ` Andrew Lutomirski
@ 2011-08-15 20:05 ` Borislav Petkov
2011-08-15 20:08 ` Andrew Lutomirski
0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 20:05 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote:
> > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> > shows reasonable speedup there, we might need to make those work too.
>
> I'm a little surprised that SSE beats fast string operations, but I
> guess benchmarking always wins.
If by fast string operations you mean X86_FEATURE_ERMS, then that's
Intel-only and that actually would need to be benchmarked separately.
Currently, I see speedup for large(r) buffers only vs rep; movsq. But I
dunno about rep; movsb's enhanced rep string tricks Intel does.
> Yes. But we don't nest that much, and the save/restore isn't all that
> expensive. And we don't have to save/restore unless kernel entries
> nest and both entries try to use kernel_fpu_begin at the same time.
Yep.
> This whole project may take awhile. The code in there is a
> poorly-documented mess, even after Hans' cleanups. (It's a lot worse
> without them, though.)
Oh yeah, this code could use lotsa scrubbing :)
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 18:49 ` Borislav Petkov
@ 2011-08-15 19:11 ` Andrew Lutomirski
2011-08-15 20:05 ` Borislav Petkov
0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 19:11 UTC (permalink / raw)
To: Borislav Petkov
Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On Mon, Aug 15, 2011 at 2:49 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>>> Or, if we want to use SSE stuff in the kernel, we might think of
>>> allocating its own FPU context(s) and handle those...
>>
>> I'm thinking of having a stack of FPU states to parallel irq stacks
>> and IST stacks.
>
> ... I'm guessing with the same nesting as hardirqs? Making FPU
> instructions usable in irq contexts too.
>
>> It gets a little hairy when code inside kernel_fpu_begin traps for a
>> non-irq non-IST reason, though.
>
> How does that happen? You're in the kernel with preemption disabled and
> TS cleared, what would cause the #NM? I think that if you need to switch
> context, you simply "push" the current FPU context, allocate a new one
> and clts as part of the FPU context switching, no?
Not #NM, but page faults can happen too (even just accessing vmalloc space).
>
>> Fortunately, those are rare and all of the EX_TABLE users could mark
>> xmm regs as clobbered (except for copy_from_user...).
>
> Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> shows reasonable speedup there, we might need to make those work too.
I'm a little surprised that SSE beats fast string operations, but I
guess benchmarking always wins.
>
>> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
>> extra FPU state can be per-cpu and not per-task.
>
> Yep.
>
>> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>>
>> The major speedup will come from saving state in kernel_fpu_begin but
>> not restoring it until the code in entry_??.S restores registers.
>
> But you'd need to save each kernel FPU state when nesting, no?
>
Yes. But we don't nest that much, and the save/restore isn't all that
expensive. And we don't have to save/restore unless kernel entries
nest and both entries try to use kernel_fpu_begin at the same time.
This whole project may take awhile. The code in there is a
poorly-documented mess, even after Hans' cleanups. (It's a lot worse
without them, though.)
--Andy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 18:35 ` Andrew Lutomirski
@ 2011-08-15 18:52 ` H. Peter Anvin
0 siblings, 0 replies; 40+ messages in thread
From: H. Peter Anvin @ 2011-08-15 18:52 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On 08/15/2011 11:35 AM, Andrew Lutomirski wrote:
>
> Are there any architecture-neutral users of this thing?
Look at the RAID-6 code, for example. It makes the various
architecture-specific codes look more similar.
-hpa
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 17:04 ` Andrew Lutomirski
@ 2011-08-15 18:49 ` Borislav Petkov
2011-08-15 19:11 ` Andrew Lutomirski
0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 18:49 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra, borislav.petkov
On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
>> This would obviate the need to muck with contexts but that could get
>> expensive wrt stack operations. The advantage is that I'm not dealing
>> with the whole FPU state but only with 16 XMM regs. I should probably
>> dust off that version again and retest.
>
> I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
> 80 ns and a full state save+restore is only ~60 ns.
> Without infrastructure changes, I don't think you can avoid the clts
> and stts.
Yeah, probably.
> You might be able to get away with turning off IRQs, reading CR0 to
> check TS, pushing XMM regs, and being very certain that you don't
> accidentally generate any VEX-coded instructions.
That's ok - I'm using movaps/movups. But, the problem is that I still
need to save FPU state if the task I'm interrupting has been using FPU
instructions. So, I can't get away without saving the context in which
case I don't need to save the XMM regs anyway.
>> Or, if we want to use SSE stuff in the kernel, we might think of
>> allocating its own FPU context(s) and handle those...
>
> I'm thinking of having a stack of FPU states to parallel irq stacks
> and IST stacks.
... I'm guessing with the same nesting as hardirqs? Making FPU
instructions usable in irq contexts too.
> It gets a little hairy when code inside kernel_fpu_begin traps for a
> non-irq non-IST reason, though.
How does that happen? You're in the kernel with preemption disabled and
TS cleared, what would cause the #NM? I think that if you need to switch
context, you simply "push" the current FPU context, allocate a new one
and clts as part of the FPU context switching, no?
> Fortunately, those are rare and all of the EX_TABLE users could mark
> xmm regs as clobbered (except for copy_from_user...).
Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
shows reasonable speedup there, we might need to make those work too.
> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
> extra FPU state can be per-cpu and not per-task.
Yep.
> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>
> The major speedup will come from saving state in kernel_fpu_begin but
> not restoring it until the code in entry_??.S restores registers.
But you'd need to save each kernel FPU state when nesting, no?
>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>
>> Well, do we want to use floating point instructions in the kernel?
>
> The only use I could find is in staging.
Exactly my point - I think we should do it only when it's really worth
the trouble.
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 18:26 ` H. Peter Anvin
@ 2011-08-15 18:35 ` Andrew Lutomirski
2011-08-15 18:52 ` H. Peter Anvin
0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 18:35 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On Mon, Aug 15, 2011 at 2:26 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/15/2011 09:58 AM, Andrew Lutomirski wrote:
>> On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>>>
>>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>>> 387 equivalent) could contain garbage.
>>>>
>>>
>>> Uh... no, it just means you have to initialize the settings. It's a
>>> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
>>
>> I prefer get_xstate / put_xstate, but this could rapidly devolve into
>> bikeshedding. :)
>>
>
> a) Quite.
>
> b) xstate is not architecture-neutral.
Are there any architecture-neutral users of this thing? If I were
writing generic code, I would expect:
kernel_fpu_begin();
foo *= 1.5;
kernel_fpu_end();
to work, but I would not expect:
kernel_fpu_begin();
use_xmm_registers();
kernel_fpu_end();
to make any sense.
Since the former does not actually work, I would hope that there is no
non-x86-specific user.
--Andy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 16:58 ` Andrew Lutomirski
@ 2011-08-15 18:26 ` H. Peter Anvin
2011-08-15 18:35 ` Andrew Lutomirski
0 siblings, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2011-08-15 18:26 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On 08/15/2011 09:58 AM, Andrew Lutomirski wrote:
> On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>>
>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>>
>>
>> Uh... no, it just means you have to initialize the settings. It's a
>> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
>
> I prefer get_xstate / put_xstate, but this could rapidly devolve into
> bikeshedding. :)
>
a) Quite.
b) xstate is not architecture-neutral.
-hpa
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 16:12 ` Borislav Petkov
@ 2011-08-15 17:04 ` Andrew Lutomirski
2011-08-15 18:49 ` Borislav Petkov
0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 17:04 UTC (permalink / raw)
To: Borislav Petkov
Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On Mon, Aug 15, 2011 at 12:12 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote:
>>> But still, irq_fpu_usable() still checks !in_interrupt() which means
>>> that we don't want to run SSE instructions in IRQ context. OTOH, we
>>> still are fine when running with CR0.TS. So what happens when we get an
>>> #NM as a result of executing an FPU instruction in an IRQ handler? We
>>> will have to do init_fpu() on the current task if the last hasn't used
>>> math yet and do the slab allocation of the FPU context area (I'm looking
>>> at math_state_restore, btw).
>>
>> IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in
>> an interrupt and TS=1, when we know that we're not in a
>> kernel_fpu_begin section, so it's safe to start one (and do clts).
>
> Doh, yes, I see it now. This way we save the math state of the current
> process if needed and "disable" #NM exceptions until kernel_fpu_end() by
> clearing CR0.TS, sure. Thanks.
>
>> IMO this code is not very good, and I plan to fix it sooner or later.
>
> Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
> You could probably reuse some bits from there. The patchset should be in
> tip/x86/xsave.
>
>> I want kernel_fpu_begin (or its equivalent*) to be very fast and
>> usable from any context whatsoever. Mucking with TS is slower than a
>> complete save and restore of YMM state.
>
> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
> This would obviate the need to muck with contexts but that could get
> expensive wrt stack operations. The advantage is that I'm not dealing
> with the whole FPU state but only with 16 XMM regs. I should probably
> dust off that version again and retest.
I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
80 ns and a full state save+restore is only ~60 ns. Without
infrastructure changes, I don't think you can avoid the clts and stts.
You might be able to get away with turning off IRQs, reading CR0 to
check TS, pushing XMM regs, and being very certain that you don't
accidentally generate any VEX-coded instructions.
>
> Or, if we want to use SSE stuff in the kernel, we might think of
> allocating its own FPU context(s) and handle those...
I'm thinking of having a stack of FPU states to parallel irq stacks
and IST stacks. It gets a little hairy when code inside
kernel_fpu_begin traps for a non-irq non-IST reason, though.
Fortunately, those are rare and all of the EX_TABLE users could mark
xmm regs as clobbered (except for copy_from_user...). Keeping
kernel_fpu_begin non-preemptable makes it less bad because the extra
FPU state can be per-cpu and not per-task.
This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
The major speedup will come from saving state in kernel_fpu_begin but
not restoring it until the code in entry_??.S restores registers.
>
>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
>
> Well, do we want to use floating point instructions in the kernel?
The only use I could find is in staging.
--Andy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 16:12 ` H. Peter Anvin
@ 2011-08-15 16:58 ` Andrew Lutomirski
2011-08-15 18:26 ` H. Peter Anvin
0 siblings, 1 reply; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 16:58 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>
>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
>>
>
> Uh... no, it just means you have to initialize the settings. It's a
> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
I prefer get_xstate / put_xstate, but this could rapidly devolve into
bikeshedding. :)
--Andy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 15:36 ` Andrew Lutomirski
2011-08-15 16:12 ` Borislav Petkov
@ 2011-08-15 16:12 ` H. Peter Anvin
2011-08-15 16:58 ` Andrew Lutomirski
1 sibling, 1 reply; 40+ messages in thread
From: H. Peter Anvin @ 2011-08-15 16:12 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>
> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.
>
Uh... no, it just means you have to initialize the settings. It's a
perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
-hpa
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 15:36 ` Andrew Lutomirski
@ 2011-08-15 16:12 ` Borislav Petkov
2011-08-15 17:04 ` Andrew Lutomirski
2011-08-15 16:12 ` H. Peter Anvin
1 sibling, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 16:12 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra, borislav.petkov
On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote:
>> But still, irq_fpu_usable() still checks !in_interrupt() which means
>> that we don't want to run SSE instructions in IRQ context. OTOH, we
>> still are fine when running with CR0.TS. So what happens when we get an
>> #NM as a result of executing an FPU instruction in an IRQ handler? We
>> will have to do init_fpu() on the current task if the last hasn't used
>> math yet and do the slab allocation of the FPU context area (I'm looking
>> at math_state_restore, btw).
>
> IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in
> an interrupt and TS=1, when we know that we're not in a
> kernel_fpu_begin section, so it's safe to start one (and do clts).
Doh, yes, I see it now. This way we save the math state of the current
process if needed and "disable" #NM exceptions until kernel_fpu_end() by
clearing CR0.TS, sure. Thanks.
> IMO this code is not very good, and I plan to fix it sooner or later.
Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
You could probably reuse some bits from there. The patchset should be in
tip/x86/xsave.
> I want kernel_fpu_begin (or its equivalent*) to be very fast and
> usable from any context whatsoever. Mucking with TS is slower than a
> complete save and restore of YMM state.
Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
This would obviate the need to muck with contexts but that could get
expensive wrt stack operations. The advantage is that I'm not dealing
with the whole FPU state but only with 16 XMM regs. I should probably
dust off that version again and retest.
Or, if we want to use SSE stuff in the kernel, we might think of
allocating its own FPU context(s) and handle those...
> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.
Well, do we want to use floating point instructions in the kernel?
Thanks.
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 15:29 ` Borislav Petkov
@ 2011-08-15 15:36 ` Andrew Lutomirski
2011-08-15 16:12 ` Borislav Petkov
2011-08-15 16:12 ` H. Peter Anvin
0 siblings, 2 replies; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 15:36 UTC (permalink / raw)
To: Borislav Petkov
Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On Mon, Aug 15, 2011 at 11:29 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote:
>>>> So what is the reason we cannot use sse_memcpy in interrupt context.
>>>> (fpu registers not saved ? )
>>>
>>> Because, AFAICT, when we handle an #NM exception while running
>>> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
>>> area, which in turn, can sleep. Then, we might get another IRQ while
>>> sleeping and we should be deadlocked.
>>>
>>> But let me stress on the "AFAICT" above, someone who actually knows the
>>> FPU code should correct me if I'm missing something.
>>
>> I don't think you ever get #NM as a result of kernel_fpu_begin, but you
>> can certainly have problems when kernel_fpu_begin nests by accident.
>> There's irq_fpu_usable() for this.
>>
>> (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)
>
> Oh I didn't know about irq_fpu_usable(), thanks.
>
> But still, irq_fpu_usable() still checks !in_interrupt() which means
> that we don't want to run SSE instructions in IRQ context. OTOH, we
> still are fine when running with CR0.TS. So what happens when we get an
> #NM as a result of executing an FPU instruction in an IRQ handler? We
> will have to do init_fpu() on the current task if the last hasn't used
> math yet and do the slab allocation of the FPU context area (I'm looking
> at math_state_restore, btw).
IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in
an interrupt and TS=1, then we know that we're not in a
kernel_fpu_begin section, so it's safe to start one (and do clts).
IMO this code is not very good, and I plan to fix it sooner or later.
I want kernel_fpu_begin (or its equivalent*) to be very fast and
usable from any context whatsoever. Mucking with TS is slower than a
complete save and restore of YMM state.
(*) kernel_fpu_begin is a bad name. It's only safe to use integer
instructions inside a kernel_fpu_begin section because MXCSR (and the
387 equivalent) could contain garbage.
--Andy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 14:59 ` Andy Lutomirski
@ 2011-08-15 15:29 ` Borislav Petkov
2011-08-15 15:36 ` Andrew Lutomirski
0 siblings, 1 reply; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 15:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra, borislav.petkov
On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote:
>>> So what is the reason we cannot use sse_memcpy in interrupt context.
>>> (fpu registers not saved ? )
>>
>> Because, AFAICT, when we handle an #NM exception while running
>> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
>> area which, in turn, can sleep. Then we might get another IRQ while
>> sleeping and we could deadlock.
>>
>> But let me stress the "AFAICT" above, someone who actually knows the
>> FPU code should correct me if I'm missing something.
>
> I don't think you ever get #NM as a result of kernel_fpu_begin, but you
> can certainly have problems when kernel_fpu_begin nests by accident.
> There's irq_fpu_usable() for this.
>
> (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)
Oh I didn't know about irq_fpu_usable(), thanks.
But irq_fpu_usable() still checks !in_interrupt(), which means
that we don't want to run SSE instructions in IRQ context. OTOH, we
are still fine when running with CR0.TS set. So what happens when we get an
#NM as a result of executing an FPU instruction in an IRQ handler? We
will have to do init_fpu() on the current task if the latter hasn't used
math yet and do the slab allocation of the FPU context area (I'm looking
at math_state_restore, btw).
Thanks.
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
2011-08-15 14:55 Borislav Petkov
@ 2011-08-15 14:59 ` Andy Lutomirski
2011-08-15 15:29 ` Borislav Petkov
2011-08-16 7:19 ` melwyn lobo
1 sibling, 1 reply; 40+ messages in thread
From: Andy Lutomirski @ 2011-08-15 14:59 UTC (permalink / raw)
To: Borislav Petkov
Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel,
H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra,
borislav.petkov
On 08/15/2011 10:55 AM, Borislav Petkov wrote:
> On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
>> Hi,
>> Was on vacation for the last two days. Thanks for the good insights into
>> the issue.
>> Ingo, unfortunately the data we have is on a soon to be released
>> platform and strictly confidential at this stage.
>>
>> Boris, thanks for the patch. On seeing your patch:
>> +void *__sse_memcpy(void *to, const void *from, size_t len)
>> +{
>> + unsigned long src = (unsigned long)from;
>> + unsigned long dst = (unsigned long)to;
>> + void *p = to;
>> + int i;
>> +
>> + if (in_interrupt())
>> + return __memcpy(to, from, len);
>> So what is the reason we cannot use sse_memcpy in interrupt context?
>> (FPU registers not saved?)
>
> Because, AFAICT, when we handle an #NM exception while running
> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
> area which, in turn, can sleep. Then we might get another IRQ while
> sleeping and we could deadlock.
>
> But let me stress the "AFAICT" above, someone who actually knows the
> FPU code should correct me if I'm missing something.
I don't think you ever get #NM as a result of kernel_fpu_begin, but you
can certainly have problems when kernel_fpu_begin nests by accident.
There's irq_fpu_usable() for this.
(irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)
--Andy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
@ 2011-08-15 14:55 Borislav Petkov
2011-08-15 14:59 ` Andy Lutomirski
2011-08-16 7:19 ` melwyn lobo
0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 14:55 UTC (permalink / raw)
To: melwyn lobo
Cc: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo,
linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
Peter Zijlstra, borislav.petkov
On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
> Hi,
> Was on vacation for the last two days. Thanks for the good insights into
> the issue.
> Ingo, unfortunately the data we have is on a soon to be released
> platform and strictly confidential at this stage.
>
> Boris, thanks for the patch. On seeing your patch:
> +void *__sse_memcpy(void *to, const void *from, size_t len)
> +{
> + unsigned long src = (unsigned long)from;
> + unsigned long dst = (unsigned long)to;
> + void *p = to;
> + int i;
> +
> + if (in_interrupt())
> + return __memcpy(to, from, len);
> So what is the reason we cannot use sse_memcpy in interrupt context?
> (FPU registers not saved?)
Because, AFAICT, when we handle an #NM exception while running
sse_memcpy in an IRQ handler, we might need to allocate FPU save state
area which, in turn, can sleep. Then we might get another IRQ while
sleeping and we could deadlock.
But let me stress the "AFAICT" above, someone who actually knows the
FPU code should correct me if I'm missing something.
> My question is still not answered. There are 3 versions of memcpy in
> the kernel:
>
> ***********************************arch/x86/include/asm/string_32.h******************************
> 179 #ifndef CONFIG_KMEMCHECK
> 180
> 181 #if (__GNUC__ >= 4)
> 182 #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
> 183 #else
> 184 #define memcpy(t, f, n) \
> 185 (__builtin_constant_p((n)) \
> 186 ? __constant_memcpy((t), (f), (n)) \
> 187 : __memcpy((t), (f), (n)))
> 188 #endif
> 189 #else
> 190 /*
> 191 * kmemcheck becomes very happy if we use the REP instructions unconditionally,
> 192 * because it means that we know both memory operands in advance.
> 193 */
> 194 #define memcpy(t, f, n) __memcpy((t), (f), (n))
> 195 #endif
> 196
> 197
> ****************************************************************************************.
> I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy()) as this
> is valid only for AMD and not for the Atom Z5xx series.
> This means __memcpy, __constant_memcpy, or __builtin_memcpy.
> I have a hunch that by default we were using __builtin_memcpy,
> because I see my GCC version is >= 4 and CONFIG_KMEMCHECK is
> not defined. Can someone confirm which of these 3 is used with
> i386_defconfig? Again, with i386_defconfig, which workloads provide the
> best results with the default implementation?
Yes, on 32-bit you're using the compiler-supplied version
__builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is version 4
or above. Reportedly, using __builtin_memcpy generates better code.
Btw, my version of SSE memcpy is 64-bit only.
--
Regards/Gruss,
Boris.
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2011-12-05 14:35 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-12 17:59 x86 memcpy performance melwyn lobo
2011-08-12 18:33 ` Andi Kleen
2011-08-12 19:52 ` Ingo Molnar
2011-08-14 9:59 ` Borislav Petkov
2011-08-14 11:13 ` Denys Vlasenko
2011-08-14 12:40 ` Borislav Petkov
2011-08-15 13:27 ` melwyn lobo
2011-08-15 13:44 ` Denys Vlasenko
2011-08-16 2:34 ` Valdis.Kletnieks
2011-08-16 12:16 ` Borislav Petkov
2011-09-01 15:15 ` Maarten Lankhorst
2011-09-01 16:18 ` Linus Torvalds
2011-09-08 8:35 ` Borislav Petkov
2011-09-08 10:58 ` Maarten Lankhorst
2011-09-09 8:14 ` Borislav Petkov
2011-09-09 10:12 ` Maarten Lankhorst
2011-09-09 11:23 ` Maarten Lankhorst
2011-09-09 13:42 ` Borislav Petkov
2011-09-09 14:39 ` Linus Torvalds
2011-09-09 15:35 ` Borislav Petkov
2011-12-05 12:20 ` melwyn lobo
2011-12-05 12:54 ` melwyn lobo
2011-12-05 14:36 ` Alan Cox
2011-08-15 14:55 Borislav Petkov
2011-08-15 14:59 ` Andy Lutomirski
2011-08-15 15:29 ` Borislav Petkov
2011-08-15 15:36 ` Andrew Lutomirski
2011-08-15 16:12 ` Borislav Petkov
2011-08-15 17:04 ` Andrew Lutomirski
2011-08-15 18:49 ` Borislav Petkov
2011-08-15 19:11 ` Andrew Lutomirski
2011-08-15 20:05 ` Borislav Petkov
2011-08-15 20:08 ` Andrew Lutomirski
2011-08-15 16:12 ` H. Peter Anvin
2011-08-15 16:58 ` Andrew Lutomirski
2011-08-15 18:26 ` H. Peter Anvin
2011-08-15 18:35 ` Andrew Lutomirski
2011-08-15 18:52 ` H. Peter Anvin
2011-08-16 7:19 ` melwyn lobo
2011-08-16 7:43 ` Borislav Petkov