* Re: x86 memcpy performance
@ 2011-08-15 14:55 Borislav Petkov
  2011-08-15 14:59 ` Andy Lutomirski
  2011-08-16  7:19 ` melwyn lobo
  0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 14:55 UTC
  To: melwyn lobo
  Cc: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo,
	linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds,
	Peter Zijlstra, borislav.petkov

On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
> Hi,
> I was on vacation for the last two days. Thanks for the good insights
> into the issue.
> Ingo, unfortunately the data we have is on a soon-to-be-released
> platform and is strictly confidential at this stage.
>
> Boris, thanks for the patch. On seeing your patch:
> +void *__sse_memcpy(void *to, const void *from, size_t len)
> +{
> +       unsigned long src = (unsigned long)from;
> +       unsigned long dst = (unsigned long)to;
> +       void *p = to;
> +       int i;
> +
> +       if (in_interrupt())
> +               return __memcpy(to, from, len);
> So what is the reason we cannot use sse_memcpy in interrupt context?
> (FPU registers not saved?)

Because, AFAICT, when we handle an #NM exception while running
sse_memcpy in an IRQ handler, we might need to allocate the FPU save
state area, and that allocation can sleep. Then we might get another
IRQ while sleeping, and we could deadlock.

But let me stress the "AFAICT" above: someone who actually knows the
FPU code should correct me if I'm missing something.
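
To make the shape of that fallback concrete, here is a minimal user-space
sketch. It is not the patch under discussion: `simd_forbidden` is a
hypothetical stand-in for in_interrupt(), `sketch_sse_memcpy` is an invented
name, and the SSE2 intrinsics replace whatever the real patch does. In the
kernel the SSE section would additionally have to be bracketed by
kernel_fpu_begin()/kernel_fpu_end(), which this sketch does not model.

```c
#include <stddef.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical stand-in for the kernel's in_interrupt(); set by the caller. */
static int simd_forbidden;

/*
 * Copy len bytes from 'from' to 'to'. When SIMD use is forbidden, fall
 * back to the plain copy right away; otherwise move 16 bytes at a time
 * with unaligned SSE2 loads/stores and let memcpy() handle the tail.
 */
static void *sketch_sse_memcpy(void *to, const void *from, size_t len)
{
	unsigned char *d = to;
	const unsigned char *s = from;

	if (simd_forbidden)
		return memcpy(to, from, len);

#ifdef __SSE2__
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);

		_mm_storeu_si128((__m128i *)d, v);
		s += 16;
		d += 16;
		len -= 16;
	}
#endif
	/* Remaining tail bytes (or the whole buffer if SSE2 is unavailable). */
	memcpy(d, s, len);
	return to;
}
```

The point is only that the decision is made once, at entry, before any
vector register is touched.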

> My question is still not answered. There are 3 versions of memcpy in
> kernel:
>
> *** arch/x86/include/asm/string_32.h ***
> #ifndef CONFIG_KMEMCHECK
>
> #if (__GNUC__ >= 4)
> #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
> #else
> #define memcpy(t, f, n)                         \
>         (__builtin_constant_p((n))              \
>          ? __constant_memcpy((t), (f), (n))     \
>          : __memcpy((t), (f), (n)))
> #endif
> #else
> /*
>  * kmemcheck becomes very happy if we use the REP instructions unconditionally,
>  * because it means that we know both memory operands in advance.
>  */
> #define memcpy(t, f, n) __memcpy((t), (f), (n))
> #endif
>
> ***
> I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy()) as this
> is valid only for AMD and not for the Atom Z5xx series.
> That leaves __memcpy, __constant_memcpy and __builtin_memcpy.
> I have a hunch that by default we were using __builtin_memcpy,
> because my GCC version is >= 4 and CONFIG_KMEMCHECK is not
> defined. Can someone confirm which of these 3 is used with
> i386_defconfig? Also, with i386_defconfig, which workloads give the
> best results with the default implementation?

Yes, on 32-bit you're using the compiler-supplied version
__builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is version 4
or above. Reportedly, using __builtin_memcpy generates better code.
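
For anyone who wants to watch the pre-gcc-4 branch of that macro pick a
path, here is a small user-space sketch. The `sketch_*` functions and the
two counters are hypothetical, added only so the selection is observable;
both helpers simply forward to __builtin_memcpy and say nothing about what
the kernel's __constant_memcpy or __memcpy actually do.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counters so the chosen path can be observed. */
static int constant_calls, variable_calls;

/* Stand-in for the path taken when the length is a compile-time constant. */
static void *sketch_constant_memcpy(void *t, const void *f, size_t n)
{
	constant_calls++;
	return __builtin_memcpy(t, f, n);
}

/* Stand-in for the path taken when the length is only known at run time. */
static void *sketch_variable_memcpy(void *t, const void *f, size_t n)
{
	variable_calls++;
	return __builtin_memcpy(t, f, n);
}

/* Same structure as the #else branch of the quoted header. */
#define sketch_memcpy(t, f, n)                          \
	(__builtin_constant_p((n))                      \
	 ? sketch_constant_memcpy((t), (f), (n))        \
	 : sketch_variable_memcpy((t), (f), (n)))
```

Calling sketch_memcpy(dst, src, 16) goes through the constant helper, while
a length read from a volatile variable goes through the other one.
__builtin_constant_p is a GCC extension (clang supports it too), so this
does not build with other compilers.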

Btw, my version of SSE memcpy is 64-bit only.

-- 
Regards/Gruss,
Boris.


* x86 memcpy performance
@ 2011-08-12 17:59 melwyn lobo
  2011-08-12 18:33 ` Andi Kleen
  2011-08-12 19:52 ` Ingo Molnar
  0 siblings, 2 replies; 40+ messages in thread
From: melwyn lobo @ 2011-08-12 17:59 UTC
  To: linux-kernel

Hi All,
Our video recorder application uses memcpy for every frame, about 2KB
of data per frame, on an Intel® Atom™ Z5xx processor.
With the default 2.6.35 kernel we got 19.6 fps. But it seems the
kernel's memcpy implementation is suboptimal, because when we replaced
it with an optimized one (using SSSE3; exact patches are currently
being finalized) we obtained 22 fps, a gain of 12.2%.
C0 residency also dropped from 75% to 67%, which means power benefits too.
My questions:
1. Is the kernel memcpy profiled for optimal performance?
2. Does the default kernel configuration for i386 include the best
memcpy implementation (AMD 3DNOW, __builtin_memcpy, etc.)?

Any suggestions or prior experience on this are welcome.
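
As a rough way to compare copy routines outside the kernel, something like
the following user-space sketch could be used. It is an assumption-laden
illustration, not the measurement behind the numbers above: `bench_copy`,
`gain_percent` and `FRAME_BYTES` are hypothetical names, timing uses the
coarse standard clock(), and a user-space loop says nothing about in-kernel
behaviour or cache effects in the real recorder pipeline.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define FRAME_BYTES 2048	/* "about 2KB of data per frame" */

/* Average CPU time in nanoseconds per FRAME_BYTES copy over 'iters' runs. */
static double bench_copy(void *(*copy_fn)(void *, const void *, size_t),
			 unsigned long iters)
{
	static unsigned char src[FRAME_BYTES], dst[FRAME_BYTES];
	volatile unsigned char sink = 0;
	unsigned long i;
	clock_t start, end;

	start = clock();
	for (i = 0; i < iters; i++) {
		src[0] = (unsigned char)i;	/* vary the input each round */
		copy_fn(dst, src, FRAME_BYTES);
		sink ^= dst[0];			/* keep the copy observable */
	}
	end = clock();

	return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}

/* Relative gain in percent when going from 'before' to 'after'. */
static double gain_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

bench_copy(memcpy, 1000000) would give a baseline to compare a candidate
routine against, and gain_percent(19.6, 22.0) reproduces the roughly 12.2%
figure quoted above.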

Thanks,
M.


end of thread, other threads:[~2011-12-05 14:35 UTC | newest]

Thread overview: 40+ messages
2011-08-15 14:55 x86 memcpy performance Borislav Petkov
2011-08-15 14:59 ` Andy Lutomirski
2011-08-15 15:29   ` Borislav Petkov
2011-08-15 15:36     ` Andrew Lutomirski
2011-08-15 16:12       ` Borislav Petkov
2011-08-15 17:04         ` Andrew Lutomirski
2011-08-15 18:49           ` Borislav Petkov
2011-08-15 19:11             ` Andrew Lutomirski
2011-08-15 20:05               ` Borislav Petkov
2011-08-15 20:08                 ` Andrew Lutomirski
2011-08-15 16:12       ` H. Peter Anvin
2011-08-15 16:58         ` Andrew Lutomirski
2011-08-15 18:26           ` H. Peter Anvin
2011-08-15 18:35             ` Andrew Lutomirski
2011-08-15 18:52               ` H. Peter Anvin
2011-08-16  7:19 ` melwyn lobo
2011-08-16  7:43   ` Borislav Petkov
  -- strict thread matches above, loose matches on Subject: below --
2011-08-12 17:59 melwyn lobo
2011-08-12 18:33 ` Andi Kleen
2011-08-12 19:52 ` Ingo Molnar
2011-08-14  9:59   ` Borislav Petkov
2011-08-14 11:13     ` Denys Vlasenko
2011-08-14 12:40       ` Borislav Petkov
2011-08-15 13:27         ` melwyn lobo
2011-08-15 13:44         ` Denys Vlasenko
2011-08-16  2:34     ` Valdis.Kletnieks
2011-08-16 12:16       ` Borislav Petkov
2011-09-01 15:15         ` Maarten Lankhorst
2011-09-01 16:18           ` Linus Torvalds
2011-09-08  8:35             ` Borislav Petkov
2011-09-08 10:58               ` Maarten Lankhorst
2011-09-09  8:14                 ` Borislav Petkov
2011-09-09 10:12                   ` Maarten Lankhorst
2011-09-09 11:23                     ` Maarten Lankhorst
2011-09-09 13:42                       ` Borislav Petkov
2011-09-09 14:39                   ` Linus Torvalds
2011-09-09 15:35                     ` Borislav Petkov
2011-12-05 12:20                       ` melwyn lobo
2011-12-05 12:54           ` melwyn lobo
2011-12-05 14:36             ` Alan Cox
