* Re: x86 memcpy performance
@ 2011-08-15 14:55 Borislav Petkov
  2011-08-15 14:59 ` Andy Lutomirski
  2011-08-16  7:19 ` melwyn lobo
  0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-15 14:55 UTC (permalink / raw)
  To: melwyn lobo
  Cc: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo, linux-kernel,
      H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
> Hi,
> Was on a vacation for last two days. Thanks for the good insights into
> the issue.
> Ingo, unfortunately the data we have is on a soon to be released
> platform and strictly confidential at this stage.
>
> Boris, thanks for the patch. On seeing your patch:
> +void *__sse_memcpy(void *to, const void *from, size_t len)
> +{
> +	unsigned long src = (unsigned long)from;
> +	unsigned long dst = (unsigned long)to;
> +	void *p = to;
> +	int i;
> +
> +	if (in_interrupt())
> +		return __memcpy(to, from, len)
> So what is the reason we cannot use sse_memcpy in interrupt context.
> (fpu registers not saved ? )

Because, AFAICT, when we handle an #NM exception while running
sse_memcpy in an IRQ handler, we might need to allocate the FPU save
state area which, in turn, can sleep. Then we might get another IRQ
while sleeping and we could deadlock.

But let me stress the "AFAICT" above - someone who actually knows the
FPU code should correct me if I'm missing something.

> My question is still not answered. There are 3 versions of memcpy in
> kernel:
>
> ***********************************arch/x86/include/asm/string_32.h******************************
> 179 #ifndef CONFIG_KMEMCHECK
> 180
> 181 #if (__GNUC__ >= 4)
> 182 #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
> 183 #else
> 184 #define memcpy(t, f, n) \
> 185 	(__builtin_constant_p((n)) \
> 186 	 ? __constant_memcpy((t), (f), (n)) \
> 187 	 : __memcpy((t), (f), (n)))
> 188 #endif
> 189 #else
> 190 /*
> 191  * kmemcheck becomes very happy if we use the REP instructions unconditionally,
> 192  * because it means that we know both memory operands in advance.
> 193  */
> 194 #define memcpy(t, f, n) __memcpy((t), (f), (n))
> 195 #endif
> 196
> 197
> ****************************************************************************************.
> I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy() ) as this
> is valid only for AMD and not for Atom Z5xx series.
> This means __memcpy, __constant_memcpy, __builtin_memcpy .
> I have a hunch by default we were using __builtin_memcpy.
> This is because I see my GCC version >=4 and CONFIG_KMEMCHECK
> not defined. Can someone confirm of these 3 which is used, with
> i386_defconfig. Again with i386_defconfig which workloads provide the
> best results with the default implementation.

Yes, on 32-bit you're using the compiler-supplied version
__builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
and above. Reportedly, using __builtin_memcpy generates better code.

Btw, my version of SSE memcpy is 64-bit only.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 14:55 x86 memcpy performance Borislav Petkov @ 2011-08-15 14:59 ` Andy Lutomirski 2011-08-15 15:29 ` Borislav Petkov 2011-08-16 7:19 ` melwyn lobo 1 sibling, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2011-08-15 14:59 UTC (permalink / raw) To: Borislav Petkov Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On 08/15/2011 10:55 AM, Borislav Petkov wrote: > On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote: >> Hi, >> Was on a vacation for last two days. Thanks for the good insights into >> the issue. >> Ingo, unfortunately the data we have is on a soon to be released >> platform and strictly confidential at this stage. >> >> Boris, thanks for the patch. On seeing your patch: >> +void *__sse_memcpy(void *to, const void *from, size_t len) >> +{ >> + unsigned long src = (unsigned long)from; >> + unsigned long dst = (unsigned long)to; >> + void *p = to; >> + int i; >> + >> + if (in_interrupt()) >> + return __memcpy(to, from, len) >> So what is the reason we cannot use sse_memcpy in interrupt context. >> (fpu registers not saved ? ) > > Because, AFAICT, when we handle an #NM exception while running > sse_memcpy in an IRQ handler, we might need to allocate FPU save state > area, which in turn, can sleep. Then, we might get another IRQ while > sleeping and we should be deadlocked. > > But let me stress on the "AFAICT" above, someone who actually knows the > FPU code should correct me if I'm missing something. I don't think you ever get #NM as a result of kernel_fpu_begin, but you can certainly have problems when kernel_fpu_begin nests by accident. There's irq_fpu_usable() for this. (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.) --Andy ^ permalink raw reply [flat|nested] 40+ messages in thread
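To make the constraint being discussed concrete, here is a minimal sketch of how such a routine would typically be guarded (illustrative only, not the posted patch; irq_fpu_usable(), kernel_fpu_begin() and kernel_fpu_end() are the existing x86 interfaces, while do_sse_copy() is a made-up placeholder for the XMM inner loop):

#include <linux/string.h>
#include <asm/i387.h>	/* kernel_fpu_begin/end, irq_fpu_usable */

void *sse_memcpy_guarded(void *to, const void *from, size_t len)
{
	/*
	 * Fall back to the plain copy whenever the FPU can't be touched:
	 * in an IRQ that interrupted a live FPU user, inside another
	 * kernel_fpu_begin() section, etc.
	 */
	if (!irq_fpu_usable())
		return memcpy(to, from, len);

	kernel_fpu_begin();		/* saves the task's FPU state if needed, clears CR0.TS */
	do_sse_copy(to, from, len);	/* placeholder: the XMM-based copy loop */
	kernel_fpu_end();		/* sets CR0.TS again; task state is restored lazily */

	return to;
}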
* Re: x86 memcpy performance 2011-08-15 14:59 ` Andy Lutomirski @ 2011-08-15 15:29 ` Borislav Petkov 2011-08-15 15:36 ` Andrew Lutomirski 0 siblings, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2011-08-15 15:29 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote: >>> So what is the reason we cannot use sse_memcpy in interrupt context. >>> (fpu registers not saved ? ) >> >> Because, AFAICT, when we handle an #NM exception while running >> sse_memcpy in an IRQ handler, we might need to allocate FPU save state >> area, which in turn, can sleep. Then, we might get another IRQ while >> sleeping and we should be deadlocked. >> >> But let me stress on the "AFAICT" above, someone who actually knows the >> FPU code should correct me if I'm missing something. > > I don't think you ever get #NM as a result of kernel_fpu_begin, but you > can certainly have problems when kernel_fpu_begin nests by accident. > There's irq_fpu_usable() for this. > > (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.) Oh I didn't know about irq_fpu_usable(), thanks. But still, irq_fpu_usable() still checks !in_interrupt() which means that we don't want to run SSE instructions in IRQ context. OTOH, we still are fine when running with CR0.TS. So what happens when we get an #NM as a result of executing an FPU instruction in an IRQ handler? We will have to do init_fpu() on the current task if the last hasn't used math yet and do the slab allocation of the FPU context area (I'm looking at math_state_restore, btw). Thanks. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
  2011-08-15 15:29 ` Borislav Petkov
@ 2011-08-15 15:36 ` Andrew Lutomirski
  2011-08-15 16:12 ` Borislav Petkov
  2011-08-15 16:12 ` H. Peter Anvin
  0 siblings, 2 replies; 40+ messages in thread
From: Andrew Lutomirski @ 2011-08-15 15:36 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin,
      Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

On Mon, Aug 15, 2011 at 11:29 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote:
>>>> So what is the reason we cannot use sse_memcpy in interrupt context.
>>>> (fpu registers not saved ? )
>>>
>>> Because, AFAICT, when we handle an #NM exception while running
>>> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
>>> area, which in turn, can sleep. Then, we might get another IRQ while
>>> sleeping and we should be deadlocked.
>>>
>>> But let me stress on the "AFAICT" above, someone who actually knows the
>>> FPU code should correct me if I'm missing something.
>>
>> I don't think you ever get #NM as a result of kernel_fpu_begin, but you
>> can certainly have problems when kernel_fpu_begin nests by accident.
>> There's irq_fpu_usable() for this.
>>
>> (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)
>
> Oh I didn't know about irq_fpu_usable(), thanks.
>
> But still, irq_fpu_usable() still checks !in_interrupt() which means
> that we don't want to run SSE instructions in IRQ context. OTOH, we
> still are fine when running with CR0.TS. So what happens when we get an
> #NM as a result of executing an FPU instruction in an IRQ handler? We
> will have to do init_fpu() on the current task if the last hasn't used
> math yet and do the slab allocation of the FPU context area (I'm looking
> at math_state_restore, btw).

IIRC kernel_fpu_begin does clts, so #NM won't happen.  But if we're in
an interrupt and TS=1, then we know that we're not in a
kernel_fpu_begin section, so it's safe to start one (and do clts).

IMO this code is not very good, and I plan to fix it sooner or later.
I want kernel_fpu_begin (or its equivalent*) to be very fast and
usable from any context whatsoever.  Mucking with TS is slower than a
complete save and restore of YMM state.

(*) kernel_fpu_begin is a bad name.  It's only safe to use integer
instructions inside a kernel_fpu_begin section because MXCSR (and the
387 equivalent) could contain garbage.

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread
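For readers following along, the two primitives being discussed boil down to roughly the following (paraphrased from the x86 code of that era and heavily simplified; the exact TS_USEDFPU test and save routine are elided behind placeholders, so treat this as a sketch rather than the real implementation):

/* irq_fpu_usable(), roughly: */
bool irq_fpu_usable_sketch(void)
{
	struct pt_regs *regs = get_irq_regs();

	return !in_interrupt()	||		/* process context is always fine   */
	       !regs		||		/* no interrupted context recorded  */
	       user_mode(regs)	||		/* we interrupted userspace         */
	       (read_cr0() & X86_CR0_TS);	/* TS=1: no kernel FPU section open */
}

/* kernel_fpu_begin(), roughly: */
void kernel_fpu_begin_sketch(void)
{
	preempt_disable();
	if (task_has_live_fpu_state(current))	/* placeholder for the TS_USEDFPU check */
		save_fpu_state(current);	/* placeholder for __save_init_fpu()    */
	else
		clts();				/* clear CR0.TS so no #NM is raised     */
}

The TS check in irq_fpu_usable() is exactly the "if TS=1 we cannot be inside a kernel_fpu_begin section" argument made above.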
* Re: x86 memcpy performance 2011-08-15 15:36 ` Andrew Lutomirski @ 2011-08-15 16:12 ` Borislav Petkov 2011-08-15 17:04 ` Andrew Lutomirski 2011-08-15 16:12 ` H. Peter Anvin 1 sibling, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2011-08-15 16:12 UTC (permalink / raw) To: Andrew Lutomirski Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote: >> But still, irq_fpu_usable() still checks !in_interrupt() which means >> that we don't want to run SSE instructions in IRQ context. OTOH, we >> still are fine when running with CR0.TS. So what happens when we get an >> #NM as a result of executing an FPU instruction in an IRQ handler? We >> will have to do init_fpu() on the current task if the last hasn't used >> math yet and do the slab allocation of the FPU context area (I'm looking >> at math_state_restore, btw). > > IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in > an interrupt and TS=1, when we know that we're not in a > kernel_fpu_begin section, so it's safe to start one (and do clts). Doh, yes, I see it now. This way we save the math state of the current process if needed and "disable" #NM exceptions until kernel_fpu_end() by clearing CR0.TS, sure. Thanks. > IMO this code is not very good, and I plan to fix it sooner or later. Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework. You could probably reuse some bits from there. The patchset should be in tip/x86/xsave. > I want kernel_fpu_begin (or its equivalent*) to be very fast and > usable from any context whatsoever. Mucking with TS is slower than a > complete save and restore of YMM state. Well, I had a SSE memcpy which saved/restored the XMM regs on the stack. This would obviate the need to muck with contexts but that could get expensive wrt stack operations. The advantage is that I'm not dealing with the whole FPU state but only with 16 XMM regs. I should probably dust off that version again and retest. Or, if we want to use SSE stuff in the kernel, we might think of allocating its own FPU context(s) and handle those... > (*) kernel_fpu_begin is a bad name. It's only safe to use integer > instructions inside a kernel_fpu_begin section because MXCSR (and the > 387 equivalent) could contain garbage. Well, do we want to use floating point instructions in the kernel? Thanks. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
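For reference, the "save the XMM registers on the stack" variant mentioned above could look roughly like this (a sketch only; a real version would additionally have to keep preemption/IRQs off for the duration and avoid emitting any VEX-encoded instructions, as discussed further down):

static void xmm_copy_regs_on_stack(void *to, const void *from, size_t len)
{
	/* scratch area for 4 registers x 16 bytes, 16-byte aligned */
	u8 save[4 * 16] __aligned(16);

	asm volatile("movaps %%xmm0, 0x00(%0)\n\t"
		     "movaps %%xmm1, 0x10(%0)\n\t"
		     "movaps %%xmm2, 0x20(%0)\n\t"
		     "movaps %%xmm3, 0x30(%0)\n\t"
		     : : "r" (save) : "memory");

	/* ... copy loop using only xmm0-xmm3, movaps/movups as usual ... */

	asm volatile("movaps 0x00(%0), %%xmm0\n\t"
		     "movaps 0x10(%0), %%xmm1\n\t"
		     "movaps 0x20(%0), %%xmm2\n\t"
		     "movaps 0x30(%0), %%xmm3\n\t"
		     : : "r" (save) : "memory");
}

The trade-off Boris mentions is visible here: the fewer registers you save, the cheaper the prologue/epilogue, but also the less data you can move per loop iteration.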
* Re: x86 memcpy performance 2011-08-15 16:12 ` Borislav Petkov @ 2011-08-15 17:04 ` Andrew Lutomirski 2011-08-15 18:49 ` Borislav Petkov 0 siblings, 1 reply; 40+ messages in thread From: Andrew Lutomirski @ 2011-08-15 17:04 UTC (permalink / raw) To: Borislav Petkov Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, Aug 15, 2011 at 12:12 PM, Borislav Petkov <bp@alien8.de> wrote: > On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote: >>> But still, irq_fpu_usable() still checks !in_interrupt() which means >>> that we don't want to run SSE instructions in IRQ context. OTOH, we >>> still are fine when running with CR0.TS. So what happens when we get an >>> #NM as a result of executing an FPU instruction in an IRQ handler? We >>> will have to do init_fpu() on the current task if the last hasn't used >>> math yet and do the slab allocation of the FPU context area (I'm looking >>> at math_state_restore, btw). >> >> IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in >> an interrupt and TS=1, when we know that we're not in a >> kernel_fpu_begin section, so it's safe to start one (and do clts). > > Doh, yes, I see it now. This way we save the math state of the current > process if needed and "disable" #NM exceptions until kernel_fpu_end() by > clearing CR0.TS, sure. Thanks. > >> IMO this code is not very good, and I plan to fix it sooner or later. > > Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework. > You could probably reuse some bits from there. The patchset should be in > tip/x86/xsave. > >> I want kernel_fpu_begin (or its equivalent*) to be very fast and >> usable from any context whatsoever. Mucking with TS is slower than a >> complete save and restore of YMM state. > > Well, I had a SSE memcpy which saved/restored the XMM regs on the stack. > This would obviate the need to muck with contexts but that could get > expensive wrt stack operations. The advantage is that I'm not dealing > with the whole FPU state but only with 16 XMM regs. I should probably > dust off that version again and retest. I bet it won't be a significant win. On Sandy Bridge, clts/stts takes 80 ns and a full state save+restore is only ~60 ns. Without infrastructure changes, I don't think you can avoid the clts and stts. You might be able to get away with turning off IRQs, reading CR0 to check TS, pushing XMM regs, and being very certain that you don't accidentally generate any VEX-coded instructions. > > Or, if we want to use SSE stuff in the kernel, we might think of > allocating its own FPU context(s) and handle those... I'm thinking of having a stack of FPU states to parallel irq stacks and IST stacks. It gets a little hairy when code inside kernel_fpu_begin traps for a non-irq non-IST reason, though. Fortunately, those are rare and all of the EX_TABLE users could mark xmm regs as clobbered (except for copy_from_user...). Keeping kernel_fpu_begin non-preemptable makes it less bad because the extra FPU state can be per-cpu and not per-task. This is extra fun on 32 bit, which IIRC doesn't have IST stacks. The major speedup will come from saving state in kernel_fpu_begin but not restoring it until the code in entry_??.S restores registers. > >> (*) kernel_fpu_begin is a bad name. It's only safe to use integer >> instructions inside a kernel_fpu_begin section because MXCSR (and the >> 387 equivalent) could contain garbage. 
> > Well, do we want to use floating point instructions in the kernel? The only use I could find is in staging. --Andy ^ permalink raw reply [flat|nested] 40+ messages in thread
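A very rough sketch of the "stack of FPU states" bookkeeping, just to make the idea concrete (all names and sizes invented; the real difficulty - wiring the restore into the entry code and handling faults inside the section - is exactly what is being discussed above):

#define KFPU_MAX_NEST	4			/* arbitrary for the sketch */

struct kfpu_level {
	u8 buf[512] __aligned(16);		/* room for an FXSAVE image */
};

static DEFINE_PER_CPU(struct kfpu_level, kfpu_stack[KFPU_MAX_NEST]);
static DEFINE_PER_CPU(int, kfpu_depth);

static void kfpu_push(void)			/* hypothetical nesting-aware begin */
{
	int d;

	preempt_disable();
	d = __get_cpu_var(kfpu_depth)++;
	if (d == 0)
		kernel_fpu_begin();		/* outermost level: existing slow path */
	else
		save_xmm_regs(&__get_cpu_var(kfpu_stack)[d]);	/* placeholder */
}

static void kfpu_pop(void)			/* hypothetical nesting-aware end */
{
	int d = --__get_cpu_var(kfpu_depth);

	if (d == 0)
		kernel_fpu_end();
	else
		restore_xmm_regs(&__get_cpu_var(kfpu_stack)[d]);	/* placeholder */
	preempt_enable();
}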
* Re: x86 memcpy performance 2011-08-15 17:04 ` Andrew Lutomirski @ 2011-08-15 18:49 ` Borislav Petkov 2011-08-15 19:11 ` Andrew Lutomirski 0 siblings, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2011-08-15 18:49 UTC (permalink / raw) To: Andrew Lutomirski Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote: >> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack. >> This would obviate the need to muck with contexts but that could get >> expensive wrt stack operations. The advantage is that I'm not dealing >> with the whole FPU state but only with 16 XMM regs. I should probably >> dust off that version again and retest. > > I bet it won't be a significant win. On Sandy Bridge, clts/stts takes > 80 ns and a full state save+restore is only ~60 ns. > Without infrastructure changes, I don't think you can avoid the clts > and stts. Yeah, probably. > You might be able to get away with turning off IRQs, reading CR0 to > check TS, pushing XMM regs, and being very certain that you don't > accidentally generate any VEX-coded instructions. That's ok - I'm using movaps/movups. But, the problem is that I still need to save FPU state if the task I'm interrupting has been using FPU instructions. So, I can't get away without saving the context in which case I don't need to save the XMM regs anyway. >> Or, if we want to use SSE stuff in the kernel, we might think of >> allocating its own FPU context(s) and handle those... > > I'm thinking of having a stack of FPU states to parallel irq stacks > and IST stacks. ... I'm guessing with the same nesting as hardirqs? Making FPU instructions usable in irq contexts too. > It gets a little hairy when code inside kernel_fpu_begin traps for a > non-irq non-IST reason, though. How does that happen? You're in the kernel with preemption disabled and TS cleared, what would cause the #NM? I think that if you need to switch context, you simply "push" the current FPU context, allocate a new one and clts as part of the FPU context switching, no? > Fortunately, those are rare and all of the EX_TABLE users could mark > xmm regs as clobbered (except for copy_from_user...). Well, copy_from_user... does a bunch of rep; movsq - if the SSE version shows reasonable speedup there, we might need to make those work too. > Keeping kernel_fpu_begin non-preemptable makes it less bad because the > extra FPU state can be per-cpu and not per-task. Yep. > This is extra fun on 32 bit, which IIRC doesn't have IST stacks. > > The major speedup will come from saving state in kernel_fpu_begin but > not restoring it until the code in entry_??.S restores registers. But you'd need to save each kernel FPU state when nesting, no? >>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer >>> instructions inside a kernel_fpu_begin section because MXCSR (and the >>> 387 equivalent) could contain garbage. >> >> Well, do we want to use floating point instructions in the kernel? > > The only use I could find is in staging. Exactly my point - I think we should do it only when it's really worth the trouble. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 18:49 ` Borislav Petkov @ 2011-08-15 19:11 ` Andrew Lutomirski 2011-08-15 20:05 ` Borislav Petkov 0 siblings, 1 reply; 40+ messages in thread From: Andrew Lutomirski @ 2011-08-15 19:11 UTC (permalink / raw) To: Borislav Petkov Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, Aug 15, 2011 at 2:49 PM, Borislav Petkov <bp@alien8.de> wrote: > On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote: >>> Or, if we want to use SSE stuff in the kernel, we might think of >>> allocating its own FPU context(s) and handle those... >> >> I'm thinking of having a stack of FPU states to parallel irq stacks >> and IST stacks. > > ... I'm guessing with the same nesting as hardirqs? Making FPU > instructions usable in irq contexts too. > >> It gets a little hairy when code inside kernel_fpu_begin traps for a >> non-irq non-IST reason, though. > > How does that happen? You're in the kernel with preemption disabled and > TS cleared, what would cause the #NM? I think that if you need to switch > context, you simply "push" the current FPU context, allocate a new one > and clts as part of the FPU context switching, no? Not #NM, but page faults can happen too (even just accessing vmalloc space). > >> Fortunately, those are rare and all of the EX_TABLE users could mark >> xmm regs as clobbered (except for copy_from_user...). > > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version > shows reasonable speedup there, we might need to make those work too. I'm a little surprised that SSE beats fast string operations, but I guess benchmarking always wins. > >> Keeping kernel_fpu_begin non-preemptable makes it less bad because the >> extra FPU state can be per-cpu and not per-task. > > Yep. > >> This is extra fun on 32 bit, which IIRC doesn't have IST stacks. >> >> The major speedup will come from saving state in kernel_fpu_begin but >> not restoring it until the code in entry_??.S restores registers. > > But you'd need to save each kernel FPU state when nesting, no? > Yes. But we don't nest that much, and the save/restore isn't all that expensive. And we don't have to save/restore unless kernel entries nest and both entries try to use kernel_fpu_begin at the same time. This whole project may take awhile. The code in there is a poorly-documented mess, even after Hans' cleanups. (It's a lot worse without them, though.) --Andy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 19:11 ` Andrew Lutomirski @ 2011-08-15 20:05 ` Borislav Petkov 2011-08-15 20:08 ` Andrew Lutomirski 0 siblings, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2011-08-15 20:05 UTC (permalink / raw) To: Andrew Lutomirski Cc: melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote: > > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version > > shows reasonable speedup there, we might need to make those work too. > > I'm a little surprised that SSE beats fast string operations, but I > guess benchmarking always wins. If by fast string operations you mean X86_FEATURE_ERMS, then that's Intel-only and that actually would need to be benchmarked separately. Currently, I see speedup for large(r) buffers only vs rep; movsq. But I dunno about rep; movsb's enhanced rep string tricks Intel does. > Yes. But we don't nest that much, and the save/restore isn't all that > expensive. And we don't have to save/restore unless kernel entries > nest and both entries try to use kernel_fpu_begin at the same time. Yep. > This whole project may take awhile. The code in there is a > poorly-documented mess, even after Hans' cleanups. (It's a lot worse > without them, though.) Oh yeah, this code could use lotsa scrubbing :) -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 20:05 ` Borislav Petkov @ 2011-08-15 20:08 ` Andrew Lutomirski 0 siblings, 0 replies; 40+ messages in thread From: Andrew Lutomirski @ 2011-08-15 20:08 UTC (permalink / raw) To: Borislav Petkov, Andrew Lutomirski, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, Aug 15, 2011 at 4:05 PM, Borislav Petkov <bp@alien8.de> wrote: > On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote: >> > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version >> > shows reasonable speedup there, we might need to make those work too. >> >> I'm a little surprised that SSE beats fast string operations, but I >> guess benchmarking always wins. > > If by fast string operations you mean X86_FEATURE_ERMS, then that's > Intel-only and that actually would need to be benchmarked separately. > Currently, I see speedup for large(r) buffers only vs rep; movsq. But I > dunno about rep; movsb's enhanced rep string tricks Intel does. I meant X86_FEATURE_REP_GOOD. (That may also be Intel-only, but it sounds like rep;movsq might move whole cachelines on cpus at least a few generations back.) I don't know if any ERMS cpus exist yet. ^ permalink raw reply [flat|nested] 40+ messages in thread
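If a runtime choice between the variants ever became necessary, it would presumably key off the same feature flags, along these lines (sketch only - the variant names are invented here, and the real 64-bit memcpy selects its rep-string path via alternatives patching in memcpy_64.S rather than a branch per call):

void *memcpy_dispatch(void *to, const void *from, size_t len)
{
	if (len >= 512 && irq_fpu_usable())
		return __sse_memcpy(to, from, len);	/* the SSE version        */

	if (boot_cpu_has(X86_FEATURE_ERMS))
		return memcpy_erms(to, from, len);	/* plain rep movsb        */

	if (boot_cpu_has(X86_FEATURE_REP_GOOD))
		return memcpy_rep_movsq(to, from, len);	/* rep movsq              */

	return memcpy_unrolled(to, from, len);		/* quadword copy loop     */
}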
* Re: x86 memcpy performance 2011-08-15 15:36 ` Andrew Lutomirski 2011-08-15 16:12 ` Borislav Petkov @ 2011-08-15 16:12 ` H. Peter Anvin 2011-08-15 16:58 ` Andrew Lutomirski 1 sibling, 1 reply; 40+ messages in thread From: H. Peter Anvin @ 2011-08-15 16:12 UTC (permalink / raw) To: Andrew Lutomirski Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On 08/15/2011 08:36 AM, Andrew Lutomirski wrote: > > (*) kernel_fpu_begin is a bad name. It's only safe to use integer > instructions inside a kernel_fpu_begin section because MXCSR (and the > 387 equivalent) could contain garbage. > Uh... no, it just means you have to initialize the settings. It's a perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin. -hpa ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 16:12 ` H. Peter Anvin @ 2011-08-15 16:58 ` Andrew Lutomirski 2011-08-15 18:26 ` H. Peter Anvin 0 siblings, 1 reply; 40+ messages in thread From: Andrew Lutomirski @ 2011-08-15 16:58 UTC (permalink / raw) To: H. Peter Anvin Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote: > On 08/15/2011 08:36 AM, Andrew Lutomirski wrote: >> >> (*) kernel_fpu_begin is a bad name. It's only safe to use integer >> instructions inside a kernel_fpu_begin section because MXCSR (and the >> 387 equivalent) could contain garbage. >> > > Uh... no, it just means you have to initialize the settings. It's a > perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin. I prefer get_xstate / put_xstate, but this could rapidly devolve into bikeshedding. :) --Andy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 16:58 ` Andrew Lutomirski @ 2011-08-15 18:26 ` H. Peter Anvin 2011-08-15 18:35 ` Andrew Lutomirski 0 siblings, 1 reply; 40+ messages in thread From: H. Peter Anvin @ 2011-08-15 18:26 UTC (permalink / raw) To: Andrew Lutomirski Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On 08/15/2011 09:58 AM, Andrew Lutomirski wrote: > On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote: >>> >>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer >>> instructions inside a kernel_fpu_begin section because MXCSR (and the >>> 387 equivalent) could contain garbage. >>> >> >> Uh... no, it just means you have to initialize the settings. It's a >> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin. > > I prefer get_xstate / put_xstate, but this could rapidly devolve into > bikeshedding. :) > a) Quite. b) xstate is not architecture-neutral. -hpa ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 18:26 ` H. Peter Anvin @ 2011-08-15 18:35 ` Andrew Lutomirski 2011-08-15 18:52 ` H. Peter Anvin 0 siblings, 1 reply; 40+ messages in thread From: Andrew Lutomirski @ 2011-08-15 18:35 UTC (permalink / raw) To: H. Peter Anvin Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Mon, Aug 15, 2011 at 2:26 PM, H. Peter Anvin <hpa@zytor.com> wrote: > On 08/15/2011 09:58 AM, Andrew Lutomirski wrote: >> On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <hpa@zytor.com> wrote: >>> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote: >>>> >>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer >>>> instructions inside a kernel_fpu_begin section because MXCSR (and the >>>> 387 equivalent) could contain garbage. >>>> >>> >>> Uh... no, it just means you have to initialize the settings. It's a >>> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin. >> >> I prefer get_xstate / put_xstate, but this could rapidly devolve into >> bikeshedding. :) >> > > a) Quite. > > b) xstate is not architecture-neutral. Are there any architecture-neutral users of this thing? If I were writing generic code, I would expect: kernel_fpu_begin(); foo *= 1.5; kernel_fpu_end(); to work, but I would not expect: kernel_fpu_begin(); use_xmm_registers(); kernel_fpu_end(); to make any sense. Since the former does not actually work, I would hope that there is no non-x86-specific user. --Andy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-15 18:35 ` Andrew Lutomirski @ 2011-08-15 18:52 ` H. Peter Anvin 0 siblings, 0 replies; 40+ messages in thread From: H. Peter Anvin @ 2011-08-15 18:52 UTC (permalink / raw) To: Andrew Lutomirski Cc: Borislav Petkov, melwyn lobo, Denys Vlasenko, Ingo Molnar, linux-kernel, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On 08/15/2011 11:35 AM, Andrew Lutomirski wrote: > > Are there any architecture-neutral users of this thing? Look at the RAID-6 code, for example. It makes the various architecture-specific codes look more similar. -hpa ^ permalink raw reply [flat|nested] 40+ messages in thread
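The pattern hpa is pointing at reduces to something like the following - integer SSE only between begin/end, so an uninitialized MXCSR cannot hurt (a sketch in the spirit of the RAID-6 SSE2 code, not the actual lib/raid6 source; assumes 16-byte aligned buffers and a length that is a multiple of 16):

static void xor_block_sse2(u8 *dst, const u8 *src, size_t bytes)
{
	size_t i;

	kernel_fpu_begin();
	for (i = 0; i < bytes; i += 16)
		asm volatile("movdqa (%0), %%xmm0\n\t"
			     "pxor   (%1), %%xmm0\n\t"	/* integer op, MXCSR unused */
			     "movdqa %%xmm0, (%0)\n\t"
			     : : "r" (dst + i), "r" (src + i) : "memory");
	kernel_fpu_end();
}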
* Re: x86 memcpy performance
  2011-08-15 14:55 x86 memcpy performance Borislav Petkov
  2011-08-15 14:59 ` Andy Lutomirski
@ 2011-08-16  7:19 ` melwyn lobo
  2011-08-16  7:43 ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: melwyn lobo @ 2011-08-16 7:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin,
      Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

> Yes, on 32-bit you're using the compiler-supplied version
> __builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
> and above. Reportedly, using __builtin_memcpy generates better code.
>
> Btw, my version of SSE memcpy is 64-bit only.
>
> --
> Regards/Gruss,
> Boris.
>
>

We would rather use a 32-bit patch. Have you already got one? How can I
use SSE3 for 32-bit? I don't think you have submitted the 64-bit patch
to mainline - is there still work ongoing on this?

Regards,
Melwyn

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-16 7:19 ` melwyn lobo @ 2011-08-16 7:43 ` Borislav Petkov 0 siblings, 0 replies; 40+ messages in thread From: Borislav Petkov @ 2011-08-16 7:43 UTC (permalink / raw) To: melwyn lobo Cc: Denys Vlasenko, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Tue, Aug 16, 2011 at 12:49:28PM +0530, melwyn lobo wrote: > We would rather use the 32 bit patch. Have you already got a 32 bit > patch. Nope, only 64-bit for now, sorry. > How can I use sse3 for 32 bit. Well, OTTOMH, you have only 8 xmm regs in 32-bit instead of 16, which should halve the performance of the 64-bit version in a perfect world. However, we don't know how the performance of a 32-bit SSE memcpy version behaves vs the gcc builtin one - that would require benchmarking too. But other than that, I don't see a problem with having a 32-bit version. > I don't think you have submitted 64 bit patch in the mainline. > Is there still work ongoing on this. Yeah, we are currently benchmarking it to see whether it actually makes sense to even have SSE memcpy in the kernel. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
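To make the register-count point concrete, a 32-bit inner loop would have to make do with xmm0-xmm7 and hence move 128 bytes per iteration instead of 256. A sketch of the aligned case, modelled on the 64-bit loop from the patch posted elsewhere in this thread (illustration only, not a tested patch):

static void *__sse_memcpy_32(void *to, const void *from, size_t len)
{
	void *p = to;
	size_t i;

	kernel_fpu_begin();
	for (i = 0; i < (len & ~0x7fUL); i += 128) {
		asm volatile(
			"movaps 0x00(%0), %%xmm0\n\t"
			"movaps 0x10(%0), %%xmm1\n\t"
			"movaps 0x20(%0), %%xmm2\n\t"
			"movaps 0x30(%0), %%xmm3\n\t"
			"movaps 0x40(%0), %%xmm4\n\t"
			"movaps 0x50(%0), %%xmm5\n\t"
			"movaps 0x60(%0), %%xmm6\n\t"
			"movaps 0x70(%0), %%xmm7\n\t"

			"movaps %%xmm0, 0x00(%1)\n\t"
			"movaps %%xmm1, 0x10(%1)\n\t"
			"movaps %%xmm2, 0x20(%1)\n\t"
			"movaps %%xmm3, 0x30(%1)\n\t"
			"movaps %%xmm4, 0x40(%1)\n\t"
			"movaps %%xmm5, 0x50(%1)\n\t"
			"movaps %%xmm6, 0x60(%1)\n\t"
			"movaps %%xmm7, 0x70(%1)\n\t"
			: : "r" (from), "r" (to) : "memory");
		from += 128;
		to += 128;
	}
	__memcpy(to, from, len & 0x7f);		/* trailer */
	kernel_fpu_end();

	return p;
}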
* x86 memcpy performance
@ 2011-08-12 17:59 melwyn lobo
  2011-08-12 18:33 ` Andi Kleen
  2011-08-12 19:52 ` Ingo Molnar
  0 siblings, 2 replies; 40+ messages in thread
From: melwyn lobo @ 2011-08-12 17:59 UTC (permalink / raw)
  To: linux-kernel

Hi All,
Our video recorder application uses memcpy for every frame - about 2KB
of data per frame on an Intel® Atom™ Z5xx processor.
With the default 2.6.35 kernel we got 19.6 fps. But it seems the kernel's
memcpy implementation is suboptimal, because when we replaced it with an
optimized one (using SSSE3; exact patches are currently being finalized)
we obtained 22 fps, a gain of 12.2%.
C0 residency also reduced from 75% to 67%, which means power benefits too.
My questions:
1. Is the kernel memcpy profiled for optimal performance?
2. Does the default kernel configuration for i386 include the best
memcpy implementation (AMD 3DNOW!, __builtin_memcpy, etc.)?

Any suggestions or prior experience on this is welcome.

Thanks,
M.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
  2011-08-12 17:59 melwyn lobo
@ 2011-08-12 18:33 ` Andi Kleen
  0 siblings, 0 replies; 40+ messages in thread
From: Andi Kleen @ 2011-08-12 18:33 UTC (permalink / raw)
  To: melwyn lobo; +Cc: linux-kernel

melwyn lobo <linux.melwyn@gmail.com> writes:

> Hi All,
> Our Video recorder application uses memcpy for every frame. About 2KB
> data every frame on Intel® Atom™ Z5xx processor.
> With default 2.6.35 kernel we got 19.6 fps. But it seems kernel
> implemented memcpy is suboptimal, because when we replaced
> with an optmized one (using ssse3, exact patches are currently being
> finalized) ew obtained 22fps a gain of 12.2 %.

SSE3 in the kernel memcpy would be incredibly expensive: it would need a
full FPU save for every call, with preemption disabled. I haven't seen
your patches, but until you get all that right (and add a lot more
overhead to most copies) you currently have a good chance of corrupting
user FPU state.

> C0 residency also reduced from 75% to 67%. This means power benefits too.
> My questions:
> 1. Is kernel memcpy profiled for optimal performance.

It depends on the CPU. There have been some improvements for Atom on
newer kernels, I believe.

But then the kernel memcpy is usually optimized for relatively small
copies (<= 4K) because very few kernel loads do more.

-Andi

--
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-12 17:59 melwyn lobo 2011-08-12 18:33 ` Andi Kleen @ 2011-08-12 19:52 ` Ingo Molnar 2011-08-14 9:59 ` Borislav Petkov 1 sibling, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2011-08-12 19:52 UTC (permalink / raw) To: melwyn lobo Cc: linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra * melwyn lobo <linux.melwyn@gmail.com> wrote: > Hi All, > Our Video recorder application uses memcpy for every frame. About 2KB > data every frame on Intel® Atom™ Z5xx processor. > With default 2.6.35 kernel we got 19.6 fps. But it seems kernel > implemented memcpy is suboptimal, because when we replaced > with an optmized one (using ssse3, exact patches are currently being > finalized) ew obtained 22fps a gain of 12.2 %. > C0 residency also reduced from 75% to 67%. This means power benefits too. > My questions: > 1. Is kernel memcpy profiled for optimal performance. > 2. Does the default kernel configuration for i386 include the best > memcpy implementation (AMD 3DNOW, __builtin_memcpy .... etc) > > Any suggestions, prior experience on this is welcome. Sounds very interesting - it would be nice to see 'perf record' + 'perf report' profiles done on that workload, before and after your patches. The thing is, we obviously want to achieve those gains of 12.2% fps and while we probably do not want to switch the kernel's memcpy to SSE right now (the save/restore costs are significant), we could certainly try to optimize the specific codepath that your video playback path is hitting. If it's some bulk memcpy in a key video driver then we could offer a bulk-optimized x86 memcpy variant which could be called from that driver - and that could use SSE3 as well. So yes, if the speedup is real then i'm sure we can achieve that speedup - but exact profiles and measurements would have to be shown. Thanks, Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
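Ingo's "bulk variant callable from the driver" idea can stay very small on the driver side - a single helper with a size cut-off, used only where the driver knows it is moving whole frames (hypothetical sketch; __sse_memcpy here refers to the 64-bit SSE routine posted later in this thread, which already falls back to __memcpy() in interrupt context):

/* driver-side helper for large, known-bulk copies (e.g. ~2KB video frames) */
static inline void *bulk_memcpy(void *dst, const void *src, size_t len)
{
	if (len >= 512)
		return __sse_memcpy(dst, src, len);
	return memcpy(dst, src, len);
}

That keeps the FPU save/restore cost off every small copy in the kernel while still letting the one hot path in the video driver benefit.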
* Re: x86 memcpy performance
  2011-08-12 19:52 ` Ingo Molnar
@ 2011-08-14  9:59 ` Borislav Petkov
  2011-08-14 11:13 ` Denys Vlasenko
  2011-08-16  2:34 ` Valdis.Kletnieks
  0 siblings, 2 replies; 40+ messages in thread
From: Borislav Petkov @ 2011-08-14 9:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner,
      Linus Torvalds, Peter Zijlstra, borislav.petkov

[-- Attachment #1: Type: text/plain, Size: 12636 bytes --]

On Fri, Aug 12, 2011 at 09:52:20PM +0200, Ingo Molnar wrote:
> Sounds very interesting - it would be nice to see 'perf record' +
> 'perf report' profiles done on that workload, before and after your
> patches.

FWIW, I've been playing with an SSE memcpy version for the kernel
recently too, here's what I have so far:

First of all, I did a trace of all the memcpy buffer sizes used while
building a kernel, see attached kernel_build.sizes. On the one hand,
there is a large number of small chunks copied (1.1M of 1.2M calls
total), and, on the other, a relatively small number of larger sized mem
copies (256 - 2048 bytes) which are about 100K in total but which account
for the larger cumulative amount of data copied: 138MB of 175MB total.
So, if the buffer copied is big enough, the context save/restore cost
might be something we're willing to pay.

I've implemented the SSE memcpy first in userspace to measure the
speedup vs the memcpy_64 we have right now:

Benchmarking with 10000 iterations, average results:
size	XM	MM	speedup (speedup = MM/XM)
119	540.58	449.491	0.8314969419
189	296.318	263.507	0.8892692985
206	297.949	271.399	0.9108923485
224	255.565	235.38	0.9210161798
221	299.383	276.628	0.9239941159
245	299.806	279.432	0.9320430545
369	314.774	316.89	1.006721324
425	327.536	330.475	1.00897153
439	330.847	334.532	1.01113687
458	333.159	340.124	1.020904708
503	334.44	352.166	1.053003229
767	375.612	429.949	1.144661625
870	358.888	312.572	0.8709465025
882	394.297	454.977	1.153893229
925	403.82	472.56	1.170222413
1009	407.147	490.171	1.203915735
1525	512.059	660.133	1.289174911
1737	556.85	725.552	1.302958536
1778	533.839	711.59	1.332965994
1864	558.06	745.317	1.335549882
2039	585.915	813.806	1.388949687
3068	766.462	1105.56	1.442422252
3471	883.983	1239.99	1.40272883
3570	895.822	1266.74	1.414057295
3748	906.832	1302.4	1.436212771
4086	957.649	1486.93	1.552686041
6130	1238.45	1996.42	1.612023046
6961	1413.11	2201.55	1.557939181
7162	1385.5	2216.49	1.59977178
7499	1440.87	2330.12	1.617158856
8182	1610.74	2720.45	1.688950194
12273	2307.86	4042.88	1.751787902
13924	2431.8	4224.48	1.737184756
14335	2469.4	4218.82	1.708440514
15018	2675.67	1904.07	0.711622886
16374	2989.75	5296.26	1.771470902
24564	4262.15	7696.86	1.805863077
27852	4362.53	3347.72	0.7673805572
28672	5122.8	7113.14	1.388524413
30033	4874.62	8740.04	1.792967931
32768	6014.78	7564.2	1.257603505
49142	14464.2	21114.2	1.459757233
55702	16055	23496.8	1.463523623
57339	16725.7	24553.8	1.46803388
60073	17451.5	24407.3	1.398579162

Size is with randomly generated misalignment to test the implementation.

I've implemented the SSE memcpy similar to arch/x86/lib/mmx_32.c and did
some kernel build traces:

with SSE memcpy
===============

 Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

    3301761.517649 task-clock                #   24.001 CPUs utilized            ( +-  1.48% )
           520,658 context-switches          #    0.000 M/sec                    ( +-  0.25% )
            63,845 CPU-migrations            #    0.000 M/sec                    ( +-  0.58% )
        26,070,835 page-faults               #    0.008 M/sec                    ( +-  0.00% )
 1,812,482,599,021 cycles                    #    0.549 GHz                      ( +-  0.85% ) [64.55%]
   551,783,051,492 stalled-cycles-frontend   #   30.44% frontend cycles idle     ( +-  0.98% ) [65.64%]
   444,996,901,060 stalled-cycles-backend    #   24.55% backend  cycles idle     ( +-  1.15% ) [67.16%]
 1,488,917,931,766 instructions              #    0.82  insns per cycle
                                             #    0.37  stalled cycles per insn  ( +-  0.91% ) [69.25%]
   340,575,978,517 branches                  #  103.150 M/sec                    ( +-  0.99% ) [68.29%]
    21,519,667,206 branch-misses             #    6.32% of all branches          ( +-  1.09% ) [65.11%]

     137.567155255 seconds time elapsed                                          ( +-  1.48% )

plain 3.0
=========

 Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

    3504754.425527 task-clock                #   24.001 CPUs utilized            ( +-  1.31% )
           518,139 context-switches          #    0.000 M/sec                    ( +-  0.32% )
            61,790 CPU-migrations            #    0.000 M/sec                    ( +-  0.73% )
        26,056,947 page-faults               #    0.007 M/sec                    ( +-  0.00% )
 1,826,757,751,616 cycles                    #    0.521 GHz                      ( +-  0.66% ) [63.86%]
   557,800,617,954 stalled-cycles-frontend   #   30.54% frontend cycles idle     ( +-  0.79% ) [64.65%]
   443,950,768,357 stalled-cycles-backend    #   24.30% backend  cycles idle     ( +-  0.60% ) [67.07%]
 1,469,707,613,500 instructions              #    0.80  insns per cycle
                                             #    0.38  stalled cycles per insn  ( +-  0.68% ) [69.98%]
   335,560,565,070 branches                  #   95.744 M/sec                    ( +-  0.67% ) [69.09%]
    21,365,279,176 branch-misses             #    6.37% of all branches          ( +-  0.65% ) [65.36%]

     146.025263276 seconds time elapsed                                          ( +-  1.31% )

So, although a kernel build is probably not the proper workload for an
SSE memcpy routine, I'm seeing a 9 secs build time improvement, i.e.
something around 6%. We're executing a bit more instructions but I'd say
the amount of data moved per instruction is higher due to the quadword
moves.

Here's the SSE memcpy version I got so far. I haven't wired in the
proper CPU feature detection yet because we want to run more benchmarks
like netperf and stuff to see whether we see any positive results there.

The SYSTEM_RUNNING check is to take care of early boot situations where
we can't handle FPU exceptions but we use memcpy. There's an aligned and
a misaligned variant which should handle any buffers and sizes, although
I've set the SSE memcpy threshold at a buffer size of at least 512 bytes
to cover the context save/restore cost somewhat.

Comments are much appreciated!
:-) -- >From 385519e844f3466f500774c2c37afe44691ef8d2 Mon Sep 17 00:00:00 2001 From: Borislav Petkov <borislav.petkov@amd.com> Date: Thu, 11 Aug 2011 18:43:08 +0200 Subject: [PATCH] SSE3 memcpy in C Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> --- arch/x86/include/asm/string_64.h | 14 ++++- arch/x86/lib/Makefile | 2 +- arch/x86/lib/sse_memcpy_64.c | 133 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 146 insertions(+), 3 deletions(-) create mode 100644 arch/x86/lib/sse_memcpy_64.c diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index 19e2c46..7bd51bb 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t #define __HAVE_ARCH_MEMCPY 1 #ifndef CONFIG_KMEMCHECK +extern void *__memcpy(void *to, const void *from, size_t len); +extern void *__sse_memcpy(void *to, const void *from, size_t len); #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4 -extern void *memcpy(void *to, const void *from, size_t len); +#define memcpy(dst, src, len) \ +({ \ + size_t __len = (len); \ + void *__ret; \ + if (__len >= 512) \ + __ret = __sse_memcpy((dst), (src), __len); \ + else \ + __ret = __memcpy((dst), (src), __len); \ + __ret; \ +}) #else -extern void *__memcpy(void *to, const void *from, size_t len); #define memcpy(dst, src, len) \ ({ \ size_t __len = (len); \ diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index f2479f1..5f90709 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -36,7 +36,7 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y) endif lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o else - obj-y += iomap_copy_64.o + obj-y += iomap_copy_64.o sse_memcpy_64.o lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o lib-y += thunk_64.o clear_page_64.o copy_page_64.o lib-y += memmove_64.o memset_64.o diff --git a/arch/x86/lib/sse_memcpy_64.c b/arch/x86/lib/sse_memcpy_64.c new file mode 100644 index 0000000..b53fc31 --- /dev/null +++ b/arch/x86/lib/sse_memcpy_64.c @@ -0,0 +1,133 @@ +#include <linux/module.h> + +#include <asm/i387.h> +#include <asm/string_64.h> + +void *__sse_memcpy(void *to, const void *from, size_t len) +{ + unsigned long src = (unsigned long)from; + unsigned long dst = (unsigned long)to; + void *p = to; + int i; + + if (in_interrupt()) + return __memcpy(to, from, len); + + if (system_state != SYSTEM_RUNNING) + return __memcpy(to, from, len); + + kernel_fpu_begin(); + + /* check alignment */ + if ((src ^ dst) & 0xf) + goto unaligned; + + if (src & 0xf) { + u8 chunk = 0x10 - (src & 0xf); + + /* copy chunk until next 16-byte */ + __memcpy(to, from, chunk); + len -= chunk; + to += chunk; + from += chunk; + } + + /* + * copy in 256 Byte portions + */ + for (i = 0; i < (len & ~0xff); i += 256) { + asm volatile( + "movaps 0x0(%0), %%xmm0\n\t" + "movaps 0x10(%0), %%xmm1\n\t" + "movaps 0x20(%0), %%xmm2\n\t" + "movaps 0x30(%0), %%xmm3\n\t" + "movaps 0x40(%0), %%xmm4\n\t" + "movaps 0x50(%0), %%xmm5\n\t" + "movaps 0x60(%0), %%xmm6\n\t" + "movaps 0x70(%0), %%xmm7\n\t" + "movaps 0x80(%0), %%xmm8\n\t" + "movaps 0x90(%0), %%xmm9\n\t" + "movaps 0xa0(%0), %%xmm10\n\t" + "movaps 0xb0(%0), %%xmm11\n\t" + "movaps 0xc0(%0), %%xmm12\n\t" + "movaps 0xd0(%0), %%xmm13\n\t" + "movaps 0xe0(%0), %%xmm14\n\t" + "movaps 0xf0(%0), %%xmm15\n\t" + + "movaps %%xmm0, 0x0(%1)\n\t" + "movaps %%xmm1, 0x10(%1)\n\t" + "movaps %%xmm2, 0x20(%1)\n\t" + "movaps %%xmm3, 0x30(%1)\n\t" + "movaps %%xmm4, 0x40(%1)\n\t" + "movaps %%xmm5, 
0x50(%1)\n\t" + "movaps %%xmm6, 0x60(%1)\n\t" + "movaps %%xmm7, 0x70(%1)\n\t" + "movaps %%xmm8, 0x80(%1)\n\t" + "movaps %%xmm9, 0x90(%1)\n\t" + "movaps %%xmm10, 0xa0(%1)\n\t" + "movaps %%xmm11, 0xb0(%1)\n\t" + "movaps %%xmm12, 0xc0(%1)\n\t" + "movaps %%xmm13, 0xd0(%1)\n\t" + "movaps %%xmm14, 0xe0(%1)\n\t" + "movaps %%xmm15, 0xf0(%1)\n\t" + : : "r" (from), "r" (to) : "memory"); + + from += 256; + to += 256; + } + + goto trailer; + +unaligned: + /* + * copy in 256 Byte portions unaligned + */ + for (i = 0; i < (len & ~0xff); i += 256) { + asm volatile( + "movups 0x0(%0), %%xmm0\n\t" + "movups 0x10(%0), %%xmm1\n\t" + "movups 0x20(%0), %%xmm2\n\t" + "movups 0x30(%0), %%xmm3\n\t" + "movups 0x40(%0), %%xmm4\n\t" + "movups 0x50(%0), %%xmm5\n\t" + "movups 0x60(%0), %%xmm6\n\t" + "movups 0x70(%0), %%xmm7\n\t" + "movups 0x80(%0), %%xmm8\n\t" + "movups 0x90(%0), %%xmm9\n\t" + "movups 0xa0(%0), %%xmm10\n\t" + "movups 0xb0(%0), %%xmm11\n\t" + "movups 0xc0(%0), %%xmm12\n\t" + "movups 0xd0(%0), %%xmm13\n\t" + "movups 0xe0(%0), %%xmm14\n\t" + "movups 0xf0(%0), %%xmm15\n\t" + + "movups %%xmm0, 0x0(%1)\n\t" + "movups %%xmm1, 0x10(%1)\n\t" + "movups %%xmm2, 0x20(%1)\n\t" + "movups %%xmm3, 0x30(%1)\n\t" + "movups %%xmm4, 0x40(%1)\n\t" + "movups %%xmm5, 0x50(%1)\n\t" + "movups %%xmm6, 0x60(%1)\n\t" + "movups %%xmm7, 0x70(%1)\n\t" + "movups %%xmm8, 0x80(%1)\n\t" + "movups %%xmm9, 0x90(%1)\n\t" + "movups %%xmm10, 0xa0(%1)\n\t" + "movups %%xmm11, 0xb0(%1)\n\t" + "movups %%xmm12, 0xc0(%1)\n\t" + "movups %%xmm13, 0xd0(%1)\n\t" + "movups %%xmm14, 0xe0(%1)\n\t" + "movups %%xmm15, 0xf0(%1)\n\t" + : : "r" (from), "r" (to) : "memory"); + + from += 256; + to += 256; + } + +trailer: + __memcpy(to, from, len & 0xff); + + kernel_fpu_end(); + + return p; +} +EXPORT_SYMBOL_GPL(__sse_memcpy); -- 1.7.6.134.gcf13f6 -- Regards/Gruss, Boris. [-- Attachment #2: kernel_build.sizes --] [-- Type: text/plain, Size: 925 bytes --] Bytes Count ===== ===== 0 5447 1 3850 2 16255 3 11113 4 68870 5 4256 6 30433 7 19188 8 50490 9 5999 10 78275 11 5628 12 6870 13 7371 14 4742 15 4911 16 143835 17 14096 18 1573 19 13603 20 424321 21 741 22 584 23 450 24 472 25 685 26 367 27 365 28 333 29 301 30 300 31 269 32 489 33 272 34 266 35 220 36 239 37 209 38 249 39 235 40 207 41 181 42 150 43 98 44 194 45 66 46 62 47 52 48 67226 49 138 50 171 51 26 52 20 53 12 54 15 55 4 56 13 57 8 58 6 59 6 60 115 61 10 62 5 63 12 64 67353 65 6 66 2363 67 9 68 11 69 6 70 5 71 6 72 10 73 4 74 9 75 8 76 4 77 6 78 3 79 4 80 3 81 4 82 4 83 4 84 4 85 8 86 6 87 2 88 3 89 2 90 2 91 1 92 9 93 1 94 2 96 2 97 2 98 3 100 2 102 1 104 1 105 1 106 1 107 2 109 1 110 1 111 1 112 1 113 2 115 2 117 1 118 1 119 1 120 14 127 1 128 1 130 1 131 2 134 2 137 1 144 100092 149 1 151 1 153 1 158 1 185 1 217 4 224 3 225 3 227 3 244 1 254 5 255 13 256 21708 512 21746 848 12907 1920 36536 2048 21708 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance
  2011-08-14  9:59 ` Borislav Petkov
@ 2011-08-14 11:13 ` Denys Vlasenko
  2011-08-14 12:40 ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: Denys Vlasenko @ 2011-08-14 11:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin,
      Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov

On Sunday 14 August 2011 11:59, Borislav Petkov wrote:
> Here's the SSE memcpy version I got so far, I haven't wired in the
> proper CPU feature detection yet because we want to run more benchmarks
> like netperf and stuff to see whether we see any positive results there.
>
> The SYSTEM_RUNNING check is to take care of early boot situations where
> we can't handle FPU exceptions but we use memcpy. There's an aligned and
> misaligned variant which should handle any buffers and sizes although
> I've set the SSE memcpy threshold at 512 Bytes buffersize the least to
> cover context save/restore somewhat.
>
> Comments are much appreciated! :-)
>
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
>
>  #define __HAVE_ARCH_MEMCPY 1
>  #ifndef CONFIG_KMEMCHECK
> +extern void *__memcpy(void *to, const void *from, size_t len);
> +extern void *__sse_memcpy(void *to, const void *from, size_t len);
>  #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
> -extern void *memcpy(void *to, const void *from, size_t len);
> +#define memcpy(dst, src, len)					\
> +({								\
> +	size_t __len = (len);					\
> +	void *__ret;						\
> +	if (__len >= 512)					\
> +		__ret = __sse_memcpy((dst), (src), __len);	\
> +	else							\
> +		__ret = __memcpy((dst), (src), __len);		\
> +	__ret;							\
> +})

Please, no. Do not inline every memcpy invocation.
This is pure bloat (considering how many memcpy calls there are)
and it doesn't even win anything in speed, since there will be
a function call either way.
Put the __len >= 512 check inside your memcpy instead.

You may do the check if you know that __len is constant:
if (__builtin_constant_p(__len) && __len >= 512) ...
because in this case gcc will evaluate it at compile-time.

-- 
vda

^ permalink raw reply	[flat|nested] 40+ messages in thread
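Spelled out, Denys' suggestion makes the call-site check free for constant sizes and pushes the runtime cut-off into the out-of-line function - roughly:

/*
 * Sketch: gcc folds the branch away whenever the size is a compile-time
 * constant; non-constant sizes take a plain call and __memcpy() (or
 * __sse_memcpy() itself) does the >= 512 check at runtime instead.
 */
#define memcpy(dst, src, len)					\
({								\
	size_t __len = (len);					\
	void *__ret;						\
	if (__builtin_constant_p(__len) && __len >= 512)	\
		__ret = __sse_memcpy((dst), (src), __len);	\
	else							\
		__ret = __memcpy((dst), (src), __len);		\
	__ret;							\
})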
* Re: x86 memcpy performance 2011-08-14 11:13 ` Denys Vlasenko @ 2011-08-14 12:40 ` Borislav Petkov 2011-08-15 13:27 ` melwyn lobo 2011-08-15 13:44 ` Denys Vlasenko 0 siblings, 2 replies; 40+ messages in thread From: Borislav Petkov @ 2011-08-14 12:40 UTC (permalink / raw) To: Denys Vlasenko Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Sun, Aug 14, 2011 at 01:13:56PM +0200, Denys Vlasenko wrote: > On Sunday 14 August 2011 11:59, Borislav Petkov wrote: > > Here's the SSE memcpy version I got so far, I haven't wired in the > > proper CPU feature detection yet because we want to run more benchmarks > > like netperf and stuff to see whether we see any positive results there. > > > > The SYSTEM_RUNNING check is to take care of early boot situations where > > we can't handle FPU exceptions but we use memcpy. There's an aligned and > > misaligned variant which should handle any buffers and sizes although > > I've set the SSE memcpy threshold at 512 Bytes buffersize the least to > > cover context save/restore somewhat. > > > > Comments are much appreciated! :-) > > > > --- a/arch/x86/include/asm/string_64.h > > +++ b/arch/x86/include/asm/string_64.h > > @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t > > > > #define __HAVE_ARCH_MEMCPY 1 > > #ifndef CONFIG_KMEMCHECK > > +extern void *__memcpy(void *to, const void *from, size_t len); > > +extern void *__sse_memcpy(void *to, const void *from, size_t len); > > #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4 > > -extern void *memcpy(void *to, const void *from, size_t len); > > +#define memcpy(dst, src, len) \ > > +({ \ > > + size_t __len = (len); \ > > + void *__ret; \ > > + if (__len >= 512) \ > > + __ret = __sse_memcpy((dst), (src), __len); \ > > + else \ > > + __ret = __memcpy((dst), (src), __len); \ > > + __ret; \ > > +}) > > Please, no. Do not inline every memcpy invocation. > This is pure bloat (comsidering how many memcpy calls there are) > and it doesn't even win anything in speed, since there will be > a fucntion call either way. > Put the __len >= 512 check inside your memcpy instead. In the __len < 512 case, this would actually cause two function calls, actually: once the __sse_memcpy and then the __memcpy one. > You may do the check if you know that __len is constant: > if (__builtin_constant_p(__len) && __len >= 512) ... > because in this case gcc will evaluate it at compile-time. That could justify the bloat at least partially. Actually, I had a version which sticks sse_memcpy code into memcpy_64.S and that would save us both the function call and the bloat. I might return to that one if it turns out that SSE memcpy makes sense for the kernel. Thanks. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-14 12:40 ` Borislav Petkov @ 2011-08-15 13:27 ` melwyn lobo 2011-08-15 13:44 ` Denys Vlasenko 1 sibling, 0 replies; 40+ messages in thread From: melwyn lobo @ 2011-08-15 13:27 UTC (permalink / raw) To: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov Hi, Was on a vacation for last two days. Thanks for the good insights into the issue. Ingo, unfortunately the data we have is on a soon to be released platform and strictly confidential at this stage. Boris, thanks for the patch. On seeing your patch: +void *__sse_memcpy(void *to, const void *from, size_t len) +{ + unsigned long src = (unsigned long)from; + unsigned long dst = (unsigned long)to; + void *p = to; + int i; + + if (in_interrupt()) + return __memcpy(to, from, len) So what is the reason we cannot use sse_memcpy in interrupt context. (fpu registers not saved ? ) My question is still not answered. There are 3 versions of memcpy in kernel: ***********************************arch/x86/include/asm/string_32.h****************************** 179 #ifndef CONFIG_KMEMCHECK 180 181 #if (__GNUC__ >= 4) 182 #define memcpy(t, f, n) __builtin_memcpy(t, f, n) 183 #else 184 #define memcpy(t, f, n) \ 185 (__builtin_constant_p((n)) \ 186 ? __constant_memcpy((t), (f), (n)) \ 187 : __memcpy((t), (f), (n))) 188 #endif 189 #else 190 /* 191 * kmemcheck becomes very happy if we use the REP instructions unconditionally, 192 * because it means that we know both memory operands in advance. 193 */ 194 #define memcpy(t, f, n) __memcpy((t), (f), (n)) 195 #endif 196 197 ****************************************************************************************. I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy() ) as this is valid only for AMD and not for Atom Z5xx series. This means __memcpy, __constant_memcpy, __builtin_memcpy . I have a hunch by default we were using __builtin_memcpy. This is because I see my GCC version >=4 and CONFIG_KMEMCHECK not defined. Can someone confirm of these 3 which is used, with i386_defconfig. Again with i386_defconfig which workloads provide the best results with the default implementation. thanks, M. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-14 12:40 ` Borislav Petkov 2011-08-15 13:27 ` melwyn lobo @ 2011-08-15 13:44 ` Denys Vlasenko 1 sibling, 0 replies; 40+ messages in thread From: Denys Vlasenko @ 2011-08-15 13:44 UTC (permalink / raw) To: Borislav Petkov, Denys Vlasenko, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov On Sun, Aug 14, 2011 at 2:40 PM, Borislav Petkov <bp@alien8.de> wrote: >> > + if (__len >= 512) \ >> > + __ret = __sse_memcpy((dst), (src), __len); \ >> > + else \ >> > + __ret = __memcpy((dst), (src), __len); \ >> > + __ret; \ >> > +}) >> >> Please, no. Do not inline every memcpy invocation. >> This is pure bloat (comsidering how many memcpy calls there are) >> and it doesn't even win anything in speed, since there will be >> a fucntion call either way. >> Put the __len >= 512 check inside your memcpy instead. > > In the __len < 512 case, this would actually cause two function calls, > actually: once the __sse_memcpy and then the __memcpy one. You didn't notice the "else". >> You may do the check if you know that __len is constant: >> if (__builtin_constant_p(__len) && __len >= 512) ... >> because in this case gcc will evaluate it at compile-time. > > That could justify the bloat at least partially. There will be no bloat in this case. -- vda ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-14 9:59 ` Borislav Petkov 2011-08-14 11:13 ` Denys Vlasenko @ 2011-08-16 2:34 ` Valdis.Kletnieks 2011-08-16 12:16 ` Borislav Petkov 1 sibling, 1 reply; 40+ messages in thread From: Valdis.Kletnieks @ 2011-08-16 2:34 UTC (permalink / raw) To: Borislav Petkov Cc: Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra, borislav.petkov [-- Attachment #1: Type: text/plain, Size: 1109 bytes --] On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said: > Benchmarking with 10000 iterations, average results: > size XM MM speedup > 119 540.58 449.491 0.8314969419 > 12273 2307.86 4042.88 1.751787902 > 13924 2431.8 4224.48 1.737184756 > 14335 2469.4 4218.82 1.708440514 > 15018 2675.67 1904.07 0.711622886 > 16374 2989.75 5296.26 1.771470902 > 24564 4262.15 7696.86 1.805863077 > 27852 4362.53 3347.72 0.7673805572 > 28672 5122.8 7113.14 1.388524413 > 30033 4874.62 8740.04 1.792967931 The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel really good about this till we understand what happened for those two cases. Also, anytime I see "10000 iterations", I ask myself if the benchmark rigging took proper note of hot/cold cache issues. That *may* explain the two oddball results we see above - but not knowing more about how it was benched, it's hard to say. [-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
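One way a benchmark rig can separate hot- and cold-cache runs is to evict the buffers explicitly between timed iterations. A minimal userspace sketch of that idea (assumed here, not taken from the setup that produced the numbers above):

#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#include <stddef.h>

/* flush every 64-byte cache line of the buffer, then fence, so the next
 * timed memcpy starts from a cold cache */
static void evict_from_cache(const void *buf, size_t len)
{
        const char *p = buf;
        size_t i;

        for (i = 0; i < len; i += 64)
                _mm_clflush(p + i);
        _mm_mfence();
}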
* Re: x86 memcpy performance 2011-08-16 2:34 ` Valdis.Kletnieks @ 2011-08-16 12:16 ` Borislav Petkov 2011-09-01 15:15 ` Maarten Lankhorst 0 siblings, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2011-08-16 12:16 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Borislav Petkov, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra [-- Attachment #1: Type: text/plain, Size: 2448 bytes --] On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote: > On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said: > > > Benchmarking with 10000 iterations, average results: > > size XM MM speedup > > 119 540.58 449.491 0.8314969419 > > > 12273 2307.86 4042.88 1.751787902 > > 13924 2431.8 4224.48 1.737184756 > > 14335 2469.4 4218.82 1.708440514 > > 15018 2675.67 1904.07 0.711622886 > > 16374 2989.75 5296.26 1.771470902 > > 24564 4262.15 7696.86 1.805863077 > > 27852 4362.53 3347.72 0.7673805572 > > 28672 5122.8 7113.14 1.388524413 > > 30033 4874.62 8740.04 1.792967931 > > The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel > really good about this till we understand what happened for those two cases. Yep. > Also, anytime I see "10000 iterations", I ask myself if the benchmark > rigging took proper note of hot/cold cache issues. That *may* explain > the two oddball results we see above - but not knowing more about how > it was benched, it's hard to say. Yeah, the more scrutiny this gets the better. So I've cleaned up my setup and have attached it. xm_mem.c does the benchmarking and in bench_memcpy() there's the sse_memcpy call which is the SSE memcpy implementation using inline asm. It looks like gcc produces pretty crappy code here because if I replace the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the same function but in pure asm - I get much better numbers, sometimes even over 2x. It all depends on the alignment of the buffers though. Also, those numbers don't include the context saving/restoring which the kernel does for us. 7491 1509.89 2346.94 1.554378381 8170 2166.81 2857.78 1.318890326 12277 2659.03 4179.31 1.571744176 13907 2571.24 4125.7 1.604558427 14319 2638.74 5799.67 2.19789466 <---- 14993 2752.42 4413.85 1.603625603 16371 3479.11 5562.65 1.59887055 So please take a look and let me know what you think. Thanks. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach GM: Alberto Bozzo Reg: Dornach, Landkreis Muenchen HRB Nr. 43632 WEEE Registernr: 129 19551 [-- Attachment #2: sse_memcpy.tar.bz2 --] [-- Type: application/octet-stream, Size: 3508 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-08-16 12:16 ` Borislav Petkov @ 2011-09-01 15:15 ` Maarten Lankhorst 2011-09-01 16:18 ` Linus Torvalds 2011-12-05 12:54 ` melwyn lobo 0 siblings, 2 replies; 40+ messages in thread From: Maarten Lankhorst @ 2011-09-01 15:15 UTC (permalink / raw) To: Borislav Petkov Cc: Valdis.Kletnieks, Borislav Petkov, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra [-- Attachment #1: Type: text/plain, Size: 3418 bytes --] Hey, 2011/8/16 Borislav Petkov <bp@amd64.org>: > On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote: >> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said: >> >> > Benchmarking with 10000 iterations, average results: >> > size XM MM speedup >> > 119 540.58 449.491 0.8314969419 >> >> > 12273 2307.86 4042.88 1.751787902 >> > 13924 2431.8 4224.48 1.737184756 >> > 14335 2469.4 4218.82 1.708440514 >> > 15018 2675.67 1904.07 0.711622886 >> > 16374 2989.75 5296.26 1.771470902 >> > 24564 4262.15 7696.86 1.805863077 >> > 27852 4362.53 3347.72 0.7673805572 >> > 28672 5122.8 7113.14 1.388524413 >> > 30033 4874.62 8740.04 1.792967931 >> >> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel >> really good about this till we understand what happened for those two cases. > > Yep. > >> Also, anytime I see "10000 iterations", I ask myself if the benchmark >> rigging took proper note of hot/cold cache issues. That *may* explain >> the two oddball results we see above - but not knowing more about how >> it was benched, it's hard to say. > > Yeah, the more scrutiny this gets the better. So I've cleaned up my > setup and have attached it. > > xm_mem.c does the benchmarking and in bench_memcpy() there's the > sse_memcpy call which is the SSE memcpy implementation using inline asm. > It looks like gcc produces pretty crappy code here because if I replace > the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the > same function but in pure asm - I get much better numbers, sometimes > even over 2x. It all depends on the alignment of the buffers though. > Also, those numbers don't include the context saving/restoring which the > kernel does for us. > > 7491 1509.89 2346.94 1.554378381 > 8170 2166.81 2857.78 1.318890326 > 12277 2659.03 4179.31 1.571744176 > 13907 2571.24 4125.7 1.604558427 > 14319 2638.74 5799.67 2.19789466 <---- > 14993 2752.42 4413.85 1.603625603 > 16371 3479.11 5562.65 1.59887055 This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy, and I finally figured out why. I also extended the test to an optimized avx memcpy, but I think the kernel memcpy will always win in the aligned case. Those numbers you posted aren't right it seems. It depends a lot on the alignment, for example if both are aligned to 64 relative to each other, kernel memcpy will win from avx memcpy on my machine. I replaced the malloc calls with memalign(65536, size + 256) so I could toy around with the alignments a little. This explains why for some sizes, kernel memcpy was faster than sse memcpy in the test results you had. When (src & 63 == dst & 63), it seems that kernel memcpy always wins, otherwise avx memcpy might. If you want to speed up memcpy, I think your best bet is to find out why it's so much slower when src and dst aren't 64-byte aligned compared to each other. Cheers, Maarten --- Attached: my modified version of the sse memcpy you posted. 
I changed it a bit, and used avx, but some of the other changes might be better for your sse memcpy too. [-- Attachment #2: ym_memcpy.txt --] [-- Type: text/plain, Size: 2668 bytes --] /* * ym_memcpy - AVX version of memcpy * * Input: * rdi destination * rsi source * rdx count * * Output: * rax original destination */ .globl ym_memcpy .type ym_memcpy, @function ym_memcpy: mov %rdi, %rax /* Target align */ movzbq %dil, %rcx negb %cl andb $0x1f, %cl subq %rcx, %rdx rep movsb movq %rdx, %rcx andq $0x1ff, %rdx shrq $9, %rcx jz .trailer movb %sil, %r8b andb $0x1f, %r8b test %r8b, %r8b jz .repeat_a .align 32 .repeat_ua: vmovups 0x0(%rsi), %ymm0 vmovups 0x20(%rsi), %ymm1 vmovups 0x40(%rsi), %ymm2 vmovups 0x60(%rsi), %ymm3 vmovups 0x80(%rsi), %ymm4 vmovups 0xa0(%rsi), %ymm5 vmovups 0xc0(%rsi), %ymm6 vmovups 0xe0(%rsi), %ymm7 vmovups 0x100(%rsi), %ymm8 vmovups 0x120(%rsi), %ymm9 vmovups 0x140(%rsi), %ymm10 vmovups 0x160(%rsi), %ymm11 vmovups 0x180(%rsi), %ymm12 vmovups 0x1a0(%rsi), %ymm13 vmovups 0x1c0(%rsi), %ymm14 vmovups 0x1e0(%rsi), %ymm15 vmovaps %ymm0, 0x0(%rdi) vmovaps %ymm1, 0x20(%rdi) vmovaps %ymm2, 0x40(%rdi) vmovaps %ymm3, 0x60(%rdi) vmovaps %ymm4, 0x80(%rdi) vmovaps %ymm5, 0xa0(%rdi) vmovaps %ymm6, 0xc0(%rdi) vmovaps %ymm7, 0xe0(%rdi) vmovaps %ymm8, 0x100(%rdi) vmovaps %ymm9, 0x120(%rdi) vmovaps %ymm10, 0x140(%rdi) vmovaps %ymm11, 0x160(%rdi) vmovaps %ymm12, 0x180(%rdi) vmovaps %ymm13, 0x1a0(%rdi) vmovaps %ymm14, 0x1c0(%rdi) vmovaps %ymm15, 0x1e0(%rdi) /* advance pointers */ addq $0x200, %rsi addq $0x200, %rdi subq $1, %rcx jnz .repeat_ua jz .trailer .align 32 .repeat_a: prefetchnta 0x80(%rsi) prefetchnta 0x100(%rsi) prefetchnta 0x180(%rsi) vmovaps 0x0(%rsi), %ymm0 vmovaps 0x20(%rsi), %ymm1 vmovaps 0x40(%rsi), %ymm2 vmovaps 0x60(%rsi), %ymm3 vmovaps 0x80(%rsi), %ymm4 vmovaps 0xa0(%rsi), %ymm5 vmovaps 0xc0(%rsi), %ymm6 vmovaps 0xe0(%rsi), %ymm7 vmovaps 0x100(%rsi), %ymm8 vmovaps 0x120(%rsi), %ymm9 vmovaps 0x140(%rsi), %ymm10 vmovaps 0x160(%rsi), %ymm11 vmovaps 0x180(%rsi), %ymm12 vmovaps 0x1a0(%rsi), %ymm13 vmovaps 0x1c0(%rsi), %ymm14 vmovaps 0x1e0(%rsi), %ymm15 vmovaps %ymm0, 0x0(%rdi) vmovaps %ymm1, 0x20(%rdi) vmovaps %ymm2, 0x40(%rdi) vmovaps %ymm3, 0x60(%rdi) vmovaps %ymm4, 0x80(%rdi) vmovaps %ymm5, 0xa0(%rdi) vmovaps %ymm6, 0xc0(%rdi) vmovaps %ymm7, 0xe0(%rdi) vmovaps %ymm8, 0x100(%rdi) vmovaps %ymm9, 0x120(%rdi) vmovaps %ymm10, 0x140(%rdi) vmovaps %ymm11, 0x160(%rdi) vmovaps %ymm12, 0x180(%rdi) vmovaps %ymm13, 0x1a0(%rdi) vmovaps %ymm14, 0x1c0(%rdi) vmovaps %ymm15, 0x1e0(%rdi) /* advance pointers */ addq $0x200, %rsi addq $0x200, %rdi subq $1, %rcx jnz .repeat_a .align 32 .trailer: movq %rdx, %rcx shrq $3, %rcx rep; movsq movq %rdx, %rcx andq $0x7, %rcx rep; movsb retq ^ permalink raw reply [flat|nested] 40+ messages in thread
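A rough sketch of the buffer setup described above, with the details assumed rather than copied from the attached file: over-align both allocations, then add small offsets so the source and destination misalignments can be swept independently.

#include <malloc.h>     /* memalign */
#include <string.h>
#include <stdlib.h>

/* copy 'size' bytes with chosen misalignments; the real rig times this
 * call and repeats it for every (src_off, dst_off) pair of interest */
static void copy_with_offsets(size_t size, unsigned src_off, unsigned dst_off)
{
        unsigned char *src = memalign(65536, size + 256);
        unsigned char *dst = memalign(65536, size + 256);

        memset(src, 0x5a, size + 256);
        memcpy(dst + dst_off, src + src_off, size); /* offsets stay below 256 */

        free(src);
        free(dst);
}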
* Re: x86 memcpy performance 2011-09-01 15:15 ` Maarten Lankhorst @ 2011-09-01 16:18 ` Linus Torvalds 2011-09-08 8:35 ` Borislav Petkov 2011-12-05 12:54 ` melwyn lobo 1 sibling, 1 reply; 40+ messages in thread From: Linus Torvalds @ 2011-09-01 16:18 UTC (permalink / raw) To: Maarten Lankhorst Cc: Borislav Petkov, Valdis.Kletnieks, Borislav Petkov, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst <m.b.lankhorst@gmail.com> wrote: > > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy, > and I finally figured out why. I also extended the test to an optimized avx memcpy, > but I think the kernel memcpy will always win in the aligned case. "rep movs" is generally optimized in microcode on most modern Intel CPU's for some easyish cases, and it will outperform just about anything. Atom is a notable exception, but if you expect performance on any general loads from Atom, you need to get your head examined. Atom is a disaster for anything but tuned loops. The "easyish cases" depend on microarchitecture. They are improving, so long-term "rep movs" is the best way regardless, but for most current ones it's something like "source aligned to 8 bytes *and* source and destination are equal "mod 64"". And that's true in a lot of common situations. It's true for the page copy, for example, and it's often true for big user "read()/write()" calls (but "often" may not be "often enough" - high-performance userland should strive to align read/write buffers to 64 bytes, for example). Many other cases of "memcpy()" are the fairly small, constant-sized ones, where the optimal strategy tends to be "move words by hand". Linus ^ permalink raw reply [flat|nested] 40+ messages in thread
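On the userland side of that remark, a trivial sketch of a 64-byte-aligned read() buffer; posix_memalign() is the standard way to get one, and fd is assumed to be an already-open descriptor:

#include <stdlib.h>
#include <unistd.h>

static ssize_t aligned_read(int fd, size_t len)
{
        void *buf;
        ssize_t n = -1;

        /* 64-byte alignment so the kernel-side copy is more likely to hit
         * the "rep movs" fast-path conditions described above */
        if (posix_memalign(&buf, 64, len) == 0) {
                n = read(fd, buf, len);
                free(buf);
        }
        return n;
}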
* Re: x86 memcpy performance 2011-09-01 16:18 ` Linus Torvalds @ 2011-09-08 8:35 ` Borislav Petkov 2011-09-08 10:58 ` Maarten Lankhorst 0 siblings, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2011-09-08 8:35 UTC (permalink / raw) To: Linus Torvalds Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote: > On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst > <m.b.lankhorst@gmail.com> wrote: > > > > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy, > > and I finally figured out why. I also extended the test to an optimized avx memcpy, > > but I think the kernel memcpy will always win in the aligned case. > > "rep movs" is generally optimized in microcode on most modern Intel > CPU's for some easyish cases, and it will outperform just about > anything. > > Atom is a notable exception, but if you expect performance on any > general loads from Atom, you need to get your head examined. Atom is a > disaster for anything but tuned loops. > > The "easyish cases" depend on microarchitecture. They are improving, > so long-term "rep movs" is the best way regardless, but for most > current ones it's something like "source aligned to 8 bytes *and* > source and destination are equal "mod 64"". > > And that's true in a lot of common situations. It's true for the page > copy, for example, and it's often true for big user "read()/write()" > calls (but "often" may not be "often enough" - high-performance > userland should strive to align read/write buffers to 64 bytes, for > example). > > Many other cases of "memcpy()" are the fairly small, constant-sized > ones, where the optimal strategy tends to be "move words by hand". Yeah, this probably makes enabling SSE memcpy in the kernel a task with diminishing returns. There are also the additional costs of saving/restoring FPU context in the kernel which eat off from any SSE speedup. And then there's the additional I$ pressure because "rep movs" is much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the smallest (two-byte) instructions I could use - in the AVX case they can get up to 4 Bytes of length with the VEX prefix and the additional SIB, size override, etc. fields. Oh, and then there's copy_*_user which also does fault handling and replacing that with a SSE version of memcpy could get quite hairy quite fast. Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel when I get the time to see whether it still makes sense, at all. Thanks. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-09-08 8:35 ` Borislav Petkov @ 2011-09-08 10:58 ` Maarten Lankhorst 2011-09-09 8:14 ` Borislav Petkov 0 siblings, 1 reply; 40+ messages in thread From: Maarten Lankhorst @ 2011-09-08 10:58 UTC (permalink / raw) To: Borislav Petkov, Linus Torvalds, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra [-- Attachment #1: Type: text/plain, Size: 3330 bytes --] On 09/08/2011 10:35 AM, Borislav Petkov wrote: > On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote: >> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst >> <m.b.lankhorst@gmail.com> wrote: >>> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy, >>> and I finally figured out why. I also extended the test to an optimized avx memcpy, >>> but I think the kernel memcpy will always win in the aligned case. >> "rep movs" is generally optimized in microcode on most modern Intel >> CPU's for some easyish cases, and it will outperform just about >> anything. >> >> Atom is a notable exception, but if you expect performance on any >> general loads from Atom, you need to get your head examined. Atom is a >> disaster for anything but tuned loops. >> >> The "easyish cases" depend on microarchitecture. They are improving, >> so long-term "rep movs" is the best way regardless, but for most >> current ones it's something like "source aligned to 8 bytes *and* >> source and destination are equal "mod 64"". >> >> And that's true in a lot of common situations. It's true for the page >> copy, for example, and it's often true for big user "read()/write()" >> calls (but "often" may not be "often enough" - high-performance >> userland should strive to align read/write buffers to 64 bytes, for >> example). >> >> Many other cases of "memcpy()" are the fairly small, constant-sized >> ones, where the optimal strategy tends to be "move words by hand". > Yeah, > > this probably makes enabling SSE memcpy in the kernel a task > with diminishing returns. There are also the additional costs of > saving/restoring FPU context in the kernel which eat off from any SSE > speedup. > > And then there's the additional I$ pressure because "rep movs" is > much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the > smallest (two-byte) instructions I could use - in the AVX case they can > get up to 4 Bytes of length with the VEX prefix and the additional SIB, > size override, etc. fields. > > Oh, and then there's copy_*_user which also does fault handling and > replacing that with a SSE version of memcpy could get quite hairy quite > fast. > > Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel > when I get the time to see whether it still makes sense, at all. > I have changed your sse memcpy to test various alignments with source/destination offsets instead of random, from that you can see that you don't really get a speedup at all. It seems to be more a case of 'kernel memcpy is significantly slower with some alignments', than 'avx memcpy is just that much faster'. 
For example 3754 with src misalignment 4 and target misalignment 20 takes 1185 units on avx memcpy, but 1480 units with kernel memcpy The modified testcase is attached, I did some optimizations in avx memcpy, but I fear I may be missing something, when I tried to put it in the kernel, it complained about sata errors I never had before, so I immediately went for the power button to prevent more errors, fortunately it only corrupted some kernel object files, and btrfs threw checksum errors. :) All in all I think testing in userspace is safer, you might want to run it on an idle cpu with schedtool, with a high fifo priority, and set cpufreq governor to performance. ~Maarten [-- Attachment #2: memcpy.tar.gz --] [-- Type: application/x-gzip, Size: 4352 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-09-08 10:58 ` Maarten Lankhorst @ 2011-09-09 8:14 ` Borislav Petkov 2011-09-09 10:12 ` Maarten Lankhorst 2011-09-09 14:39 ` Linus Torvalds 0 siblings, 2 replies; 40+ messages in thread From: Borislav Petkov @ 2011-09-09 8:14 UTC (permalink / raw) To: Maarten Lankhorst Cc: Linus Torvalds, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote: > I have changed your sse memcpy to test various alignments with > source/destination offsets instead of random, from that you can > see that you don't really get a speedup at all. It seems to be more > a case of 'kernel memcpy is significantly slower with some alignments', > than 'avx memcpy is just that much faster'. > > For example 3754 with src misalignment 4 and target misalignment 20 > takes 1185 units on avx memcpy, but 1480 units with kernel memcpy Right, so the idea is to check whether with the bigger buffer sizes (and misaligned, although this should not be that often the case in the kernel) the SSE version would outperform a "rep movs" with ucode optimizations not kicking in. With your version modified back to SSE memcpy (don't have an AVX box right now) I get on an AMD F10h: ... 16384(12/40) 4756.24 7867.74 1.654192552 16384(40/12) 5067.81 6068.71 1.197500008 16384(12/44) 4341.3 8474.96 1.952172387 16384(44/12) 4277.13 7107.64 1.661777347 16384(12/48) 4989.16 7964.54 1.596369011 16384(48/12) 4644.94 6499.5 1.399264281 ... which looks like pretty nice numbers to me. I can't say whether there ever is 16K buffer we copy in the kernel but if there were... But <16K buffers also show up to 1.5x speedup. So I'd say it's a uarch thing. As I said, best it would be to put it in the kernel and run a bunch of benchmarks... > The modified testcase is attached, I did some optimizations in avx > memcpy, but I fear I may be missing something, when I tried to put it > in the kernel, it complained about sata errors I never had before, > so I immediately went for the power button to prevent more errors, > fortunately it only corrupted some kernel object files, and btrfs > threw checksum errors. :) Well, your version should do something similar to what _mmx_memcpy does: save FPU state and not execute in IRQ context. > All in all I think testing in userspace is safer, you might want to > run it on an idle cpu with schedtool, with a high fifo priority, and > set cpufreq governor to performance. No, you need a generic system with default settings - otherwise it is blatant benchmark lying :-) -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
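As a sketch only, that wrapping could look roughly like this, modelled loosely on the mmx_memcpy() approach: kernel_fpu_begin()/kernel_fpu_end() and irq_fpu_usable() are existing kernel helpers, while __sse_memcpy_body() is a made-up name standing in for the actual SSE copy loop.

void *sse_memcpy(void *to, const void *from, size_t len)
{
        /* short copies and contexts where the FPU can't be used safely
         * fall back to the plain kernel memcpy */
        if (len < 512 || !irq_fpu_usable())
                return __memcpy(to, from, len);

        kernel_fpu_begin();     /* saves FPU state, disables preemption */
        __sse_memcpy_body(to, from, len);
        kernel_fpu_end();

        return to;
}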
* Re: x86 memcpy performance 2011-09-09 8:14 ` Borislav Petkov @ 2011-09-09 10:12 ` Maarten Lankhorst 2011-09-09 11:23 ` Maarten Lankhorst 2011-09-09 14:39 ` Linus Torvalds 1 sibling, 1 reply; 40+ messages in thread From: Maarten Lankhorst @ 2011-09-09 10:12 UTC (permalink / raw) To: Borislav Petkov, Linus Torvalds, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra Hey, On 09/09/2011 10:14 AM, Borislav Petkov wrote: > On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote: >> I have changed your sse memcpy to test various alignments with >> source/destination offsets instead of random, from that you can >> see that you don't really get a speedup at all. It seems to be more >> a case of 'kernel memcpy is significantly slower with some alignments', >> than 'avx memcpy is just that much faster'. >> >> For example 3754 with src misalignment 4 and target misalignment 20 >> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy > Right, so the idea is to check whether with the bigger buffer sizes > (and misaligned, although this should not be that often the case in > the kernel) the SSE version would outperform a "rep movs" with ucode > optimizations not kicking in. > > With your version modified back to SSE memcpy (don't have an AVX box > right now) I get on an AMD F10h: > > ... > 16384(12/40) 4756.24 7867.74 1.654192552 > 16384(40/12) 5067.81 6068.71 1.197500008 > 16384(12/44) 4341.3 8474.96 1.952172387 > 16384(44/12) 4277.13 7107.64 1.661777347 > 16384(12/48) 4989.16 7964.54 1.596369011 > 16384(48/12) 4644.94 6499.5 1.399264281 > ... > > which looks like pretty nice numbers to me. I can't say whether there > ever is 16K buffer we copy in the kernel but if there were... But <16K > buffers also show up to 1.5x speedup. So I'd say it's a uarch thing. > As I said, best it would be to put it in the kernel and run a bunch of > benchmarks... I think for bigger memcpy's it might make sense to demand stricter alignment. What are your numbers for (0/0) ? In my case it seems that kernel memcpy is always faster for that. In fact, it seems src&63 == dst&63 is generally faster with kernel memcpy. 
Patching my tree to WARN_ON_ONCE for when this condition isn't true, I get the following warnings: WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d() WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d() WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72() WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550() WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301() WARNING: at mm/util.c:72 kmemdup+0x75/0x80() WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0() WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0() WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240() WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750() WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0() WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30() WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0() WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360() WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0() WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0() WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160() WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270() The most persistent one appears to be the btrfs' *_extent_buffer, it gets the most warnings on my system. Apart from that on my system there's not much to gain, since the alignment is already close to optimal. My ext4 /home doesn't throw warnings, so I'd gain the most by figuring out if I could improve btrfs/extent_io.c in some way. The patch for triggering those warnings is below, change to WARN_ON if you want to see which one happens the most for you. I was pleasantly surprised though. >> The modified testcase is attached, I did some optimizations in avx >> memcpy, but I fear I may be missing something, when I tried to put it >> in the kernel, it complained about sata errors I never had before, >> so I immediately went for the power button to prevent more errors, >> fortunately it only corrupted some kernel object files, and btrfs >> threw checksum errors. :) > Well, your version should do something similar to what _mmx_memcpy does: > save FPU state and not execute in IRQ context. > >> All in all I think testing in userspace is safer, you might want to >> run it on an idle cpu with schedtool, with a high fifo priority, and >> set cpufreq governor to performance. > No, you need a generic system with default settings - otherwise it is > blatant benchmark lying :-) diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index 19e2c46..77180bb 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -30,6 +30,14 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t #ifndef CONFIG_KMEMCHECK #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4 extern void *memcpy(void *to, const void *from, size_t len); +#define memcpy(dst, src, len) \ +({ \ + size_t __len = (len); \ + const void *__src = (src); \ + void *__dst = (dst); \ + WARN_ON_ONCE(__len > 1024 && (((long)__src & 63) != ((long)__dst & 63))); \ + memcpy(__dst, __src, __len); \ +}) #else extern void *__memcpy(void *to, const void *from, size_t len); #define memcpy(dst, src, len) \ ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-09-09 10:12 ` Maarten Lankhorst @ 2011-09-09 11:23 ` Maarten Lankhorst 2011-09-09 13:42 ` Borislav Petkov 0 siblings, 1 reply; 40+ messages in thread From: Maarten Lankhorst @ 2011-09-09 11:23 UTC (permalink / raw) To: Borislav Petkov, Linus Torvalds, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra Hey just a followup on btrfs, On 09/09/2011 12:12 PM, Maarten Lankhorst wrote: > Hey, > > On 09/09/2011 10:14 AM, Borislav Petkov wrote: >> On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote: >>> I have changed your sse memcpy to test various alignments with >>> source/destination offsets instead of random, from that you can >>> see that you don't really get a speedup at all. It seems to be more >>> a case of 'kernel memcpy is significantly slower with some alignments', >>> than 'avx memcpy is just that much faster'. >>> >>> For example 3754 with src misalignment 4 and target misalignment 20 >>> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy >> Right, so the idea is to check whether with the bigger buffer sizes >> (and misaligned, although this should not be that often the case in >> the kernel) the SSE version would outperform a "rep movs" with ucode >> optimizations not kicking in. >> >> With your version modified back to SSE memcpy (don't have an AVX box >> right now) I get on an AMD F10h: >> >> ... >> 16384(12/40) 4756.24 7867.74 1.654192552 >> 16384(40/12) 5067.81 6068.71 1.197500008 >> 16384(12/44) 4341.3 8474.96 1.952172387 >> 16384(44/12) 4277.13 7107.64 1.661777347 >> 16384(12/48) 4989.16 7964.54 1.596369011 >> 16384(48/12) 4644.94 6499.5 1.399264281 >> ... >> >> which looks like pretty nice numbers to me. I can't say whether there >> ever is 16K buffer we copy in the kernel but if there were... But <16K >> buffers also show up to 1.5x speedup. So I'd say it's a uarch thing. >> As I said, best it would be to put it in the kernel and run a bunch of >> benchmarks... > I think for bigger memcpy's it might make sense to demand stricter > alignment. What are your numbers for (0/0) ? In my case it seems > that kernel memcpy is always faster for that. In fact, it seems > src&63 == dst&63 is generally faster with kernel memcpy. 
> > Patching my tree to WARN_ON_ONCE for when this condition isn't true, I get the following warnings: > > WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d() > WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d() > WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72() > WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550() > WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301() > WARNING: at mm/util.c:72 kmemdup+0x75/0x80() > WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0() > WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0() > WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240() > WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750() > WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0() > WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30() > WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0() > WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360() > WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0() > WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0() > WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160() > WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270() > > The most persistent one appears to be the btrfs' *_extent_buffer, > it gets the most warnings on my system. Apart from that on my > system there's not much to gain, since the alignment is already > close to optimal. > > My ext4 /home doesn't throw warnings, so I'd gain the most > by figuring out if I could improve btrfs/extent_io.c in some way. > The patch for triggering those warnings is below, change to WARN_ON > if you want to see which one happens the most for you. > > I was pleasantly surprised though. The btrfs one which happens far more often than all others is read_extent_buffer, but most of them are page aligned on destination. This means that for me, avx memcpy might be 10% slower or 10% faster, depending on the specific source alignment, so avx memcpy wouldn't help much. This specific one happened far more than any of the other memcpy usages, and ignoring the check when destination is page aligned, most of them are gone. In short: I don't think I can get a speedup by using avx memcpy in-kernel. YMMV, if it does speed up for you, I'd love to see concrete numbers. And not only worst case, but for the common aligned cases too. Or some concrete numbers that misaligned happens a lot for you. ~Maarten ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-09-09 11:23 ` Maarten Lankhorst @ 2011-09-09 13:42 ` Borislav Petkov 0 siblings, 0 replies; 40+ messages in thread From: Borislav Petkov @ 2011-09-09 13:42 UTC (permalink / raw) To: Maarten Lankhorst Cc: Linus Torvalds, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra [-- Attachment #1: Type: text/plain, Size: 2343 bytes --] On Fri, Sep 09, 2011 at 01:23:09PM +0200, Maarten Lankhorst wrote: > This specific one happened far more than any of the other memcpy usages, and > ignoring the check when destination is page aligned, most of them are gone. > > In short: I don't think I can get a speedup by using avx memcpy in-kernel. > > YMMV, if it does speed up for you, I'd love to see concrete numbers. And not only worst > case, but for the common aligned cases too. Or some concrete numbers that misaligned > happens a lot for you. Actually, assuming alignment matters, I'd need to redo the trace_printk run I did initially on buffer sizes: http://marc.info/?l=linux-kernel&m=131331602309340 (kernel_build.sizes attached) to get a more sensible grasp on the alignment of kernel buffers along with their sizes and to see whether we're doing a lot of unaligned large buffer copies in the kernel. I seriously doubt that, though, we should be doing everything pagewise anyway so... Concerning numbers, I ran your version again and sorted the output by speedup. The highest scores are: 30037(12/44) 5566.4 12797.2 2.299011642 28672(12/44) 5512.97 12588.7 2.283467991 30037(28/60) 5610.34 12732.7 2.269502799 27852(12/44) 5398.36 12242.4 2.267803859 30037(4/36) 5585.02 12598.6 2.25578257 28672(28/60) 5499.11 12317.5 2.239914033 27852(28/60) 5349.78 11918.9 2.227919527 27852(20/52) 5335.92 11750.7 2.202186795 24576(12/44) 4991.37 10987.2 2.201247446 and this is pretty cool. Here are the (0/0) cases: 8192(0/0) 2627.82 3038.43 1.156255766 12288(0/0) 3116.62 3675.98 1.179475031 13926(0/0) 3330.04 4077.08 1.224334839 14336(0/0) 3377.95 4067.24 1.204055286 15018(0/0) 3465.3 4215.3 1.216430725 16384(0/0) 3623.33 4442.38 1.226050715 24576(0/0) 4629.53 6021.81 1.300737559 27852(0/0) 5026.69 6619.26 1.316823133 28672(0/0) 5157.73 6831.39 1.324495749 30037(0/0) 5322.01 6978.36 1.3112261 It is not 2x anymore but still. Anyway, looking at the buffer sizes, they're rather ridiculous and even if we get them in some workload, they won't repeat n times per second to be relevant. So we'll see... Thanks. -- Regards/Gruss, Boris. 
[-- Attachment #2: kernel_build.sizes --] [-- Type: text/plain, Size: 925 bytes --] Bytes Count ===== ===== 0 5447 1 3850 2 16255 3 11113 4 68870 5 4256 6 30433 7 19188 8 50490 9 5999 10 78275 11 5628 12 6870 13 7371 14 4742 15 4911 16 143835 17 14096 18 1573 19 13603 20 424321 21 741 22 584 23 450 24 472 25 685 26 367 27 365 28 333 29 301 30 300 31 269 32 489 33 272 34 266 35 220 36 239 37 209 38 249 39 235 40 207 41 181 42 150 43 98 44 194 45 66 46 62 47 52 48 67226 49 138 50 171 51 26 52 20 53 12 54 15 55 4 56 13 57 8 58 6 59 6 60 115 61 10 62 5 63 12 64 67353 65 6 66 2363 67 9 68 11 69 6 70 5 71 6 72 10 73 4 74 9 75 8 76 4 77 6 78 3 79 4 80 3 81 4 82 4 83 4 84 4 85 8 86 6 87 2 88 3 89 2 90 2 91 1 92 9 93 1 94 2 96 2 97 2 98 3 100 2 102 1 104 1 105 1 106 1 107 2 109 1 110 1 111 1 112 1 113 2 115 2 117 1 118 1 119 1 120 14 127 1 128 1 130 1 131 2 134 2 137 1 144 100092 149 1 151 1 153 1 158 1 185 1 217 4 224 3 225 3 227 3 244 1 254 5 255 13 256 21708 512 21746 848 12907 1920 36536 2048 21708 ^ permalink raw reply [flat|nested] 40+ messages in thread
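For reference, a size histogram like the attached one can be collected with instrumentation along these lines; this is a sketch of the idea, not the actual patch used for the trace, and __inline_memcpy() is the fallback visible in the string_64.h snippet quoted earlier in the thread:

#define memcpy(dst, src, len)                                   \
({                                                              \
        size_t __len = (len);                                   \
        trace_printk("memcpy len=%zu\n", __len);                \
        __inline_memcpy((dst), (src), __len);                   \
})

The per-size counts can then be pulled out of /sys/kernel/debug/tracing/trace afterwards, for example with sort and uniq -c.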
* Re: x86 memcpy performance 2011-09-09 8:14 ` Borislav Petkov 2011-09-09 10:12 ` Maarten Lankhorst @ 2011-09-09 14:39 ` Linus Torvalds 2011-09-09 15:35 ` Borislav Petkov 1 sibling, 1 reply; 40+ messages in thread From: Linus Torvalds @ 2011-09-09 14:39 UTC (permalink / raw) To: Borislav Petkov, Maarten Lankhorst, Linus Torvalds, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Sep 9, 2011 at 1:14 AM, Borislav Petkov <bp@alien8.de> wrote: > > which looks like pretty nice numbers to me. I can't say whether there > ever is 16K buffer we copy in the kernel but if there were... Kernel memcpy's are basically almost always smaller than a page size, because that tends to be the fundamental allocation size. Yes, there are exceptions that copy into big vmalloc'ed buffers, but they don't tend to matter. Things like module loading etc. Linus ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-09-09 14:39 ` Linus Torvalds @ 2011-09-09 15:35 ` Borislav Petkov 2011-12-05 12:20 ` melwyn lobo 0 siblings, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2011-09-09 15:35 UTC (permalink / raw) To: Linus Torvalds Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, melwyn lobo, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra On Fri, Sep 09, 2011 at 07:39:18AM -0700, Linus Torvalds wrote: > Kernel memcpy's are basically almost always smaller than a page size, > because that tends to be the fundamental allocation size. Yeah, this is what my trace of a kernel build showed too: Bytes Count ===== ===== ... 224 3 225 3 227 3 244 1 254 5 255 13 256 21708 512 21746 848 12907 1920 36536 2048 21708 OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for example when shuffling network buffers to/from userspace. Converting those to SSE memcpy might not be as easy as memcpy itself, though. > Yes, there are exceptions that copy into big vmalloc'ed buffers, but > they don't tend to matter. Things like module loading etc. Too small a number of repetitions to matter, yes. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-09-09 15:35 ` Borislav Petkov @ 2011-12-05 12:20 ` melwyn lobo 0 siblings, 0 replies; 40+ messages in thread From: melwyn lobo @ 2011-12-05 12:20 UTC (permalink / raw) To: Borislav Petkov Cc: Linus Torvalds, Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Peter Zijlstra The driver has a loop of memcpy calls whose source and destination addresses are based on a runtime-computed value, which confuses the compiler about the alignment. So instead of generating a neat 32-bit memcpy, gcc generates "rep movsb". Example code snippet: src = (char *)kmap(bo->pages[idx]); src += offset; memcpy(des, src, len); By using SSSE3 only for memcpy of lengths larger than 1K bytes (for my driver the typical length is 2k of metadata from SRAM to DDR), I think the overhead of FPU save and restore can be forgiven. Will SSSE3 work for unaligned pointers as well? If it doesn't, I have been lucky for the past 6 months :) On Fri, Sep 9, 2011 at 9:05 PM, Borislav Petkov <bp@alien8.de> wrote: > On Fri, Sep 09, 2011 at 07:39:18AM -0700, Linus Torvalds wrote: >> Kernel memcpy's are basically almost always smaller than a page size, >> because that tends to be the fundamental allocation size. > > Yeah, this is what my trace of a kernel build showed too: > > Bytes Count > ===== ===== > > ... > > 224 3 > 225 3 > 227 3 > 244 1 > 254 5 > 255 13 > 256 21708 > 512 21746 > 848 12907 > 1920 36536 > 2048 21708 > > OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for > example when shuffling network buffers to/from userspace. Converting > those to SSE memcpy might not be as easy as memcpy itself, though. > >> Yes, there are exceptions that copy into big vmalloc'ed buffers, but >> they don't tend to matter. Things like module loading etc. > > Too small a number of repetitions to matter, yes. > > -- > Regards/Gruss, > Boris. > ^ permalink raw reply [flat|nested] 40+ messages in thread
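A sketch of the length-based dispatch being described, reusing the snippet above (bo, idx, offset, des and len are the driver's own variables; ssse3_memcpy() is a hypothetical helper standing in for the large-copy routine, and the kunmap() is added only for completeness):

        src = (char *)kmap(bo->pages[idx]);
        src += offset;
        if (len >= 1024)
                ssse3_memcpy(des, src, len);    /* 2k SRAM -> DDR metadata copies */
        else
                memcpy(des, src, len);          /* small copies keep the default memcpy */
        kunmap(bo->pages[idx]);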
* Re: x86 memcpy performance 2011-09-01 15:15 ` Maarten Lankhorst 2011-09-01 16:18 ` Linus Torvalds @ 2011-12-05 12:54 ` melwyn lobo 2011-12-05 14:36 ` Alan Cox 1 sibling, 1 reply; 40+ messages in thread From: melwyn lobo @ 2011-12-05 12:54 UTC (permalink / raw) To: Maarten Lankhorst Cc: Borislav Petkov, Valdis.Kletnieks, Borislav Petkov, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra Will AVX work on Intel Atom? I guess not. Isn't this then the time for having architecture-dependent definitions for basic CPU-intensive tasks? On Thu, Sep 1, 2011 at 8:45 PM, Maarten Lankhorst <m.b.lankhorst@gmail.com> wrote: > Hey, > > 2011/8/16 Borislav Petkov <bp@amd64.org>: >> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@vt.edu wrote: >>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said: >>> >>> > Benchmarking with 10000 iterations, average results: >>> > size XM MM speedup >>> > 119 540.58 449.491 0.8314969419 >>> >>> > 12273 2307.86 4042.88 1.751787902 >>> > 13924 2431.8 4224.48 1.737184756 >>> > 14335 2469.4 4218.82 1.708440514 >>> > 15018 2675.67 1904.07 0.711622886 >>> > 16374 2989.75 5296.26 1.771470902 >>> > 24564 4262.15 7696.86 1.805863077 >>> > 27852 4362.53 3347.72 0.7673805572 >>> > 28672 5122.8 7113.14 1.388524413 >>> > 30033 4874.62 8740.04 1.792967931 >>> >>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel >>> really good about this till we understand what happened for those two cases. >> >> Yep. >> >>> Also, anytime I see "10000 iterations", I ask myself if the benchmark >>> rigging took proper note of hot/cold cache issues. That *may* explain >>> the two oddball results we see above - but not knowing more about how >>> it was benched, it's hard to say. >> >> Yeah, the more scrutiny this gets the better. So I've cleaned up my >> setup and have attached it. >> >> xm_mem.c does the benchmarking and in bench_memcpy() there's the >> sse_memcpy call which is the SSE memcpy implementation using inline asm. >> It looks like gcc produces pretty crappy code here because if I replace >> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the >> same function but in pure asm - I get much better numbers, sometimes >> even over 2x. It all depends on the alignment of the buffers though. >> Also, those numbers don't include the context saving/restoring which the >> kernel does for us. >> >> 7491 1509.89 2346.94 1.554378381 >> 8170 2166.81 2857.78 1.318890326 >> 12277 2659.03 4179.31 1.571744176 >> 13907 2571.24 4125.7 1.604558427 >> 14319 2638.74 5799.67 2.19789466 <---- >> 14993 2752.42 4413.85 1.603625603 >> 16371 3479.11 5562.65 1.59887055 > > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy, > and I finally figured out why. I also extended the test to an optimized avx memcpy, > but I think the kernel memcpy will always win in the aligned case. > > Those numbers you posted aren't right it seems. It depends a lot on the alignment, > for example if both are aligned to 64 relative to each other, > kernel memcpy will win from avx memcpy on my machine. > > I replaced the malloc calls with memalign(65536, size + 256) so I could toy > around with the alignments a little. This explains why for some sizes, kernel > memcpy was faster than sse memcpy in the test results you had. > When (src & 63 == dst & 63), it seems that kernel memcpy always wins, otherwise > avx memcpy might. 
> > If you want to speed up memcpy, I think your best bet is to find out why it's > so much slower when src and dst aren't 64-byte aligned compared to each other. > > Cheers, > Maarten > > --- > Attached: my modified version of the sse memcpy you posted. > > I changed it a bit, and used avx, but some of the other changes might > be better for your sse memcpy too. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: x86 memcpy performance 2011-12-05 12:54 ` melwyn lobo @ 2011-12-05 14:36 ` Alan Cox 0 siblings, 0 replies; 40+ messages in thread From: Alan Cox @ 2011-12-05 14:36 UTC (permalink / raw) To: melwyn lobo Cc: Maarten Lankhorst, Borislav Petkov, Valdis.Kletnieks, Borislav Petkov, Ingo Molnar, linux-kernel, H. Peter Anvin, Thomas Gleixner, Linus Torvalds, Peter Zijlstra > Will AVX work on Intel Atom? I guess not. Isn't this then the time > for having architecture-dependent definitions for basic CPU-intensive tasks? It's pretty much a necessity if you want to fine-tune some of this. > > If you want to speed up memcpy, I think your best bet is to find out why it's > > so much slower when src and dst aren't 64-byte aligned compared to each other. rep mov on most x86 processors is an extremely optimised path. The 64-byte alignment behaviour is to be expected given the processor cache line size. Alan ^ permalink raw reply [flat|nested] 40+ messages in thread