* Question Regarding ERMS memcpy @ 2017-03-04 20:08 Logan Gunthorpe 2017-03-04 22:43 ` Borislav Petkov 0 siblings, 1 reply; 22+ messages in thread From: Logan Gunthorpe @ 2017-03-04 20:08 UTC (permalink / raw) To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck, Borislav Petkov, Al Viro Cc: the arch/x86 maintainers, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 1298 bytes --] Hi, I'm trying to chase down a performance issue with a driver I'm working on that does a repeated memcpy_fromio of about 1KB from a PCI device. I made a small change from a fixed size copy to a variable size only to be surprised with a performance decrease of about 1/3. I've looked through the code and discovered this is due to switching out __builtin_memcpy with __memcpy when the length is not constant. This makes sense and I've now been looking into the inner workings of memcpy_64.S. The CPU I'm testing on is a Sandy Bridge and according to /proc/cpuinfo, it does _not_ have the erms bit and does have the rep_good bit. So I expect to be using the "rep movsq" version of memcpy. However, I've done some testing with my driver by hacking in specific memcpy implementations and I've found the following results: __builtin_memcpy w/const length: 85KB/s memcpy_fromio: 26kB/s __builtin_memcpy: 26kB/s memcpy_movsq: 126kB/s memcpy_erms: 26kB/s Thus, based on these performance numbers, it almost seems like my platform is using the erms version when it probably shouldn't be. So my question is: how do I find out what version of memcpy my actual machine is using and fix it if it is wrong? I'm running with a 4.10.0 kernel and I've attached my cpuinfo. Thanks for any insights, Logan [-- Attachment #2: cpuinfo.txt --] [-- Type: text/plain, Size: 7488 bytes --] processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4799.90 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4801.06 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 0 siblings : 4 core id : 2 cpu cores : 4 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4801.12 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4801.10 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 1 siblings : 4 core id : 0 cpu cores : 4 apicid : 32 initial apicid : 32 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4801.43 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 1 siblings : 4 core id : 1 cpu cores : 4 apicid : 34 initial apicid : 34 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4801.74 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 1 siblings : 4 core id : 2 cpu cores : 4 apicid : 36 initial apicid : 36 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4801.76 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz stepping : 7 microcode : 0x710 cpu MHz : 1200.000 cache size : 10240 KB physical id : 1 siblings : 4 core id : 3 cpu cores : 4 apicid : 38 initial apicid : 38 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts bugs : bogomips : 4801.78 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-04 20:08 Question Regarding ERMS memcpy Logan Gunthorpe @ 2017-03-04 22:43 ` Borislav Petkov [not found] ` <c675a6c8-e87a-ece2-2350-cdd079b1b610@deltatee.com> 0 siblings, 1 reply; 22+ messages in thread From: Borislav Petkov @ 2017-03-04 22:43 UTC (permalink / raw) To: Logan Gunthorpe Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sat, Mar 04, 2017 at 01:08:15PM -0700, Logan Gunthorpe wrote: > So my question is: how do I find out what version of memcpy my actual > machine is using and fix it if it is wrong? You can boot with "debug-alternative" and look for those strings where it says "feat:" [ 0.261386] apply_alternatives: feat: 3*32+18, old: (ffffffff8101e3c4, len: 3), repl: (ffffffff81de4e7a, len: 3), pad: 3 ^^^^^^^ [ 0.264003] ffffffff8101e3c4: old_insn: 0f 1f 00 [ 0.265235] ffffffff81de4e7a: rpl_insn: 0f ae e8 [ 0.266469] ffffffff8101e3c4: final_insn: 0f ae e8 if it says 9*32+9 there, then it really does patch in the ERMS variant. If it says 3*32+16 and it patches a NOP, i.e., something like this: [ 2.022665] apply_alternatives: feat: 3*32+16, old: (ffffffff812dbfe0, len: 5), repl: (ffffffff81de5336, len: 0), pad: 3 [ 2.024003] ffffffff812dbfe0: old_insn: eb 0e 90 90 90 [ 2.025332] ffffffff812dbfe0: final_insn: 0f 1f 44 00 00 ^^^^ NOP then, you're using the rep; movsq variant. Also, do make <path-to-file>.s of the file where it does memcpy_fromio() and attach it here. I'd like to see what the compiler generates. HTH. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <c675a6c8-e87a-ece2-2350-cdd079b1b610@deltatee.com>]
* Re: Question Regarding ERMS memcpy [not found] ` <c675a6c8-e87a-ece2-2350-cdd079b1b610@deltatee.com> @ 2017-03-04 23:55 ` hpa 2017-03-05 0:14 ` Borislav Petkov 0 siblings, 1 reply; 22+ messages in thread From: hpa @ 2017-03-04 23:55 UTC (permalink / raw) To: Logan Gunthorpe, Borislav Petkov Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On March 4, 2017 3:46:44 PM PST, Logan Gunthorpe <logang@deltatee.com> wrote: >Hi Borislav, > >Thanks for the help. > >On 04/03/17 03:43 PM, Borislav Petkov wrote: >> You can boot with "debug-alternative" and look for those strings >where > >Here's the symbols for memcpy and the corresponding apply_alternatives >lines: > >ffffffff8122df90 T __memcpy >ffffffff8122df90 W memcpy >ffffffff8122dfb0 T memcpy_erms >ffffffff8122dfc0 T memcpy_orig > >[ 0.076018] apply_alternatives: feat: 3*32+16, old: >(ffffffff8122df90, len: 5), repl: (ffffffff819d7ac5, len: 0), pad: 0 >[ 0.076019] ffffffff8122df90: old_insn: e9 2b 00 00 00 >[ 0.076021] ffffffff8122df90: final_insn: 66 66 90 66 90 > >So it looks like it's patching in a NOP as it is supposed to. > > >> Also, do >> >> make <path-to-file>.s >> >> of the file where it does memcpy_fromio() and attach it here. I'd >like to see >> what the compiler generates. > >I've attached the whole file but this is the code in question: > > .loc 5 219 0 > movq 936(%rbp), %rdx # stdev_3(D)->mmio_mrpc, tmp127 > leaq 48(%rbx), %rax #, tmp113 > movq 40(%rbx), %rcx # MEM[(struct switchtec_user *)__mptr_7 + >-56B].read_len, MEM[(struct switchtec_user *)__mptr_7 + -56B].read_len > movq %rax, %rdi # tmp113, tmp118 > leaq 1024(%rdx), %rsi #, tmp114 > rep movsb > >Strangely, it decided to inline a memcpy with rep movsb. Any idea why >the compiler would do that? Is this just gcc being stupid? My gcc >version is below. > >Thanks, > >Logan > > >Using built-in specs. >COLLECT_GCC=gcc >COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper >Target: x86_64-linux-gnu >Configured with: ../src/configure -v --with-pkgversion='Debian >4.9.2-10' >--with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs >--enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr >--program-suffix=-4.9 --enable-shared --enable-linker-build-id >--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix >--with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib >--enable-nls --with-sysroot=/ --enable-clocale=gnu >--enable-libstdcxx-debug --enable-libstdcxx-time=yes >--enable-gnu-unique-object --disable-vtable-verify --enable-plugin >--with-system-zlib --disable-browser-plugin --enable-java-awt=gtk >--enable-gtk-cairo >--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre >--enable-java-home >--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 >--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 >--with-arch-directory=amd64 >--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc >--enable-multiarch --with-arch-32=i586 --with-abi=m64 >--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic >--enable-checking=release --build=x86_64-linux-gnu >--host=x86_64-linux-gnu --target=x86_64-linux-gnu >Thread model: posix >gcc version 4.9.2 (Debian 4.9.2-10) For newer processors, as determined by -mtune=, it is actually the best option for an arbitrary copy. -- Sent from my Android device with K-9 Mail. Please excuse my brevity. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-04 23:55 ` hpa @ 2017-03-05 0:14 ` Borislav Petkov 2017-03-05 0:23 ` hpa 0 siblings, 1 reply; 22+ messages in thread From: Borislav Petkov @ 2017-03-05 0:14 UTC (permalink / raw) To: hpa Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sat, Mar 04, 2017 at 03:55:27PM -0800, hpa@zytor.com wrote: > For newer processors, as determined by -mtune=, it is actually the > best option for an arbitrary copy. So his doesn't have ERMS - it is a SNB - so if for SNB REP_GOOD is the best option for memcpy, then we should probably build with -fno-builtin-memcpy unconditionally. Otherwise gcc apparently inserts its own memcpy variant. And this is probably wrong because we do the detection at boot time and not at build time. For example here it generates REP; MOVSL for the call in drivers/firmware/dmi_scan.c::dmi_scan_machine() which looks wrong to me. Length is 32 so it could just as well do REP; MOVSQ. IOW, we could do something like this: --- diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 2d449337a360..c1b68d147b8d 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -142,10 +142,7 @@ ifdef CONFIG_X86_X32 endif export CONFIG_X86_X32_ABI -# Don't unroll struct assignments with kmemcheck enabled -ifeq ($(CONFIG_KMEMCHECK),y) - KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy) -endif +KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy) # Stackpointer is addressed different for 32 bit and 64 bit x86 sp-$(CONFIG_X86_32) := esp -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 0:14 ` Borislav Petkov @ 2017-03-05 0:23 ` hpa 2017-03-05 0:33 ` Borislav Petkov 0 siblings, 1 reply; 22+ messages in thread From: hpa @ 2017-03-05 0:23 UTC (permalink / raw) To: Borislav Petkov Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On March 4, 2017 4:14:47 PM PST, Borislav Petkov <bp@suse.de> wrote: >On Sat, Mar 04, 2017 at 03:55:27PM -0800, hpa@zytor.com wrote: >> For newer processors, as determined by -mtune=, it is actually the >> best option for an arbitrary copy. > >So his doesn't have ERMS - it is a SNB - so if for SNB REP_GOOD is >the best option for memcpy, then we should probably build with >-fno-builtin-memcpy unconditionally. Otherwise gcc apparently inserts >its own memcpy variant. > >And this is probably wrong because we do the detection at boot time and >not at build time. > >For example here it generates REP; MOVSL for the call in >drivers/firmware/dmi_scan.c::dmi_scan_machine() which looks wrong to >me. >Length is 32 so it could just as well do REP; MOVSQ. > >IOW, we could do something like this: > >--- >diff --git a/arch/x86/Makefile b/arch/x86/Makefile >index 2d449337a360..c1b68d147b8d 100644 >--- a/arch/x86/Makefile >+++ b/arch/x86/Makefile >@@ -142,10 +142,7 @@ ifdef CONFIG_X86_X32 > endif > export CONFIG_X86_X32_ABI > >-# Don't unroll struct assignments with kmemcheck enabled >-ifeq ($(CONFIG_KMEMCHECK),y) >- KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy) >-endif >+KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy) > > # Stackpointer is addressed different for 32 bit and 64 bit x86 > sp-$(CONFIG_X86_32) := esp What are the compilation flags? It may be that gcc still does TRT depending on this call site. I'd check what gcc6 or 7 generates, though. -- Sent from my Android device with K-9 Mail. Please excuse my brevity. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 0:23 ` hpa @ 2017-03-05 0:33 ` Borislav Petkov 2017-03-05 0:56 ` hpa 2017-03-05 4:58 ` Logan Gunthorpe 0 siblings, 2 replies; 22+ messages in thread From: Borislav Petkov @ 2017-03-05 0:33 UTC (permalink / raw) To: hpa Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sat, Mar 04, 2017 at 04:23:17PM -0800, hpa@zytor.com wrote: > What are the compilation flags? It may be that gcc still does TRT > depending on this call site. I'd check what gcc6 or 7 generates, > though. Well, I don't think that matters: if you're building a kernel on one machine to boot on another machine, the compiler can't know at build time what the best MOVS* variant would be for the target machine. That's why we're doing the alternatives patching at *boot* time. However, if the size is small enough, the CALL/patch overhead would be definitely too much. Hmm, I wish we were able to say, "let gcc decide for small sizes and let us do the patching for larger ones." -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 0:33 ` Borislav Petkov @ 2017-03-05 0:56 ` hpa 2017-03-05 9:50 ` Borislav Petkov 2017-03-05 4:58 ` Logan Gunthorpe 1 sibling, 1 reply; 22+ messages in thread From: hpa @ 2017-03-05 0:56 UTC (permalink / raw) To: Borislav Petkov Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On March 4, 2017 4:33:49 PM PST, Borislav Petkov <bp@suse.de> wrote: >On Sat, Mar 04, 2017 at 04:23:17PM -0800, hpa@zytor.com wrote: >> What are the compilation flags? It may be that gcc still does TRT >> depending on this call site. I'd check what gcc6 or 7 generates, >> though. > >Well, I don't think that matters: if you're building a kernel on one >machine to boot on another machine, the compiler can't know at build >time what the best MOVS* variant would be for the target machine. >That's >why we're doing the alternatives patching at *boot* time. > >However, if the size is small enough, the CALL/patch overhead would be >definitely too much. > >Hmm, I wish we were able to say, "let gcc decide for small sizes and >let >us do the patching for larger ones." That's what the -march= and -mtune= option do! -- Sent from my Android device with K-9 Mail. Please excuse my brevity. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 0:56 ` hpa @ 2017-03-05 9:50 ` Borislav Petkov 2017-03-05 11:18 ` Borislav Petkov 2017-03-05 19:19 ` Linus Torvalds 0 siblings, 2 replies; 22+ messages in thread From: Borislav Petkov @ 2017-03-05 9:50 UTC (permalink / raw) To: hpa Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sat, Mar 04, 2017 at 04:56:38PM -0800, hpa@zytor.com wrote: > That's what the -march= and -mtune= option do! How does that even help with a distro kernel built with -mtune=generic ? gcc can't possibly know on what targets is that kernel going to be booted on. So it probably does some universally optimal things, like in the dmi_scan_machine() case: memcpy_fromio(buf, p, 32); turns into: .loc 3 219 0 movl $8, %ecx #, tmp79 movq %rax, %rsi # p, p movq %rsp, %rdi #, tmp77 rep movsl Apparently it thinks it is fine to do 8*4-byte MOVS. But why not 4*8-byte MOVS? That's half the loops. [ It is a whole different story what the machine actually does underneath. It being a half cacheline probably doesn't help and it really does the separate MOVs but then it would be cheaper if it did 4 8-byte ones. ] One thing's for sure - both variants are certainly cheaper than to CALL a memcpy variant. What we probably should try to do, though, is simply patch in the body of REP; MOVSQ or REP; MOVSB into the call sites and only have a call to memcpy_orig() because that last one if fat. I remember we did talk about it at some point but don't remember why we didn't do it. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 9:50 ` Borislav Petkov @ 2017-03-05 11:18 ` Borislav Petkov 2017-03-05 13:05 ` Borislav Petkov 2017-03-05 19:19 ` Linus Torvalds 1 sibling, 1 reply; 22+ messages in thread From: Borislav Petkov @ 2017-03-05 11:18 UTC (permalink / raw) To: hpa Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List, Linus Torvalds On Sun, Mar 05, 2017 at 10:50:59AM +0100, Borislav Petkov wrote: > On Sat, Mar 04, 2017 at 04:56:38PM -0800, hpa@zytor.com wrote: > > That's what the -march= and -mtune= option do! > > How does that even help with a distro kernel built with -mtune=generic ? > > gcc can't possibly know on what targets is that kernel going to be > booted on. So it probably does some universally optimal things, like in > the dmi_scan_machine() case: > > memcpy_fromio(buf, p, 32); > > turns into: > > .loc 3 219 0 > movl $8, %ecx #, tmp79 > movq %rax, %rsi # p, p > movq %rsp, %rdi #, tmp77 > rep movsl > > Apparently it thinks it is fine to do 8*4-byte MOVS. But why not > 4*8-byte MOVS? > > That's half the loops. > > [ It is a whole different story what the machine actually does underneath. It > being a half cacheline probably doesn't help and it really does the separate > MOVs but then it would be cheaper if it did 4 8-byte ones. ] > > One thing's for sure - both variants are certainly cheaper than to CALL > a memcpy variant. > > What we probably should try to do, though, is simply patch in the body > of REP; MOVSQ or REP; MOVSB into the call sites and only have a call to > memcpy_orig() because that last one if fat. > > I remember we did talk about it at some point but don't remember why we > didn't do it. Right, and I believe it was Linus who mentioned we should simply patch in the small REP_GOOD and ERMS memcpy body into the call sites. CCed. Anyway, something like below. It almost builds here - I still need to figure out that arch/x86/boot/compressed/misc.c including decompression logic which does memcpy and it turns into the alternative version which we don't want in arch/x86/boot/compressed/. Also, I need to check what vmlinuz size bloat we're talking: with the diff below, we do add padding which looks like this: 3d: 90 nop 3e: 90 nop 3f: 90 nop 40: 90 nop 41: 90 nop 42: 90 nop 43: 90 nop 44: 90 nop 45: 90 nop 46: 90 nop 47: 90 nop 48: 90 nop 49: 90 nop 4a: 90 nop 4b: 90 nop 4c: 90 nop 4d: 90 nop 4e: 90 nop 4f: 90 nop and that at every call site. But at least we save ourselves the call overhead and use the optimal memcpy variant on each machine. --- diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 2d449337a360..24711ca4033b 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -143,9 +143,7 @@ endif export CONFIG_X86_X32_ABI # Don't unroll struct assignments with kmemcheck enabled -ifeq ($(CONFIG_KMEMCHECK),y) - KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy) -endif +KBUILD_CFLAGS += $(call cc-option,-fno-builtin) # Stackpointer is addressed different for 32 bit and 64 bit x86 sp-$(CONFIG_X86_32) := esp diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h index 1b020381ab38..8216c2e610ae 100644 --- a/arch/x86/include/asm/alternative.h +++ b/arch/x86/include/asm/alternative.h @@ -204,6 +204,12 @@ static inline int alternatives_text_reserved(void *start, void *end) asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) \ : output : "i" (0), ## input) +#define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2, feature2, \ + output, input...) \ + asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \ + : output \ + : "i" (0), ## input) + /* Like alternative_io, but for replacing a direct call with another one. */ #define alternative_call(oldfunc, newfunc, feature, output, input...) \ asm volatile (ALTERNATIVE("call %P[old]", "call %P[new]", feature) \ diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index a164862d77e3..a529a12a35ed 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -3,6 +3,8 @@ #ifdef __KERNEL__ #include <linux/jump_label.h> +#include <asm/cpufeatures.h> +#include <asm/alternative.h> /* Written 2002 by Andi Kleen */ @@ -28,8 +30,37 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t function. */ #define __HAVE_ARCH_MEMCPY 1 -extern void *memcpy(void *to, const void *from, size_t len); -extern void *__memcpy(void *to, const void *from, size_t len); +void *memcpy_orig(void *to, const void *from, size_t len); + +/* + * We build a call to memcpy_orig by default which replaced on + * the majority of x86 CPUs which set REP_GOOD. In addition, CPUs which + * have the enhanced REP MOVSB/STOSB feature (ERMS), change the REP_GOOD + * routine to a REP; MOVSB mem copy. + */ +static inline void *memcpy(void *to, const void *from, size_t len) +{ + void *ret; + + alternative_io_2("call memcpy_orig\n\t", + "movq %%rdi, %%rax\n\t" + "movq %%rdx, %%rcx\n\t" + "shrq $3, %%rcx\n\t" + "andl $7, %%edx\n\t" + "rep movsq\n\t" + "movl %%edx, %%ecx\n\t" + "rep movsb\n\t", + X86_FEATURE_REP_GOOD, + "movq %%rdi, %%rax\n\t" + "movq %%rdx, %%rcx\n\t" + "rep movsb\n\t", + X86_FEATURE_ERMS, + ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from), "=d" (len)), + "1" (to), "2" (from), "3" (len) + : "memory", "rcx"); + + return ret; +} #ifndef CONFIG_KMEMCHECK #if (__GNUC__ == 4 && __GNUC_MINOR__ < 3) || __GNUC__ < 4 diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S index 779782f58324..c3c8890b155c 100644 --- a/arch/x86/lib/memcpy_64.S +++ b/arch/x86/lib/memcpy_64.S @@ -6,13 +6,6 @@ #include <asm/alternative-asm.h> #include <asm/export.h> -/* - * We build a jump to memcpy_orig by default which gets NOPped out on - * the majority of x86 CPUs which set REP_GOOD. In addition, CPUs which - * have the enhanced REP MOVSB/STOSB feature (ERMS), change those NOPs - * to a jmp to memcpy_erms which does the REP; MOVSB mem copy. - */ - .weak memcpy /* @@ -26,36 +19,12 @@ * Output: * rax original destination */ -ENTRY(__memcpy) -ENTRY(memcpy) - ALTERNATIVE_2 "jmp memcpy_orig", "", X86_FEATURE_REP_GOOD, \ - "jmp memcpy_erms", X86_FEATURE_ERMS - - movq %rdi, %rax - movq %rdx, %rcx - shrq $3, %rcx - andl $7, %edx - rep movsq - movl %edx, %ecx - rep movsb - ret -ENDPROC(memcpy) -ENDPROC(__memcpy) -EXPORT_SYMBOL(memcpy) -EXPORT_SYMBOL(__memcpy) - -/* - * memcpy_erms() - enhanced fast string memcpy. This is faster and - * simpler than memcpy. Use memcpy_erms when possible. - */ -ENTRY(memcpy_erms) - movq %rdi, %rax - movq %rdx, %rcx - rep movsb - ret -ENDPROC(memcpy_erms) - ENTRY(memcpy_orig) + pushq %r8 + pushq %r9 + pushq %r10 + pushq %r11 + movq %rdi, %rax cmpq $0x20, %rdx @@ -179,8 +148,14 @@ ENTRY(memcpy_orig) movb %cl, (%rdi) .Lend: + popq %r11 + popq %r10 + popq %r9 + popq %r8 + retq ENDPROC(memcpy_orig) +EXPORT_SYMBOL_GPL(memcpy_orig) #ifndef CONFIG_UML /* -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 11:18 ` Borislav Petkov @ 2017-03-05 13:05 ` Borislav Petkov 0 siblings, 0 replies; 22+ messages in thread From: Borislav Petkov @ 2017-03-05 13:05 UTC (permalink / raw) To: hpa Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List, Linus Torvalds On Sun, Mar 05, 2017 at 12:18:23PM +0100, Borislav Petkov wrote: > Also, I need to check what vmlinuz size bloat we're talking: with the > diff below, we do add padding which looks like this: Yeah, even a tailored config adds ~67K: text data bss dec hex filename 7567290 4040894 2560000 14168184 d83078 vmlinux 7635332 4041694 2560000 14237026 d93d62 vmlinux.memcpy I'm assuming allmodconfig would become even bigger. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 9:50 ` Borislav Petkov 2017-03-05 11:18 ` Borislav Petkov @ 2017-03-05 19:19 ` Linus Torvalds 2017-03-05 19:54 ` Borislav Petkov 1 sibling, 1 reply; 22+ messages in thread From: Linus Torvalds @ 2017-03-05 19:19 UTC (permalink / raw) To: Borislav Petkov Cc: Peter Anvin, Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sun, Mar 5, 2017 at 1:50 AM, Borislav Petkov <bp@suse.de> wrote: > > gcc can't possibly know on what targets is that kernel going to be > booted on. So it probably does some universally optimal things, like in > the dmi_scan_machine() case: > > memcpy_fromio(buf, p, 32); > > turns into: > > .loc 3 219 0 > movl $8, %ecx #, tmp79 > movq %rax, %rsi # p, p > movq %rsp, %rdi #, tmp77 > rep movsl > > Apparently it thinks it is fine to do 8*4-byte MOVS. But why not > 4*8-byte MOVS? Actually, the "fromio/toio" code should never use regular memcpy(). There used to be devices that literally broke on 64-bit accesses due to broken PCI crud. We seem to have broken this *really* long ago, though. On x86-64 we used to have a special __inline_memcpy() that copies our historical 32-bit thing, and was used for memcpy_fromio() and memcpy_toio(). That was then undone by commit 6175ddf06b61 ("x86: Clean up mem*io functions") That commit says "Iomem has no special significance on x86" but that's not strictly true. iomem is in the same address space and uses the same access instructions as regular memory, but iomem _is_ special. And I think it's a bug that we use "memcpy()" on it. Not because of any gcc issues, but simply because our own memcpy() optimizations are not appropriate for iomem. For example, "rep movsb" really is the right thing to use on normal memory on modern CPU's. But it is *not* the right thing to use on IO memory, because the CPU only does the magic cacheline access optimizations on cacheable memory! So I think we should re-introduce that old "__inline_memcpy()" as that special "safe memcpy" thing. Not just for KMEMCHECK, and not just for 64-bit. Hmm? Linus ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 19:19 ` Linus Torvalds @ 2017-03-05 19:54 ` Borislav Petkov 2017-03-05 20:19 ` Linus Torvalds 2017-03-06 7:01 ` Logan Gunthorpe 0 siblings, 2 replies; 22+ messages in thread From: Borislav Petkov @ 2017-03-05 19:54 UTC (permalink / raw) To: Linus Torvalds, Logan Gunthorpe Cc: Peter Anvin, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sun, Mar 05, 2017 at 11:19:42AM -0800, Linus Torvalds wrote: > Actually, the "fromio/toio" code should never use regular memcpy(). > There used to be devices that literally broke on 64-bit accesses due > to broken PCI crud. > > We seem to have broken this *really* long ago, though. I wonder why nothing blew up or failed strangely by now... > On x86-64 we used to have a special __inline_memcpy() that copies our > historical It is still there in arch/x86/include/asm/string_64.h. The comment "Only used for special circumstances." over it is grossly understating it. > 32-bit thing, and was used for memcpy_fromio() and memcpy_toio(). That > was then undone by commit 6175ddf06b61 ("x86: Clean up mem*io > functions") > > That commit says > > "Iomem has no special significance on x86" > > but that's not strictly true. iomem is in the same address space and > uses the same access instructions as regular memory, but iomem _is_ > special. > > And I think it's a bug that we use "memcpy()" on it. Not because of > any gcc issues, but simply because our own memcpy() optimizations are > not appropriate for iomem. > > For example, "rep movsb" really is the right thing to use on normal > memory on modern CPU's. So Logan's box is a SNB and it doesn't have the ERMS optimizations. Are you saying, regardless, we should let gcc put REP; MOVSB for smaller sizes? Because gcc does generate a REP; MOVSB there when it puts its own memcpy, see mail upthread. (Even though that is wrong to do on iomem.) > But it is *not* the right thing to use on IO memory, because the CPU > only does the magic cacheline access optimizations on cacheable > memory! Ah, yes. > So I think we should re-introduce that old "__inline_memcpy()" as that > special "safe memcpy" thing. Not just for KMEMCHECK, and not just for > 64-bit. Logan, wanna give that a try, see if it takes care of your issue? Oh, and along with the revert we would need a big fat warning explaining why we need that special memcpy for IO memory. -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 19:54 ` Borislav Petkov @ 2017-03-05 20:19 ` Linus Torvalds 2017-03-06 7:01 ` Logan Gunthorpe 1 sibling, 0 replies; 22+ messages in thread From: Linus Torvalds @ 2017-03-05 20:19 UTC (permalink / raw) To: Borislav Petkov Cc: Logan Gunthorpe, Peter Anvin, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sun, Mar 5, 2017 at 11:54 AM, Borislav Petkov <bp@suse.de> wrote: >> >> We seem to have broken this *really* long ago, though. > > I wonder why nothing blew up or failed strangely by now... The hardware that cared was pretty broken to begin with, and I think it was mainly some really odd graphics cards. And from memory, they had issues with 64-bit writes. We actually have a slow 16-bit word at a time copy for exactly these kinds of issues: scr_memcpyw() and friends. I'd like to say that it was one of those shit server-only cards that nobody sane would ever use (but "server hardware is validated and better quality!"), but that might have been another issue. >> For example, "rep movsb" really is the right thing to use on normal >> memory on modern CPU's. > > So Logan's box is a SNB and it doesn't have the ERMS optimizations. Are > you saying, regardless, we should let gcc put REP; MOVSB for smaller > sizes? I think gcc makes bad choices, but they are gcc';s choices to make. I have up on gcc's "-Os" because the choices were so bad and not getting fixed. .. and none of that has _anything_ to do with accesses to IO memory, which is fundamentally different. > Because gcc does generate a REP; MOVSB there when it puts its own > memcpy, see mail upthread. (Even though that is wrong to do on iomem.) No, *THAT* is not wrong to do on iomem. If we tell gcc that "memcpy()" works on iomem, then gcc can damn well do whatever it wants. "rep stosb" isn't wrong for memcpy(). Gcc may do stupid things with it, but that's completely immaterial. > Oh, and along with the revert we would need a big fat warning explaining > why we need that special memcpy for IO memory. Well, quite frankly, just a simple "IO memory is different from cached memory" should be sufficient. Linus ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 19:54 ` Borislav Petkov 2017-03-05 20:19 ` Linus Torvalds @ 2017-03-06 7:01 ` Logan Gunthorpe 2017-03-06 7:28 ` H. Peter Anvin 2017-03-06 13:33 ` Borislav Petkov 1 sibling, 2 replies; 22+ messages in thread From: Logan Gunthorpe @ 2017-03-06 7:01 UTC (permalink / raw) To: Borislav Petkov, Linus Torvalds Cc: Peter Anvin, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sun, Mar 05, 2017 at 11:19:42AM -0800, Linus Torvalds wrote: >> But it is *not* the right thing to use on IO memory, because the CPU >> only does the magic cacheline access optimizations on cacheable >> memory! Yes, and actually this is where I started. I thought my memcpy was using byte accesses on purpose and I needed to create a patch for a different IO memcpy because obviously byte accesses over the PCI bus would be very un-ideal. However, when I found my system wasn't intentionally using that implementation that was no longer my focus. So, I have no way to test this, but it sounds like any Ivy bridge system using the ERMS version of memcpy could have the same slow PCI memcpy performance I've been seeing (unless the microcode fixes it up?). So it sounds like it would be a good idea to revert the change Linus is talking about. >> So I think we should re-introduce that old "__inline_memcpy()" as that >> special "safe memcpy" thing. Not just for KMEMCHECK, and not just for >> 64-bit. On 05/03/17 12:54 PM, Borislav Petkov wrote: > Logan, wanna give that a try, see if it takes care of your issue? Well honestly my issue was solved by fixing my kernel config. I have no idea why I had optimize for size in there in the first place. Thanks, Logan ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-06 7:01 ` Logan Gunthorpe @ 2017-03-06 7:28 ` H. Peter Anvin 2017-03-06 17:12 ` Logan Gunthorpe 2017-03-06 13:33 ` Borislav Petkov 1 sibling, 1 reply; 22+ messages in thread From: H. Peter Anvin @ 2017-03-06 7:28 UTC (permalink / raw) To: Logan Gunthorpe, Borislav Petkov, Linus Torvalds Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On 03/05/17 23:01, Logan Gunthorpe wrote: > > On 05/03/17 12:54 PM, Borislav Petkov wrote: >> Logan, wanna give that a try, see if it takes care of your issue? > > Well honestly my issue was solved by fixing my kernel config. I have no > idea why I had optimize for size in there in the first place. > Yes, to gcc "optimize for size" means exactly that... intended for cases where saving storage (e.g. ROM) or code download time is paramount. We have frequently asked them for an optimization option that is not strictly byte-count-based but still tries to reduce cache footprint, but it is a nontrivial project. -hpa ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-06 7:28 ` H. Peter Anvin @ 2017-03-06 17:12 ` Logan Gunthorpe 2017-03-06 18:53 ` hpa 0 siblings, 1 reply; 22+ messages in thread From: Logan Gunthorpe @ 2017-03-06 17:12 UTC (permalink / raw) To: H. Peter Anvin, Borislav Petkov, Linus Torvalds Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On 06/03/17 12:28 AM, H. Peter Anvin wrote: > On 03/05/17 23:01, Logan Gunthorpe wrote: >> >> On 05/03/17 12:54 PM, Borislav Petkov wrote: >>> Logan, wanna give that a try, see if it takes care of your issue? >> >> Well honestly my issue was solved by fixing my kernel config. I have no >> idea why I had optimize for size in there in the first place. >> > > Yes, to gcc "optimize for size" means exactly that... intended for cases > where saving storage (e.g. ROM) or code download time is paramount. I agree and understand, however placing a poorly performing _inline_ memcpy instead of a single call instruction to a performant memcopy probably took more code space in the end. So like Linus, I just have to scratch my head at the -Os optimization option. Logan ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-06 17:12 ` Logan Gunthorpe @ 2017-03-06 18:53 ` hpa 0 siblings, 0 replies; 22+ messages in thread From: hpa @ 2017-03-06 18:53 UTC (permalink / raw) To: Logan Gunthorpe, Borislav Petkov, Linus Torvalds Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On March 6, 2017 9:12:41 AM PST, Logan Gunthorpe <logang@deltatee.com> wrote: > > >On 06/03/17 12:28 AM, H. Peter Anvin wrote: >> On 03/05/17 23:01, Logan Gunthorpe wrote: >>> >>> On 05/03/17 12:54 PM, Borislav Petkov wrote: >>>> Logan, wanna give that a try, see if it takes care of your issue? >>> >>> Well honestly my issue was solved by fixing my kernel config. I have >no >>> idea why I had optimize for size in there in the first place. >>> >> >> Yes, to gcc "optimize for size" means exactly that... intended for >cases >> where saving storage (e.g. ROM) or code download time is paramount. > >I agree and understand, however placing a poorly performing _inline_ >memcpy instead of a single call instruction to a performant memcopy >probably took more code space in the end. So like Linus, I just have to >scratch my head at the -Os optimization option. > >Logan No, it will be smaller: -Os counts bytes. If you think about it, there is no way that replacing a five-byte subroutine call with a two-byte instruction opcode can make it bigger! The only other difference between the two from a size perspective is that the compiler doesn't have to worry about clobbered registers other than the argument registers. -- Sent from my Android device with K-9 Mail. Please excuse my brevity. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-06 7:01 ` Logan Gunthorpe 2017-03-06 7:28 ` H. Peter Anvin @ 2017-03-06 13:33 ` Borislav Petkov 2017-03-06 13:41 ` hpa 1 sibling, 1 reply; 22+ messages in thread From: Borislav Petkov @ 2017-03-06 13:33 UTC (permalink / raw) To: Logan Gunthorpe, Linus Torvalds Cc: Peter Anvin, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote: > Well honestly my issue was solved by fixing my kernel config. I have no > idea why I had optimize for size in there in the first place. I still think that we should address the iomem memcpy Linus mentioned. So how about this partial revert. I've made 32-bit use the same special __memcpy() version. Hmmm? --- diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h index 7afb0e2f07f4..9e378a10796d 100644 --- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void); #ifdef __KERNEL__ #include <asm-generic/iomap.h> +#include <asm/string.h> /* * Convert a virtual cached pointer to an uncached pointer @@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned char val, size_t count) * @src: The (I/O memory) source for the data * @count: The number of bytes to copy * - * Copy a block of data from I/O memory. + * Copy a block of data from I/O memory. IO memory is different from + * cached memory so we use special memcpy version. */ static inline void memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count) { - memcpy(dst, (const void __force *)src, count); + __inline_memcpy(dst, (const void __force *)src, count); } /** @@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count) * @src: The (RAM) source for the data * @count: The number of bytes to copy * - * Copy a block of data to I/O memory. + * Copy a block of data to I/O memory. IO memory is different from + * cached memory so we use special memcpy version. */ static inline void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) { - memcpy((void __force *)dst, src, count); + __inline_memcpy((void __force *)dst, src, count); } /* diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h index 3d3e8353ee5c..556fa4a975ff 100644 --- a/arch/x86/include/asm/string_32.h +++ b/arch/x86/include/asm/string_32.h @@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c); #define __HAVE_ARCH_STRLEN extern size_t strlen(const char *s); +#define __inline_memcpy __memcpy static __always_inline void *__memcpy(void *to, const void *from, size_t n) { int d0, d1, d2; -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-06 13:33 ` Borislav Petkov @ 2017-03-06 13:41 ` hpa 2017-03-06 14:00 ` Borislav Petkov 0 siblings, 1 reply; 22+ messages in thread From: hpa @ 2017-03-06 13:41 UTC (permalink / raw) To: Borislav Petkov, Logan Gunthorpe, Linus Torvalds Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On March 6, 2017 5:33:28 AM PST, Borislav Petkov <bp@suse.de> wrote: >On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote: >> Well honestly my issue was solved by fixing my kernel config. I have >no >> idea why I had optimize for size in there in the first place. > >I still think that we should address the iomem memcpy Linus mentioned. >So how about this partial revert. I've made 32-bit use the same special >__memcpy() version. > >Hmmm? > >--- >diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h >index 7afb0e2f07f4..9e378a10796d 100644 >--- a/arch/x86/include/asm/io.h >+++ b/arch/x86/include/asm/io.h >@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void); > #ifdef __KERNEL__ > > #include <asm-generic/iomap.h> >+#include <asm/string.h> > > /* > * Convert a virtual cached pointer to an uncached pointer >@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned >char val, size_t count) > * @src: The (I/O memory) source for the data > * @count: The number of bytes to copy > * >- * Copy a block of data from I/O memory. >+ * Copy a block of data from I/O memory. IO memory is different from >+ * cached memory so we use special memcpy version. > */ > static inline void >memcpy_fromio(void *dst, const volatile void __iomem *src, size_t >count) > { >- memcpy(dst, (const void __force *)src, count); >+ __inline_memcpy(dst, (const void __force *)src, count); > } > > /** >@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void >__iomem *src, size_t count) > * @src: The (RAM) source for the data > * @count: The number of bytes to copy > * >- * Copy a block of data to I/O memory. >+ * Copy a block of data to I/O memory. IO memory is different from >+ * cached memory so we use special memcpy version. > */ > static inline void > memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) > { >- memcpy((void __force *)dst, src, count); >+ __inline_memcpy((void __force *)dst, src, count); > } > > /* >diff --git a/arch/x86/include/asm/string_32.h >b/arch/x86/include/asm/string_32.h >index 3d3e8353ee5c..556fa4a975ff 100644 >--- a/arch/x86/include/asm/string_32.h >+++ b/arch/x86/include/asm/string_32.h >@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c); > #define __HAVE_ARCH_STRLEN > extern size_t strlen(const char *s); > >+#define __inline_memcpy __memcpy >static __always_inline void *__memcpy(void *to, const void *from, >size_t n) > { > int d0, d1, d2; It isn't really that straightforward IMO. For UC memory transaction size really needs to be specified explicitly at all times and should be part of the API, rather than implicit. For WC/WT/WB device memory, the ordinary memcpy is valid and preferred. -- Sent from my Android device with K-9 Mail. Please excuse my brevity. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-06 13:41 ` hpa @ 2017-03-06 14:00 ` Borislav Petkov 0 siblings, 0 replies; 22+ messages in thread From: Borislav Petkov @ 2017-03-06 14:00 UTC (permalink / raw) To: hpa Cc: Logan Gunthorpe, Linus Torvalds, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Mon, Mar 06, 2017 at 05:41:22AM -0800, hpa@zytor.com wrote: > It isn't really that straightforward IMO. > > For UC memory transaction size really needs to be specified explicitly > at all times and should be part of the API, rather than implicit. > > For WC/WT/WB device memory, the ordinary memcpy is valid and > preferred. I'm practically partially reverting 6175ddf06b61 ("x86: Clean up mem*io functions.") Are you saying, this was wrong before too? Maybe it was wrong, strictly speaking, but maybe that was good enough for our purposes... -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 0:33 ` Borislav Petkov 2017-03-05 0:56 ` hpa @ 2017-03-05 4:58 ` Logan Gunthorpe 2017-03-05 9:54 ` Borislav Petkov 1 sibling, 1 reply; 22+ messages in thread From: Logan Gunthorpe @ 2017-03-05 4:58 UTC (permalink / raw) To: Borislav Petkov, hpa Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List Hey, On 04/03/17 05:33 PM, Borislav Petkov wrote: > On Sat, Mar 04, 2017 at 04:23:17PM -0800, hpa@zytor.com wrote: >> What are the compilation flags? It may be that gcc still does TRT >> depending on this call site. I'd check what gcc6 or 7 generates, >> though. > Hmm, I wish we were able to say, "let gcc decide for small sizes and let > us do the patching for larger ones." So, I've found that my kernel config had the OPTIMIZE_FOR_SIZE selected instead of OPTIMIZE_FOR_PERFORMANCE. I'm not sure why that is but switching to the latter option fixes my problem. A memcpy call is used instead of the poor inline solution. (I'm not really sure how the inline solution even makes any sense as it almost certainly makes things larger in the grand scheme of things.) It still might make sense to apply your patch asking gcc to never use the broken memcpy but I'll leave that in your capable hands to decide. Anyway, thanks for the help with this. Logan ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question Regarding ERMS memcpy 2017-03-05 4:58 ` Logan Gunthorpe @ 2017-03-05 9:54 ` Borislav Petkov 0 siblings, 0 replies; 22+ messages in thread From: Borislav Petkov @ 2017-03-05 9:54 UTC (permalink / raw) To: Logan Gunthorpe Cc: hpa, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List On Sat, Mar 04, 2017 at 09:58:14PM -0700, Logan Gunthorpe wrote: > So, I've found that my kernel config had the OPTIMIZE_FOR_SIZE selected > instead of OPTIMIZE_FOR_PERFORMANCE. I'm not sure why that is but > switching to the latter option fixes my problem. A memcpy call is used > instead of the poor inline solution. (I'm not really sure how the inline > solution even makes any sense as it almost certainly makes things larger > in the grand scheme of things.) Probably some gcc heuristics don't work as expected... In any case, I have # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set here and it still generates REP; MOVSL in dmi_scan_machine(). -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) -- ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2017-03-06 19:00 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-03-04 20:08 Question Regarding ERMS memcpy Logan Gunthorpe 2017-03-04 22:43 ` Borislav Petkov [not found] ` <c675a6c8-e87a-ece2-2350-cdd079b1b610@deltatee.com> 2017-03-04 23:55 ` hpa 2017-03-05 0:14 ` Borislav Petkov 2017-03-05 0:23 ` hpa 2017-03-05 0:33 ` Borislav Petkov 2017-03-05 0:56 ` hpa 2017-03-05 9:50 ` Borislav Petkov 2017-03-05 11:18 ` Borislav Petkov 2017-03-05 13:05 ` Borislav Petkov 2017-03-05 19:19 ` Linus Torvalds 2017-03-05 19:54 ` Borislav Petkov 2017-03-05 20:19 ` Linus Torvalds 2017-03-06 7:01 ` Logan Gunthorpe 2017-03-06 7:28 ` H. Peter Anvin 2017-03-06 17:12 ` Logan Gunthorpe 2017-03-06 18:53 ` hpa 2017-03-06 13:33 ` Borislav Petkov 2017-03-06 13:41 ` hpa 2017-03-06 14:00 ` Borislav Petkov 2017-03-05 4:58 ` Logan Gunthorpe 2017-03-05 9:54 ` Borislav Petkov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).