linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Question Regarding ERMS memcpy
@ 2017-03-04 20:08 Logan Gunthorpe
  2017-03-04 22:43 ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Logan Gunthorpe @ 2017-03-04 20:08 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck,
	Borislav Petkov, Al Viro
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

Hi,

I'm trying to chase down a performance issue with a driver I'm working
on that does a repeated memcpy_fromio of about 1KB from a PCI device. I
made a small change from a fixed size copy to a variable size only to be
surprised with a performance decrease of about 1/3.

I've looked through the code and discovered this is due to switching out
__builtin_memcpy with __memcpy when the length is not constant. This
makes sense and I've now been looking into the inner workings of
memcpy_64.S.

The CPU I'm testing on is a Sandy Bridge and according to /proc/cpuinfo,
it does _not_ have the erms bit and does have the rep_good bit. So I
expect to be using the "rep movsq" version of memcpy. However, I've done
some testing with my driver by hacking in specific memcpy
implementations and I've found the following results:

__builtin_memcpy w/const length: 85KB/s
memcpy_fromio: 26kB/s
__builtin_memcpy: 26kB/s
memcpy_movsq: 126kB/s
memcpy_erms: 26kB/s

Thus, based on these performance numbers, it almost seems like my
platform is using the erms version when it probably shouldn't be.

So my question is: how do I find out what version of memcpy my actual
machine is using and fix it if it is wrong?

I'm running with a 4.10.0 kernel and I've attached my cpuinfo.

Thanks for any insights,

Logan


[-- Attachment #2: cpuinfo.txt --]
[-- Type: text/plain, Size: 7488 bytes --]

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4799.90
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4801.06
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4801.12
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4801.10
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 1
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 32
initial apicid	: 32
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4801.43
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 1
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 34
initial apicid	: 34
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4801.74
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 1
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 36
initial apicid	: 36
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4801.76
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1200.000
cache size	: 10240 KB
physical id	: 1
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 38
initial apicid	: 38
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm arat pln pts
bugs		:
bogomips	: 4801.78
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-04 20:08 Question Regarding ERMS memcpy Logan Gunthorpe
@ 2017-03-04 22:43 ` Borislav Petkov
       [not found]   ` <c675a6c8-e87a-ece2-2350-cdd079b1b610@deltatee.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2017-03-04 22:43 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Sat, Mar 04, 2017 at 01:08:15PM -0700, Logan Gunthorpe wrote:
> So my question is: how do I find out what version of memcpy my actual
> machine is using and fix it if it is wrong?

You can boot with "debug-alternative" and look for those strings where
it says "feat:"

[    0.261386] apply_alternatives: feat: 3*32+18, old: (ffffffff8101e3c4, len: 3), repl: (ffffffff81de4e7a, len: 3), pad: 3
					 ^^^^^^^

[    0.264003] ffffffff8101e3c4: old_insn: 0f 1f 00
[    0.265235] ffffffff81de4e7a: rpl_insn: 0f ae e8
[    0.266469] ffffffff8101e3c4: final_insn: 0f ae e8

if it says 9*32+9 there, then it really does patch in the ERMS variant.

If it says 3*32+16 and it patches a NOP, i.e., something like this:

[    2.022665] apply_alternatives: feat: 3*32+16, old: (ffffffff812dbfe0, len: 5), repl: (ffffffff81de5336, len: 0), pad: 3
[    2.024003] ffffffff812dbfe0: old_insn: eb 0e 90 90 90
[    2.025332] ffffffff812dbfe0: final_insn: 0f 1f 44 00 00
					     ^^^^ NOP

then, you're using the rep; movsq variant.

Also, do

make <path-to-file>.s

of the file where it does memcpy_fromio() and attach it here. I'd like to see
what the compiler generates.

HTH.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
       [not found]   ` <c675a6c8-e87a-ece2-2350-cdd079b1b610@deltatee.com>
@ 2017-03-04 23:55     ` hpa
  2017-03-05  0:14       ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: hpa @ 2017-03-04 23:55 UTC (permalink / raw)
  To: Logan Gunthorpe, Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On March 4, 2017 3:46:44 PM PST, Logan Gunthorpe <logang@deltatee.com> wrote:
>Hi Borislav,
>
>Thanks for the help.
>
>On 04/03/17 03:43 PM, Borislav Petkov wrote:
>> You can boot with "debug-alternative" and look for those strings
>where
>
>Here's the symbols for memcpy and the corresponding apply_alternatives
>lines:
>
>ffffffff8122df90 T __memcpy
>ffffffff8122df90 W memcpy
>ffffffff8122dfb0 T memcpy_erms
>ffffffff8122dfc0 T memcpy_orig
>
>[    0.076018] apply_alternatives: feat: 3*32+16, old:
>(ffffffff8122df90, len: 5), repl: (ffffffff819d7ac5, len: 0), pad: 0
>[    0.076019] ffffffff8122df90: old_insn: e9 2b 00 00 00
>[    0.076021] ffffffff8122df90: final_insn: 66 66 90 66 90
>
>So it looks like it's patching in a NOP as it is supposed to.
>
>
>> Also, do
>> 
>> make <path-to-file>.s
>> 
>> of the file where it does memcpy_fromio() and attach it here. I'd
>like to see
>> what the compiler generates.
>
>I've attached the whole file but this is the code in question:
>
>	.loc 5 219 0
>	movq	936(%rbp), %rdx	# stdev_3(D)->mmio_mrpc, tmp127
>	leaq	48(%rbx), %rax	#, tmp113
>	movq	40(%rbx), %rcx	# MEM[(struct switchtec_user *)__mptr_7 +
>-56B].read_len, MEM[(struct switchtec_user *)__mptr_7 + -56B].read_len
>	movq	%rax, %rdi	# tmp113, tmp118
>	leaq	1024(%rdx), %rsi	#, tmp114
>	rep movsb
>
>Strangely, it decided to inline a memcpy with rep movsb. Any idea why
>the compiler would do that? Is this just gcc being stupid? My gcc
>version is below.
>
>Thanks,
>
>Logan
>
>
>Using built-in specs.
>COLLECT_GCC=gcc
>COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
>Target: x86_64-linux-gnu
>Configured with: ../src/configure -v --with-pkgversion='Debian
>4.9.2-10'
>--with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs
>--enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr
>--program-suffix=-4.9 --enable-shared --enable-linker-build-id
>--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
>--with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib
>--enable-nls --with-sysroot=/ --enable-clocale=gnu
>--enable-libstdcxx-debug --enable-libstdcxx-time=yes
>--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
>--with-system-zlib --disable-browser-plugin --enable-java-awt=gtk
>--enable-gtk-cairo
>--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre
>--enable-java-home
>--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64
>--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64
>--with-arch-directory=amd64
>--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc
>--enable-multiarch --with-arch-32=i586 --with-abi=m64
>--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
>--enable-checking=release --build=x86_64-linux-gnu
>--host=x86_64-linux-gnu --target=x86_64-linux-gnu
>Thread model: posix
>gcc version 4.9.2 (Debian 4.9.2-10)

For newer processors, as determined by -mtune=, it is actually the best option for an arbitrary copy.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-04 23:55     ` hpa
@ 2017-03-05  0:14       ` Borislav Petkov
  2017-03-05  0:23         ` hpa
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2017-03-05  0:14 UTC (permalink / raw)
  To: hpa
  Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck,
	Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List

On Sat, Mar 04, 2017 at 03:55:27PM -0800, hpa@zytor.com wrote:
> For newer processors, as determined by -mtune=, it is actually the
> best option for an arbitrary copy.

So his doesn't have ERMS - it is a SNB - so if for SNB REP_GOOD is
the best option for memcpy, then we should probably build with
-fno-builtin-memcpy unconditionally. Otherwise gcc apparently inserts
its own memcpy variant.

And this is probably wrong because we do the detection at boot time and
not at build time.

For example here it generates REP; MOVSL for the call in
drivers/firmware/dmi_scan.c::dmi_scan_machine() which looks wrong to me.
Length is 32 so it could just as well do REP; MOVSQ.

IOW, we could do something like this:

---
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 2d449337a360..c1b68d147b8d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -142,10 +142,7 @@ ifdef CONFIG_X86_X32
 endif
 export CONFIG_X86_X32_ABI
 
-# Don't unroll struct assignments with kmemcheck enabled
-ifeq ($(CONFIG_KMEMCHECK),y)
-	KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
-endif
+KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
 
 # Stackpointer is addressed different for 32 bit and 64 bit x86
 sp-$(CONFIG_X86_32) := esp

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  0:14       ` Borislav Petkov
@ 2017-03-05  0:23         ` hpa
  2017-03-05  0:33           ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: hpa @ 2017-03-05  0:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck,
	Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List

On March 4, 2017 4:14:47 PM PST, Borislav Petkov <bp@suse.de> wrote:
>On Sat, Mar 04, 2017 at 03:55:27PM -0800, hpa@zytor.com wrote:
>> For newer processors, as determined by -mtune=, it is actually the
>> best option for an arbitrary copy.
>
>So his doesn't have ERMS - it is a SNB - so if for SNB REP_GOOD is
>the best option for memcpy, then we should probably build with
>-fno-builtin-memcpy unconditionally. Otherwise gcc apparently inserts
>its own memcpy variant.
>
>And this is probably wrong because we do the detection at boot time and
>not at build time.
>
>For example here it generates REP; MOVSL for the call in
>drivers/firmware/dmi_scan.c::dmi_scan_machine() which looks wrong to
>me.
>Length is 32 so it could just as well do REP; MOVSQ.
>
>IOW, we could do something like this:
>
>---
>diff --git a/arch/x86/Makefile b/arch/x86/Makefile
>index 2d449337a360..c1b68d147b8d 100644
>--- a/arch/x86/Makefile
>+++ b/arch/x86/Makefile
>@@ -142,10 +142,7 @@ ifdef CONFIG_X86_X32
> endif
> export CONFIG_X86_X32_ABI
> 
>-# Don't unroll struct assignments with kmemcheck enabled
>-ifeq ($(CONFIG_KMEMCHECK),y)
>-	KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
>-endif
>+KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
> 
> # Stackpointer is addressed different for 32 bit and 64 bit x86
> sp-$(CONFIG_X86_32) := esp

What are the compilation flags?  It may be that gcc still does TRT depending on this call site.  I'd check what gcc6 or 7 generates, though.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  0:23         ` hpa
@ 2017-03-05  0:33           ` Borislav Petkov
  2017-03-05  0:56             ` hpa
  2017-03-05  4:58             ` Logan Gunthorpe
  0 siblings, 2 replies; 22+ messages in thread
From: Borislav Petkov @ 2017-03-05  0:33 UTC (permalink / raw)
  To: hpa
  Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck,
	Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List

On Sat, Mar 04, 2017 at 04:23:17PM -0800, hpa@zytor.com wrote:
> What are the compilation flags? It may be that gcc still does TRT
> depending on this call site. I'd check what gcc6 or 7 generates,
> though.

Well, I don't think that matters: if you're building a kernel on one
machine to boot on another machine, the compiler can't know at build
time what the best MOVS* variant would be for the target machine. That's
why we're doing the alternatives patching at *boot* time.

However, if the size is small enough, the CALL/patch overhead would be
definitely too much.

Hmm, I wish we were able to say, "let gcc decide for small sizes and let
us do the patching for larger ones."

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  0:33           ` Borislav Petkov
@ 2017-03-05  0:56             ` hpa
  2017-03-05  9:50               ` Borislav Petkov
  2017-03-05  4:58             ` Logan Gunthorpe
  1 sibling, 1 reply; 22+ messages in thread
From: hpa @ 2017-03-05  0:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck,
	Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List

On March 4, 2017 4:33:49 PM PST, Borislav Petkov <bp@suse.de> wrote:
>On Sat, Mar 04, 2017 at 04:23:17PM -0800, hpa@zytor.com wrote:
>> What are the compilation flags? It may be that gcc still does TRT
>> depending on this call site. I'd check what gcc6 or 7 generates,
>> though.
>
>Well, I don't think that matters: if you're building a kernel on one
>machine to boot on another machine, the compiler can't know at build
>time what the best MOVS* variant would be for the target machine.
>That's
>why we're doing the alternatives patching at *boot* time.
>
>However, if the size is small enough, the CALL/patch overhead would be
>definitely too much.
>
>Hmm, I wish we were able to say, "let gcc decide for small sizes and
>let
>us do the patching for larger ones."

That's what the -march= and -mtune= option do!
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  0:33           ` Borislav Petkov
  2017-03-05  0:56             ` hpa
@ 2017-03-05  4:58             ` Logan Gunthorpe
  2017-03-05  9:54               ` Borislav Petkov
  1 sibling, 1 reply; 22+ messages in thread
From: Logan Gunthorpe @ 2017-03-05  4:58 UTC (permalink / raw)
  To: Borislav Petkov, hpa
  Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

Hey,

On 04/03/17 05:33 PM, Borislav Petkov wrote:
> On Sat, Mar 04, 2017 at 04:23:17PM -0800, hpa@zytor.com wrote:
>> What are the compilation flags? It may be that gcc still does TRT
>> depending on this call site. I'd check what gcc6 or 7 generates,
>> though.
> Hmm, I wish we were able to say, "let gcc decide for small sizes and let
> us do the patching for larger ones."

So, I've found that my kernel config had the OPTIMIZE_FOR_SIZE selected
instead of OPTIMIZE_FOR_PERFORMANCE. I'm not sure why that is but
switching to the latter option fixes my problem. A memcpy call is used
instead of the poor inline solution. (I'm not really sure how the inline
solution even makes any sense as it almost certainly makes things larger
in the grand scheme of things.)

It still might make sense to apply your patch asking gcc to never use
the broken memcpy but I'll leave that in your capable hands to decide.

Anyway, thanks for the help with this.

Logan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  0:56             ` hpa
@ 2017-03-05  9:50               ` Borislav Petkov
  2017-03-05 11:18                 ` Borislav Petkov
  2017-03-05 19:19                 ` Linus Torvalds
  0 siblings, 2 replies; 22+ messages in thread
From: Borislav Petkov @ 2017-03-05  9:50 UTC (permalink / raw)
  To: hpa
  Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck,
	Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List

On Sat, Mar 04, 2017 at 04:56:38PM -0800, hpa@zytor.com wrote:
> That's what the -march= and -mtune= option do!

How does that even help with a distro kernel built with -mtune=generic ?

gcc can't possibly know on what targets is that kernel going to be
booted on. So it probably does some universally optimal things, like in
the dmi_scan_machine() case:

	memcpy_fromio(buf, p, 32);

turns into:

        .loc 3 219 0
        movl    $8, %ecx        #, tmp79
        movq    %rax, %rsi      # p, p
        movq    %rsp, %rdi      #, tmp77
        rep movsl

Apparently it thinks it is fine to do 8*4-byte MOVS. But why not
4*8-byte MOVS?

That's half the loops.

 [ It is a whole different story what the machine actually does underneath. It
    being a half cacheline probably doesn't help and it really does the separate
    MOVs but then it would be cheaper if it did 4 8-byte ones. ]

One thing's for sure - both variants are certainly cheaper than to CALL
a memcpy variant.

What we probably should try to do, though, is simply patch in the body
of REP; MOVSQ or REP; MOVSB into the call sites and only have a call to
memcpy_orig() because that last one if fat.

I remember we did talk about it at some point but don't remember why we
didn't do it.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  4:58             ` Logan Gunthorpe
@ 2017-03-05  9:54               ` Borislav Petkov
  0 siblings, 0 replies; 22+ messages in thread
From: Borislav Petkov @ 2017-03-05  9:54 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: hpa, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Sat, Mar 04, 2017 at 09:58:14PM -0700, Logan Gunthorpe wrote:
> So, I've found that my kernel config had the OPTIMIZE_FOR_SIZE selected
> instead of OPTIMIZE_FOR_PERFORMANCE. I'm not sure why that is but
> switching to the latter option fixes my problem. A memcpy call is used
> instead of the poor inline solution. (I'm not really sure how the inline
> solution even makes any sense as it almost certainly makes things larger
> in the grand scheme of things.)

Probably some gcc heuristics don't work as expected...

In any case, I have

# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set

here and it still generates REP; MOVSL in dmi_scan_machine().

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  9:50               ` Borislav Petkov
@ 2017-03-05 11:18                 ` Borislav Petkov
  2017-03-05 13:05                   ` Borislav Petkov
  2017-03-05 19:19                 ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2017-03-05 11:18 UTC (permalink / raw)
  To: hpa
  Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck,
	Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On Sun, Mar 05, 2017 at 10:50:59AM +0100, Borislav Petkov wrote:
> On Sat, Mar 04, 2017 at 04:56:38PM -0800, hpa@zytor.com wrote:
> > That's what the -march= and -mtune= option do!
> 
> How does that even help with a distro kernel built with -mtune=generic ?
> 
> gcc can't possibly know on what targets is that kernel going to be
> booted on. So it probably does some universally optimal things, like in
> the dmi_scan_machine() case:
> 
> 	memcpy_fromio(buf, p, 32);
> 
> turns into:
> 
>         .loc 3 219 0
>         movl    $8, %ecx        #, tmp79
>         movq    %rax, %rsi      # p, p
>         movq    %rsp, %rdi      #, tmp77
>         rep movsl
> 
> Apparently it thinks it is fine to do 8*4-byte MOVS. But why not
> 4*8-byte MOVS?
> 
> That's half the loops.
> 
>  [ It is a whole different story what the machine actually does underneath. It
>     being a half cacheline probably doesn't help and it really does the separate
>     MOVs but then it would be cheaper if it did 4 8-byte ones. ]
> 
> One thing's for sure - both variants are certainly cheaper than to CALL
> a memcpy variant.
> 
> What we probably should try to do, though, is simply patch in the body
> of REP; MOVSQ or REP; MOVSB into the call sites and only have a call to
> memcpy_orig() because that last one if fat.
> 
> I remember we did talk about it at some point but don't remember why we
> didn't do it.

Right, and I believe it was Linus who mentioned we should simply patch
in the small REP_GOOD and ERMS memcpy body into the call sites. CCed.

Anyway, something like below.

It almost builds here - I still need to figure out that
arch/x86/boot/compressed/misc.c including decompression logic which does
memcpy and it turns into the alternative version which we don't want in
arch/x86/boot/compressed/.

Also, I need to check what vmlinuz size bloat we're talking: with the
diff below, we do add padding which looks like this:

      3d:       90                      nop
      3e:       90                      nop
      3f:       90                      nop
      40:       90                      nop
      41:       90                      nop
      42:       90                      nop
      43:       90                      nop
      44:       90                      nop
      45:       90                      nop
      46:       90                      nop
      47:       90                      nop
      48:       90                      nop
      49:       90                      nop
      4a:       90                      nop
      4b:       90                      nop
      4c:       90                      nop
      4d:       90                      nop
      4e:       90                      nop
      4f:       90                      nop

and that at every call site.

But at least we save ourselves the call overhead and use the optimal
memcpy variant on each machine.

---
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 2d449337a360..24711ca4033b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -143,9 +143,7 @@ endif
 export CONFIG_X86_X32_ABI
 
 # Don't unroll struct assignments with kmemcheck enabled
-ifeq ($(CONFIG_KMEMCHECK),y)
-	KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
-endif
+KBUILD_CFLAGS += $(call cc-option,-fno-builtin)
 
 # Stackpointer is addressed different for 32 bit and 64 bit x86
 sp-$(CONFIG_X86_32) := esp
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 1b020381ab38..8216c2e610ae 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -204,6 +204,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
 	asm volatile (ALTERNATIVE(oldinstr, newinstr, feature)		\
 		: output : "i" (0), ## input)
 
+#define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2, feature2,	\
+			 output, input...) \
+	asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)	\
+			: output						\
+			: "i" (0), ## input)
+
 /* Like alternative_io, but for replacing a direct call with another one. */
 #define alternative_call(oldfunc, newfunc, feature, output, input...)	\
 	asm volatile (ALTERNATIVE("call %P[old]", "call %P[new]", feature) \
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index a164862d77e3..a529a12a35ed 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 #include <linux/jump_label.h>
+#include <asm/cpufeatures.h>
+#include <asm/alternative.h>
 
 /* Written 2002 by Andi Kleen */
 
@@ -28,8 +30,37 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
    function. */
 
 #define __HAVE_ARCH_MEMCPY 1
-extern void *memcpy(void *to, const void *from, size_t len);
-extern void *__memcpy(void *to, const void *from, size_t len);
+void *memcpy_orig(void *to, const void *from, size_t len);
+
+/*
+ * We build a call to memcpy_orig by default which replaced on
+ * the majority of x86 CPUs which set REP_GOOD. In addition, CPUs which
+ * have the enhanced REP MOVSB/STOSB feature (ERMS), change the REP_GOOD
+ * routine to a REP; MOVSB mem copy.
+ */
+static inline void *memcpy(void *to, const void *from, size_t len)
+{
+	void *ret;
+
+	alternative_io_2("call memcpy_orig\n\t",
+			 "movq %%rdi, %%rax\n\t"
+			 "movq %%rdx, %%rcx\n\t"
+			 "shrq $3, %%rcx\n\t"
+			 "andl $7, %%edx\n\t"
+			 "rep movsq\n\t"
+			 "movl %%edx, %%ecx\n\t"
+			 "rep movsb\n\t",
+			 X86_FEATURE_REP_GOOD,
+			 "movq %%rdi, %%rax\n\t"
+			 "movq %%rdx, %%rcx\n\t"
+			 "rep movsb\n\t",
+			 X86_FEATURE_ERMS,
+			 ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from), "=d" (len)),
+			 "1" (to), "2" (from), "3" (len)
+			 : "memory", "rcx");
+
+	return ret;
+}
 
 #ifndef CONFIG_KMEMCHECK
 #if (__GNUC__ == 4 && __GNUC_MINOR__ < 3) || __GNUC__ < 4
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 779782f58324..c3c8890b155c 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -6,13 +6,6 @@
 #include <asm/alternative-asm.h>
 #include <asm/export.h>
 
-/*
- * We build a jump to memcpy_orig by default which gets NOPped out on
- * the majority of x86 CPUs which set REP_GOOD. In addition, CPUs which
- * have the enhanced REP MOVSB/STOSB feature (ERMS), change those NOPs
- * to a jmp to memcpy_erms which does the REP; MOVSB mem copy.
- */
-
 .weak memcpy
 
 /*
@@ -26,36 +19,12 @@
  * Output:
  * rax original destination
  */
-ENTRY(__memcpy)
-ENTRY(memcpy)
-	ALTERNATIVE_2 "jmp memcpy_orig", "", X86_FEATURE_REP_GOOD, \
-		      "jmp memcpy_erms", X86_FEATURE_ERMS
-
-	movq %rdi, %rax
-	movq %rdx, %rcx
-	shrq $3, %rcx
-	andl $7, %edx
-	rep movsq
-	movl %edx, %ecx
-	rep movsb
-	ret
-ENDPROC(memcpy)
-ENDPROC(__memcpy)
-EXPORT_SYMBOL(memcpy)
-EXPORT_SYMBOL(__memcpy)
-
-/*
- * memcpy_erms() - enhanced fast string memcpy. This is faster and
- * simpler than memcpy. Use memcpy_erms when possible.
- */
-ENTRY(memcpy_erms)
-	movq %rdi, %rax
-	movq %rdx, %rcx
-	rep movsb
-	ret
-ENDPROC(memcpy_erms)
-
 ENTRY(memcpy_orig)
+	pushq %r8
+	pushq %r9
+	pushq %r10
+	pushq %r11
+
 	movq %rdi, %rax
 
 	cmpq $0x20, %rdx
@@ -179,8 +148,14 @@ ENTRY(memcpy_orig)
 	movb %cl, (%rdi)
 
 .Lend:
+	popq %r11
+	popq %r10
+	popq %r9
+	popq %r8
+
 	retq
 ENDPROC(memcpy_orig)
+EXPORT_SYMBOL_GPL(memcpy_orig)
 
 #ifndef CONFIG_UML
 /*

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05 11:18                 ` Borislav Petkov
@ 2017-03-05 13:05                   ` Borislav Petkov
  0 siblings, 0 replies; 22+ messages in thread
From: Borislav Petkov @ 2017-03-05 13:05 UTC (permalink / raw)
  To: hpa
  Cc: Logan Gunthorpe, Thomas Gleixner, Ingo Molnar, Tony Luck,
	Al Viro, the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On Sun, Mar 05, 2017 at 12:18:23PM +0100, Borislav Petkov wrote:
> Also, I need to check what vmlinuz size bloat we're talking: with the
> diff below, we do add padding which looks like this:

Yeah, even a tailored config adds ~67K:

   text    data     bss     dec     hex filename
7567290 4040894 2560000 14168184         d83078 vmlinux
7635332 4041694 2560000 14237026         d93d62 vmlinux.memcpy

I'm assuming allmodconfig would become even bigger.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05  9:50               ` Borislav Petkov
  2017-03-05 11:18                 ` Borislav Petkov
@ 2017-03-05 19:19                 ` Linus Torvalds
  2017-03-05 19:54                   ` Borislav Petkov
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2017-03-05 19:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Anvin, Logan Gunthorpe, Thomas Gleixner, Ingo Molnar,
	Tony Luck, Al Viro, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Sun, Mar 5, 2017 at 1:50 AM, Borislav Petkov <bp@suse.de> wrote:
>
> gcc can't possibly know on what targets is that kernel going to be
> booted on. So it probably does some universally optimal things, like in
> the dmi_scan_machine() case:
>
>         memcpy_fromio(buf, p, 32);
>
> turns into:
>
>         .loc 3 219 0
>         movl    $8, %ecx        #, tmp79
>         movq    %rax, %rsi      # p, p
>         movq    %rsp, %rdi      #, tmp77
>         rep movsl
>
> Apparently it thinks it is fine to do 8*4-byte MOVS. But why not
> 4*8-byte MOVS?

Actually, the "fromio/toio" code should never use regular memcpy().
There used to be devices that literally broke on 64-bit accesses due
to broken PCI crud.

We seem to have broken this *really* long ago, though. On x86-64 we
used to have a special __inline_memcpy() that copies our historical
32-bit thing, and was used for memcpy_fromio() and memcpy_toio(). That
was then undone by commit 6175ddf06b61 ("x86: Clean up mem*io
functions")

That commit says

   "Iomem has no special significance on x86"

but that's not strictly true. iomem is in the same address space and
uses the same access instructions as regular memory, but iomem _is_
special.

And I think it's a bug that we use "memcpy()" on it. Not because of
any gcc issues, but simply because our own memcpy() optimizations are
not appropriate for iomem.

For example, "rep movsb" really is the right thing to use on normal
memory on modern CPU's.

But it is *not* the right thing to use on IO memory, because the CPU
only does the magic cacheline access optimizations on cacheable
memory!

So I think we should re-introduce that old "__inline_memcpy()" as that
special "safe memcpy" thing. Not just for KMEMCHECK, and not just for
64-bit.

Hmm?

                   Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05 19:19                 ` Linus Torvalds
@ 2017-03-05 19:54                   ` Borislav Petkov
  2017-03-05 20:19                     ` Linus Torvalds
  2017-03-06  7:01                     ` Logan Gunthorpe
  0 siblings, 2 replies; 22+ messages in thread
From: Borislav Petkov @ 2017-03-05 19:54 UTC (permalink / raw)
  To: Linus Torvalds, Logan Gunthorpe
  Cc: Peter Anvin, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Mar 05, 2017 at 11:19:42AM -0800, Linus Torvalds wrote:
> Actually, the "fromio/toio" code should never use regular memcpy().
> There used to be devices that literally broke on 64-bit accesses due
> to broken PCI crud.
> 
> We seem to have broken this *really* long ago, though.

I wonder why nothing blew up or failed strangely by now...

> On x86-64 we used to have a special __inline_memcpy() that copies our
> historical

It is still there in arch/x86/include/asm/string_64.h.

The comment "Only used for special circumstances." over it is grossly
understating it.

> 32-bit thing, and was used for memcpy_fromio() and memcpy_toio(). That
> was then undone by commit 6175ddf06b61 ("x86: Clean up mem*io
> functions")
> 
> That commit says
> 
>    "Iomem has no special significance on x86"
> 
> but that's not strictly true. iomem is in the same address space and
> uses the same access instructions as regular memory, but iomem _is_
> special.
> 
> And I think it's a bug that we use "memcpy()" on it. Not because of
> any gcc issues, but simply because our own memcpy() optimizations are
> not appropriate for iomem.
> 
> For example, "rep movsb" really is the right thing to use on normal
> memory on modern CPU's.

So Logan's box is a SNB and it doesn't have the ERMS optimizations. Are
you saying, regardless, we should let gcc put REP; MOVSB for smaller
sizes?

Because gcc does generate a REP; MOVSB there when it puts its own
memcpy, see mail upthread. (Even though that is wrong to do on iomem.)

> But it is *not* the right thing to use on IO memory, because the CPU
> only does the magic cacheline access optimizations on cacheable
> memory!

Ah, yes.

> So I think we should re-introduce that old "__inline_memcpy()" as that
> special "safe memcpy" thing. Not just for KMEMCHECK, and not just for
> 64-bit.

Logan, wanna give that a try, see if it takes care of your issue?

Oh, and along with the revert we would need a big fat warning explaining
why we need that special memcpy for IO memory.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05 19:54                   ` Borislav Petkov
@ 2017-03-05 20:19                     ` Linus Torvalds
  2017-03-06  7:01                     ` Logan Gunthorpe
  1 sibling, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2017-03-05 20:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Logan Gunthorpe, Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Tony Luck, Al Viro, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Sun, Mar 5, 2017 at 11:54 AM, Borislav Petkov <bp@suse.de> wrote:
>>
>> We seem to have broken this *really* long ago, though.
>
> I wonder why nothing blew up or failed strangely by now...

The hardware that cared was pretty broken to begin with, and I think
it was mainly some really odd graphics cards.

And from memory, they had issues with 64-bit writes. We actually have
a slow 16-bit word at a time copy for exactly these kinds of issues:
scr_memcpyw() and friends.

I'd like to say that it was one of those shit server-only cards that
nobody sane would ever use (but "server hardware is validated and
better quality!"), but that might have been another issue.

>> For example, "rep movsb" really is the right thing to use on normal
>> memory on modern CPU's.
>
> So Logan's box is a SNB and it doesn't have the ERMS optimizations. Are
> you saying, regardless, we should let gcc put REP; MOVSB for smaller
> sizes?

I think gcc makes bad choices, but they are gcc';s choices to make.

I have up on gcc's "-Os" because the choices were so bad and not getting fixed.

.. and none of that has _anything_ to do with accesses to IO memory,
which is fundamentally different.


> Because gcc does generate a REP; MOVSB there when it puts its own
> memcpy, see mail upthread. (Even though that is wrong to do on iomem.)

No, *THAT* is not wrong to do on iomem. If we tell gcc that "memcpy()"
works on iomem, then gcc can damn well do whatever it wants.

"rep stosb" isn't wrong for memcpy(). Gcc may do stupid things with
it, but that's completely immaterial.

> Oh, and along with the revert we would need a big fat warning explaining
> why we need that special memcpy for IO memory.

Well, quite frankly, just a simple "IO memory is different from cached
memory" should be sufficient.

               Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-05 19:54                   ` Borislav Petkov
  2017-03-05 20:19                     ` Linus Torvalds
@ 2017-03-06  7:01                     ` Logan Gunthorpe
  2017-03-06  7:28                       ` H. Peter Anvin
  2017-03-06 13:33                       ` Borislav Petkov
  1 sibling, 2 replies; 22+ messages in thread
From: Logan Gunthorpe @ 2017-03-06  7:01 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds
  Cc: Peter Anvin, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List


On Sun, Mar 05, 2017 at 11:19:42AM -0800, Linus Torvalds wrote:
>> But it is *not* the right thing to use on IO memory, because the CPU
>> only does the magic cacheline access optimizations on cacheable
>> memory!

Yes, and actually this is where I started. I thought my memcpy was using
byte accesses on purpose and I needed to create a patch for a different
IO memcpy because obviously byte accesses over the PCI bus would be very
un-ideal. However, when I found my system wasn't intentionally using
that implementation that was no longer my focus.

So, I have no way to test this, but it sounds like any Ivy bridge system
using the ERMS version of memcpy could have the same slow PCI memcpy
performance I've been seeing (unless the microcode fixes it up?). So it
sounds like it would be a good idea to revert the change Linus is
talking about.

>> So I think we should re-introduce that old "__inline_memcpy()" as that
>> special "safe memcpy" thing. Not just for KMEMCHECK, and not just for
>> 64-bit.

On 05/03/17 12:54 PM, Borislav Petkov wrote:
> Logan, wanna give that a try, see if it takes care of your issue?

Well honestly my issue was solved by fixing my kernel config. I have no
idea why I had optimize for size in there in the first place.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-06  7:01                     ` Logan Gunthorpe
@ 2017-03-06  7:28                       ` H. Peter Anvin
  2017-03-06 17:12                         ` Logan Gunthorpe
  2017-03-06 13:33                       ` Borislav Petkov
  1 sibling, 1 reply; 22+ messages in thread
From: H. Peter Anvin @ 2017-03-06  7:28 UTC (permalink / raw)
  To: Logan Gunthorpe, Borislav Petkov, Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 03/05/17 23:01, Logan Gunthorpe wrote:
> 
> On 05/03/17 12:54 PM, Borislav Petkov wrote:
>> Logan, wanna give that a try, see if it takes care of your issue?
> 
> Well honestly my issue was solved by fixing my kernel config. I have no
> idea why I had optimize for size in there in the first place.
> 

Yes, to gcc "optimize for size" means exactly that... intended for cases
where saving storage (e.g. ROM) or code download time is paramount.

We have frequently asked them for an optimization option that is not
strictly byte-count-based but still tries to reduce cache footprint, but
it is a nontrivial project.

	-hpa

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-06  7:01                     ` Logan Gunthorpe
  2017-03-06  7:28                       ` H. Peter Anvin
@ 2017-03-06 13:33                       ` Borislav Petkov
  2017-03-06 13:41                         ` hpa
  1 sibling, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2017-03-06 13:33 UTC (permalink / raw)
  To: Logan Gunthorpe, Linus Torvalds
  Cc: Peter Anvin, Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote:
> Well honestly my issue was solved by fixing my kernel config. I have no
> idea why I had optimize for size in there in the first place.

I still think that we should address the iomem memcpy Linus mentioned.
So how about this partial revert. I've made 32-bit use the same special
__memcpy() version.

Hmmm?

---
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 7afb0e2f07f4..9e378a10796d 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void);
 #ifdef __KERNEL__
 
 #include <asm-generic/iomap.h>
+#include <asm/string.h>
 
 /*
  * Convert a virtual cached pointer to an uncached pointer
@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned char val, size_t count)
  * @src:		The (I/O memory) source for the data
  * @count:		The number of bytes to copy
  *
- * Copy a block of data from I/O memory.
+ * Copy a block of data from I/O memory. IO memory is different from
+ * cached memory so we use special memcpy version.
  */
 static inline void
 memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
 {
-	memcpy(dst, (const void __force *)src, count);
+	__inline_memcpy(dst, (const void __force *)src, count);
 }
 
 /**
@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
  * @src:		The (RAM) source for the data
  * @count:		The number of bytes to copy
  *
- * Copy a block of data to I/O memory.
+ * Copy a block of data to I/O memory. IO memory is different from
+ * cached memory so we use special memcpy version.
  */
 static inline void
 memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
 {
-	memcpy((void __force *)dst, src, count);
+	__inline_memcpy((void __force *)dst, src, count);
 }
 
 /*
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e8353ee5c..556fa4a975ff 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c);
 #define __HAVE_ARCH_STRLEN
 extern size_t strlen(const char *s);
 
+#define __inline_memcpy __memcpy
 static __always_inline void *__memcpy(void *to, const void *from, size_t n)
 {
 	int d0, d1, d2;

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-06 13:33                       ` Borislav Petkov
@ 2017-03-06 13:41                         ` hpa
  2017-03-06 14:00                           ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: hpa @ 2017-03-06 13:41 UTC (permalink / raw)
  To: Borislav Petkov, Logan Gunthorpe, Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On March 6, 2017 5:33:28 AM PST, Borislav Petkov <bp@suse.de> wrote:
>On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote:
>> Well honestly my issue was solved by fixing my kernel config. I have
>no
>> idea why I had optimize for size in there in the first place.
>
>I still think that we should address the iomem memcpy Linus mentioned.
>So how about this partial revert. I've made 32-bit use the same special
>__memcpy() version.
>
>Hmmm?
>
>---
>diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>index 7afb0e2f07f4..9e378a10796d 100644
>--- a/arch/x86/include/asm/io.h
>+++ b/arch/x86/include/asm/io.h
>@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void);
> #ifdef __KERNEL__
> 
> #include <asm-generic/iomap.h>
>+#include <asm/string.h>
> 
> /*
>  * Convert a virtual cached pointer to an uncached pointer
>@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned
>char val, size_t count)
>  * @src:		The (I/O memory) source for the data
>  * @count:		The number of bytes to copy
>  *
>- * Copy a block of data from I/O memory.
>+ * Copy a block of data from I/O memory. IO memory is different from
>+ * cached memory so we use special memcpy version.
>  */
> static inline void
>memcpy_fromio(void *dst, const volatile void __iomem *src, size_t
>count)
> {
>-	memcpy(dst, (const void __force *)src, count);
>+	__inline_memcpy(dst, (const void __force *)src, count);
> }
> 
> /**
>@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void
>__iomem *src, size_t count)
>  * @src:		The (RAM) source for the data
>  * @count:		The number of bytes to copy
>  *
>- * Copy a block of data to I/O memory.
>+ * Copy a block of data to I/O memory. IO memory is different from
>+ * cached memory so we use special memcpy version.
>  */
> static inline void
> memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
> {
>-	memcpy((void __force *)dst, src, count);
>+	__inline_memcpy((void __force *)dst, src, count);
> }
> 
> /*
>diff --git a/arch/x86/include/asm/string_32.h
>b/arch/x86/include/asm/string_32.h
>index 3d3e8353ee5c..556fa4a975ff 100644
>--- a/arch/x86/include/asm/string_32.h
>+++ b/arch/x86/include/asm/string_32.h
>@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c);
> #define __HAVE_ARCH_STRLEN
> extern size_t strlen(const char *s);
> 
>+#define __inline_memcpy __memcpy
>static __always_inline void *__memcpy(void *to, const void *from,
>size_t n)
> {
> 	int d0, d1, d2;

It isn't really that straightforward IMO.

For UC memory transaction size really needs to be specified explicitly at all times and should be part of the API, rather than implicit.

For WC/WT/WB device memory, the ordinary memcpy is valid and preferred.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-06 13:41                         ` hpa
@ 2017-03-06 14:00                           ` Borislav Petkov
  0 siblings, 0 replies; 22+ messages in thread
From: Borislav Petkov @ 2017-03-06 14:00 UTC (permalink / raw)
  To: hpa
  Cc: Logan Gunthorpe, Linus Torvalds, Thomas Gleixner, Ingo Molnar,
	Tony Luck, Al Viro, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Mar 06, 2017 at 05:41:22AM -0800, hpa@zytor.com wrote:
> It isn't really that straightforward IMO.
>
> For UC memory transaction size really needs to be specified explicitly
> at all times and should be part of the API, rather than implicit.
>
> For WC/WT/WB device memory, the ordinary memcpy is valid and
> preferred.

I'm practically partially reverting

6175ddf06b61 ("x86: Clean up mem*io functions.")

Are you saying, this was wrong before too?

Maybe it was wrong, strictly speaking, but maybe that was good enough
for our purposes...

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-06  7:28                       ` H. Peter Anvin
@ 2017-03-06 17:12                         ` Logan Gunthorpe
  2017-03-06 18:53                           ` hpa
  0 siblings, 1 reply; 22+ messages in thread
From: Logan Gunthorpe @ 2017-03-06 17:12 UTC (permalink / raw)
  To: H. Peter Anvin, Borislav Petkov, Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List



On 06/03/17 12:28 AM, H. Peter Anvin wrote:
> On 03/05/17 23:01, Logan Gunthorpe wrote:
>>
>> On 05/03/17 12:54 PM, Borislav Petkov wrote:
>>> Logan, wanna give that a try, see if it takes care of your issue?
>>
>> Well honestly my issue was solved by fixing my kernel config. I have no
>> idea why I had optimize for size in there in the first place.
>>
> 
> Yes, to gcc "optimize for size" means exactly that... intended for cases
> where saving storage (e.g. ROM) or code download time is paramount.

I agree and understand, however placing a poorly performing _inline_
memcpy instead of a single call instruction to a performant memcopy
probably took more code space in the end. So like Linus, I just have to
scratch my head at the -Os optimization option.

Logan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Question Regarding ERMS memcpy
  2017-03-06 17:12                         ` Logan Gunthorpe
@ 2017-03-06 18:53                           ` hpa
  0 siblings, 0 replies; 22+ messages in thread
From: hpa @ 2017-03-06 18:53 UTC (permalink / raw)
  To: Logan Gunthorpe, Borislav Petkov, Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Tony Luck, Al Viro,
	the arch/x86 maintainers, Linux Kernel Mailing List

On March 6, 2017 9:12:41 AM PST, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>On 06/03/17 12:28 AM, H. Peter Anvin wrote:
>> On 03/05/17 23:01, Logan Gunthorpe wrote:
>>>
>>> On 05/03/17 12:54 PM, Borislav Petkov wrote:
>>>> Logan, wanna give that a try, see if it takes care of your issue?
>>>
>>> Well honestly my issue was solved by fixing my kernel config. I have
>no
>>> idea why I had optimize for size in there in the first place.
>>>
>> 
>> Yes, to gcc "optimize for size" means exactly that... intended for
>cases
>> where saving storage (e.g. ROM) or code download time is paramount.
>
>I agree and understand, however placing a poorly performing _inline_
>memcpy instead of a single call instruction to a performant memcopy
>probably took more code space in the end. So like Linus, I just have to
>scratch my head at the -Os optimization option.
>
>Logan

No, it will be smaller: -Os counts bytes.

If you think about it, there is no way that replacing a five-byte subroutine call with a two-byte instruction opcode can make it bigger!  The only other difference between the two from a size perspective is that the compiler doesn't have to worry about clobbered registers other than the argument registers.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-03-06 19:00 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-04 20:08 Question Regarding ERMS memcpy Logan Gunthorpe
2017-03-04 22:43 ` Borislav Petkov
     [not found]   ` <c675a6c8-e87a-ece2-2350-cdd079b1b610@deltatee.com>
2017-03-04 23:55     ` hpa
2017-03-05  0:14       ` Borislav Petkov
2017-03-05  0:23         ` hpa
2017-03-05  0:33           ` Borislav Petkov
2017-03-05  0:56             ` hpa
2017-03-05  9:50               ` Borislav Petkov
2017-03-05 11:18                 ` Borislav Petkov
2017-03-05 13:05                   ` Borislav Petkov
2017-03-05 19:19                 ` Linus Torvalds
2017-03-05 19:54                   ` Borislav Petkov
2017-03-05 20:19                     ` Linus Torvalds
2017-03-06  7:01                     ` Logan Gunthorpe
2017-03-06  7:28                       ` H. Peter Anvin
2017-03-06 17:12                         ` Logan Gunthorpe
2017-03-06 18:53                           ` hpa
2017-03-06 13:33                       ` Borislav Petkov
2017-03-06 13:41                         ` hpa
2017-03-06 14:00                           ` Borislav Petkov
2017-03-05  4:58             ` Logan Gunthorpe
2017-03-05  9:54               ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).