* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Jens Axboe @ 2018-11-20 20:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
Cc: the arch/x86 maintainers, linux-kernel

Forgot to CC the mailing list...

On 11/20/18 1:18 PM, Jens Axboe wrote:
> Hi,
>
> So this is a fun one... While I was doing the aio polled work, I noticed
> that the submitting process spent a substantial amount of time copying
> data to/from userspace. For aio, that's iocb and io_event, which are 64
> and 32 bytes respectively. Looking closer at this, it seems that
> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
> cost.
>
> I came up with this hack to test it out, and lo and behold, we now cut
> the time spent in copying in half. 50% less.
>
> Since these kinds of patches tend to lend themselves to bike shedding, I
> also ran a string of kernel compilations out of RAM. Results are as
> follows:
>
> Patched : 62.86s avg, stddev 0.65s
> Stock   : 63.73s avg, stddev 0.67s
>
> which would also seem to indicate that we're faster punting smaller
> (< 128 byte) copies.
>
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>
> Interestingly, text size is smaller with the patch as well?!
>
> I'm sure there are smarter ways to do this, but results look fairly
> conclusive. FWIW, the behavioral change was introduced by:
>
> commit 954e482bde20b0e208fd4d34ef26e10afd194600
> Author: Fenghua Yu <fenghua.yu@intel.com>
> Date:   Thu May 24 18:19:45 2012 -0700
>
>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>
> which contains nothing in terms of benchmarking or results, just claims
> that the new hotness is better.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index a9d637bc301d..7dbb78827e64 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>  	unsigned ret;
>  
> +	/*
> +	 * For smaller copies, don't use ERMS as it's slower.
> +	 */
> +	if (len < 128) {
> +		alternative_call(copy_user_generic_unrolled,
> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
> +					    "=d" (len)),
> +				"1" (to), "2" (from), "3" (len)
> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
> +		return ret;
> +	}
> +
>  	/*
>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>  	 * Otherwise, use copy_user_generic_unrolled.
>  	 */
>  	alternative_call_2(copy_user_generic_unrolled,
> -			copy_user_generic_string,
> -			X86_FEATURE_REP_GOOD,
> -			copy_user_enhanced_fast_string,
> -			X86_FEATURE_ERMS,
> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>  				    "=d" (len)),
>  			"1" (to), "2" (from), "3" (len)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Ingo Molnar @ 2018-11-21 6:36 UTC (permalink / raw)
To: Jens Axboe
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst,
    linux-kernel

[ Cc:-ed a few other gents and lkml. ]

* Jens Axboe <axboe@kernel.dk> wrote:

> Hi,
>
> So this is a fun one... While I was doing the aio polled work, I noticed
> that the submitting process spent a substantial amount of time copying
> data to/from userspace. For aio, that's iocb and io_event, which are 64
> and 32 bytes respectively. Looking closer at this, it seems that
> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
> cost.
>
> I came up with this hack to test it out, and lo and behold, we now cut
> the time spent in copying in half. 50% less.
>
> Since these kinds of patches tend to lend themselves to bike shedding, I
> also ran a string of kernel compilations out of RAM. Results are as
> follows:
>
> Patched : 62.86s avg, stddev 0.65s
> Stock   : 63.73s avg, stddev 0.67s
>
> which would also seem to indicate that we're faster punting smaller
> (< 128 byte) copies.
>
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>
> Interestingly, text size is smaller with the patch as well?!
>
> I'm sure there are smarter ways to do this, but results look fairly
> conclusive. FWIW, the behavioral change was introduced by:
>
> commit 954e482bde20b0e208fd4d34ef26e10afd194600
> Author: Fenghua Yu <fenghua.yu@intel.com>
> Date:   Thu May 24 18:19:45 2012 -0700
>
>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>
> which contains nothing in terms of benchmarking or results, just claims
> that the new hotness is better.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index a9d637bc301d..7dbb78827e64 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>  	unsigned ret;
>  
> +	/*
> +	 * For smaller copies, don't use ERMS as it's slower.
> +	 */
> +	if (len < 128) {
> +		alternative_call(copy_user_generic_unrolled,
> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
> +					    "=d" (len)),
> +				"1" (to), "2" (from), "3" (len)
> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
> +		return ret;
> +	}
> +
>  	/*
>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>  	 * Otherwise, use copy_user_generic_unrolled.
>  	 */
>  	alternative_call_2(copy_user_generic_unrolled,
> -			copy_user_generic_string,
> -			X86_FEATURE_REP_GOOD,
> -			copy_user_enhanced_fast_string,
> -			X86_FEATURE_ERMS,
> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>  				    "=d" (len)),
>  			"1" (to), "2" (from), "3" (len)

So I'm inclined to do something like yours, because clearly the changelog
of 954e482bde20 was at least partly false: Intel can say whatever they
want, it's a fact that ERMS has high setup costs for low buffer sizes -
ERMS is optimized for large-size, cache-aligned copies mainly.

But the result is counter-intuitive in terms of kernel text footprint,
plus the '128' is pretty arbitrary - we should at least try to come up
with a break-even point where the manual copy is about as fast as ERMS - on
at least a single CPU ...

Thanks,

	Ingo
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Jens Axboe @ 2018-11-21 13:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst,
    linux-kernel, pabeni

On 11/20/18 11:36 PM, Ingo Molnar wrote:
>
> [ Cc:-ed a few other gents and lkml. ]
>
> * Jens Axboe <axboe@kernel.dk> wrote:
>
>> Hi,
>>
>> So this is a fun one... While I was doing the aio polled work, I noticed
>> that the submitting process spent a substantial amount of time copying
>> data to/from userspace. For aio, that's iocb and io_event, which are 64
>> and 32 bytes respectively. Looking closer at this, it seems that
>> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
>> cost.
>>
>> I came up with this hack to test it out, and lo and behold, we now cut
>> the time spent in copying in half. 50% less.
>>
>> Since these kinds of patches tend to lend themselves to bike shedding, I
>> also ran a string of kernel compilations out of RAM. Results are as
>> follows:
>>
>> Patched : 62.86s avg, stddev 0.65s
>> Stock   : 63.73s avg, stddev 0.67s
>>
>> which would also seem to indicate that we're faster punting smaller
>> (< 128 byte) copies.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>>
>> Interestingly, text size is smaller with the patch as well?!
>>
>> I'm sure there are smarter ways to do this, but results look fairly
>> conclusive. FWIW, the behavioral change was introduced by:
>>
>> commit 954e482bde20b0e208fd4d34ef26e10afd194600
>> Author: Fenghua Yu <fenghua.yu@intel.com>
>> Date:   Thu May 24 18:19:45 2012 -0700
>>
>>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>>
>> which contains nothing in terms of benchmarking or results, just claims
>> that the new hotness is better.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>
>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>> index a9d637bc301d..7dbb78827e64 100644
>> --- a/arch/x86/include/asm/uaccess_64.h
>> +++ b/arch/x86/include/asm/uaccess_64.h
>> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>>  {
>>  	unsigned ret;
>>  
>> +	/*
>> +	 * For smaller copies, don't use ERMS as it's slower.
>> +	 */
>> +	if (len < 128) {
>> +		alternative_call(copy_user_generic_unrolled,
>> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>> +					    "=d" (len)),
>> +				"1" (to), "2" (from), "3" (len)
>> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
>> +		return ret;
>> +	}
>> +
>>  	/*
>>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>>  	 * Otherwise, use copy_user_generic_unrolled.
>>  	 */
>>  	alternative_call_2(copy_user_generic_unrolled,
>> -			copy_user_generic_string,
>> -			X86_FEATURE_REP_GOOD,
>> -			copy_user_enhanced_fast_string,
>> -			X86_FEATURE_ERMS,
>> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>  				    "=d" (len)),
>>  			"1" (to), "2" (from), "3" (len)
>
> So I'm inclined to do something like yours, because clearly the changelog
> of 954e482bde20 was at least partly false: Intel can say whatever they
> want, it's a fact that ERMS has high setup costs for low buffer sizes -
> ERMS is optimized for large-size, cache-aligned copies mainly.

I'm actually surprised that something like that was accepted, I guess
2012 was a simpler time :-)

> But the result is counter-intuitive in terms of kernel text footprint,
> plus the '128' is pretty arbitrary - we should at least try to come up
> with a break-even point where the manual copy is about as fast as ERMS - on
> at least a single CPU ...

I did some more investigation yesterday, and found this:

commit 236222d39347e0e486010f10c1493e83dbbdfba8
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Thu Jun 29 15:55:58 2017 +0200

    x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings

which does attempt to rectify it, but only using ERMS for >= 64 byte copies.
At least for me, it looks like the break-even point is higher than that, which
would mean that something like the below would be more appropriate. Adding
Paolo, in case he actually wrote a test app for this. In terms of test apps,
I'm always wary of using microbenchmarking only for this type of thing; real
world testing (if stable enough) is much more useful.

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
 	ASM_STAC
-	cmpl $64,%edx
-	jb .L_copy_short_string	/* less then 64 bytes, avoid the costly 'rep' */
+	cmpl $128,%edx
+	jb .L_copy_short_string	/* less than 128 bytes, avoid costly 'rep' */
 	movl %edx,%ecx
 1:	rep
 	movsb

-- 
Jens Axboe
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Denys Vlasenko @ 2018-11-21 13:44 UTC (permalink / raw)
To: Jens Axboe, Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Brian Gerst, linux-kernel, pabeni

On 11/21/2018 02:32 PM, Jens Axboe wrote:
> On 11/20/18 11:36 PM, Ingo Molnar wrote:
>> * Jens Axboe <axboe@kernel.dk> wrote:
>>> So this is a fun one... While I was doing the aio polled work, I noticed
>>> that the submitting process spent a substantial amount of time copying
>>> data to/from userspace. For aio, that's iocb and io_event, which are 64
>>> and 32 bytes respectively. Looking closer at this, it seems that
>>> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
>>> cost.
>>>
>>> I came up with this hack to test it out, and lo and behold, we now cut
>>> the time spent in copying in half. 50% less.
>>>
>>> Since these kinds of patches tend to lend themselves to bike shedding, I
>>> also ran a string of kernel compilations out of RAM. Results are as
>>> follows:
>>>
>>> Patched : 62.86s avg, stddev 0.65s
>>> Stock   : 63.73s avg, stddev 0.67s
>>>
>>> which would also seem to indicate that we're faster punting smaller
>>> (< 128 byte) copies.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>>>
>>> Interestingly, text size is smaller with the patch as well?!
>>>
>>> I'm sure there are smarter ways to do this, but results look fairly
>>> conclusive. FWIW, the behavioral change was introduced by:
>>>
>>> commit 954e482bde20b0e208fd4d34ef26e10afd194600
>>> Author: Fenghua Yu <fenghua.yu@intel.com>
>>> Date:   Thu May 24 18:19:45 2012 -0700
>>>
>>>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>>>
>>> which contains nothing in terms of benchmarking or results, just claims
>>> that the new hotness is better.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>
>>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>>> index a9d637bc301d..7dbb78827e64 100644
>>> --- a/arch/x86/include/asm/uaccess_64.h
>>> +++ b/arch/x86/include/asm/uaccess_64.h
>>> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>>>  {
>>>  	unsigned ret;
>>>  
>>> +	/*
>>> +	 * For smaller copies, don't use ERMS as it's slower.
>>> +	 */
>>> +	if (len < 128) {
>>> +		alternative_call(copy_user_generic_unrolled,
>>> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
>>> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>> +					    "=d" (len)),
>>> +				"1" (to), "2" (from), "3" (len)
>>> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
>>> +		return ret;
>>> +	}
>>> +
>>>  	/*
>>>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>>>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>>>  	 * Otherwise, use copy_user_generic_unrolled.
>>>  	 */
>>>  	alternative_call_2(copy_user_generic_unrolled,
>>> -			copy_user_generic_string,
>>> -			X86_FEATURE_REP_GOOD,
>>> -			copy_user_enhanced_fast_string,
>>> -			X86_FEATURE_ERMS,
>>> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
>>> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>>>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>>  				    "=d" (len)),
>>>  			"1" (to), "2" (from), "3" (len)
>>
>> So I'm inclined to do something like yours, because clearly the changelog
>> of 954e482bde20 was at least partly false: Intel can say whatever they
>> want, it's a fact that ERMS has high setup costs for low buffer sizes -
>> ERMS is optimized for large-size, cache-aligned copies mainly.
>
> I'm actually surprised that something like that was accepted, I guess
> 2012 was a simpler time :-)
>
>> But the result is counter-intuitive in terms of kernel text footprint,
>> plus the '128' is pretty arbitrary - we should at least try to come up
>> with a break-even point where the manual copy is about as fast as ERMS - on
>> at least a single CPU ...
>
> I did some more investigation yesterday, and found this:
>
> commit 236222d39347e0e486010f10c1493e83dbbdfba8
> Author: Paolo Abeni <pabeni@redhat.com>
> Date:   Thu Jun 29 15:55:58 2017 +0200
>
>     x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
>
> which does attempt to rectify it, but only using ERMS for >= 64 byte copies.
> At least for me, it looks like the break-even point is higher than that, which
> would mean that something like the below would be more appropriate.

I also tested this while working on string ops code in musl.

I think at least 128 bytes would be the minimum where "REP insn"
are more efficient. In my testing, it's more like 256 bytes...
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight @ 2018-11-22 17:36 UTC (permalink / raw)
To: 'Denys Vlasenko', Jens Axboe, Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Brian Gerst, linux-kernel, pabeni

From: Denys Vlasenko
> Sent: 21 November 2018 13:44
...
> I also tested this while working on string ops code in musl.
>
> I think at least 128 bytes would be the minimum where "REP insn"
> are more efficient. In my testing, it's more like 256 bytes...

What happens for misaligned copies?
I had a feeling that the ERMS 'rep movsb' code used some kind of
barrel shifter in that case.

The other problem with the ERMS copy is that it gets used for
copy_to/from_io() - and the 'rep movsb' on uncached locations
has to do byte copies.
Byte reads on PCIe are really horrid.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds @ 2018-11-22 17:52 UTC (permalink / raw)
To: David.Laight
Cc: dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar,
    bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Andrew Lutomirski, Peter Zijlstra, brgerst,
    Linux List Kernel Mailing, pabeni

On Thu, Nov 22, 2018 at 9:36 AM David Laight <David.Laight@aculab.com> wrote:
>
> The other problem with the ERMS copy is that it gets used
> for copy_to/from_io() - and the 'rep movsb' on uncached
> locations has to do byte copies.

Ugh. I thought we changed that *long* ago, because even our non-ERMS
copy is broken for PCI (it does overlapping stores for the small tail
cases).

But looking at "memcpy_{from,to}io()", I don't see x86 overriding it
with anything better.

I suspect nobody uses those functions for anything critical any more.
The fbcon people have their own copy functions, iirc.

But we definitely should fix this. *NONE* of the regular memcpy
functions actually work right for PCI space any more, and haven't for
a long time.

	Linus
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Andy Lutomirski @ 2018-11-22 18:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Laight, Denys Vlasenko, Jens Axboe, Ingo Molnar,
    Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    X86 ML, Andrew Morton, Andrew Lutomirski, Peter Zijlstra,
    Brian Gerst, LKML, pabeni

On Thu, Nov 22, 2018 at 9:53 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Nov 22, 2018 at 9:36 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > The other problem with the ERMS copy is that it gets used
> > for copy_to/from_io() - and the 'rep movsb' on uncached
> > locations has to do byte copies.
>
> Ugh. I thought we changed that *long* ago, because even our non-ERMS
> copy is broken for PCI (it does overlapping stores for the small tail
> cases).
>
> But looking at "memcpy_{from,to}io()", I don't see x86 overriding it
> with anything better.
>
> I suspect nobody uses those functions for anything critical any more.
> The fbcon people have their own copy functions, iirc.
>
> But we definitely should fix this. *NONE* of the regular memcpy
> functions actually work right for PCI space any more, and haven't for
> a long time.

I'm not personally volunteering, but I suspect we can do much better
than we do now:

 - The new MOVDIRI and MOVDIR64B instructions can do big writes to WC
and UC memory. I assume those would be safe to use in ...toio()
functions, unless there are quirky devices out there that blow up if
their MMIO space is written in 64-byte chunks.

 - MOVNTDQA can, I think, do 64-byte loads, but only from WC memory.
For sufficiently large copies, it could plausibly be faster to create
a WC alias and use MOVNTDQA than it is to copy in 8- or 16-byte
chunks.

The i915 driver has a copy implementation using MOVNTDQA -- maybe this
should get promoted to something in arch/x86 called memcpy_from_wc().

--Andy
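[Editorial note: the MOVNTDQA idea Andy describes can be sketched in userspace with SSE4.1 intrinsics. This is an assumption-heavy illustration, not the i915 helper: it runs on ordinary cacheable memory (the instruction permits that; the WC-alias setup is omitted), the function names are invented, and a real `memcpy_from_wc()` would need fencing and tail handling.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <immintrin.h>

/* 16-byte MOVNTDQA loads plus ordinary stores. Assumes `from` is
 * 16-byte aligned and len is a multiple of 16; on real WC memory
 * the streaming load is what avoids the byte-at-a-time behavior. */
__attribute__((target("sse4.1")))
static void copy_stream_16(void *to, const void *from, size_t len)
{
        char *d = to;
        const char *s = from;

        for (; len >= 16; len -= 16, s += 16, d += 16) {
                __m128i v = _mm_stream_load_si128((__m128i *)(uintptr_t)s);
                _mm_storeu_si128((__m128i *)d, v);
        }
}
#endif

/* Dispatcher: take the streaming path only when the CPU supports
 * SSE4.1 and the alignment/length preconditions hold; otherwise
 * fall back to memcpy(). */
static void copy_streaming(void *to, const void *from, size_t len)
{
#if defined(__x86_64__)
        if (__builtin_cpu_supports("sse4.1") &&
            ((uintptr_t)from & 15) == 0 && (len & 15) == 0) {
                copy_stream_16(to, from, len);
                return;
        }
#endif
        memcpy(to, from, len);
}
```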
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds @ 2018-11-22 18:58 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: David.Laight, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner,
    Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers,
    Andrew Morton, Peter Zijlstra, brgerst,
    Linux List Kernel Mailing, pabeni

On Thu, Nov 22, 2018 at 10:07 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> I'm not personally volunteering, but I suspect we can do much better
> than we do now:
>
>  - The new MOVDIRI and MOVDIR64B instructions can do big writes to WC
> and UC memory.
>
>  - MOVNTDQA can, I think, do 64-byte loads, but only from WC memory.

No, performance isn't the _primary_ issue. Nobody uses MMIO and
expects high performance from the generic functions (but people may
then tweak individual drivers to do tricks).

And we've historically had various broken hardware that cares deeply
about access size. Trying to be clever and do big accesses could
easily break something.

The fact that nobody has complained about the generic memcpy routines
probably means that the broken hardware isn't in use any more, or it
just works anyway. And nobody has complained about performance either,
so it's clearly not a huge issue.

"rep movs" probably works ok on WC memory writes anyway, it's the UC
case that is bad, but I don't think anybody uses UC and then does the
"memcpy_to/fromio()" things. If you have UC memory, you tend to do the
accesses properly.

So I suspect we should just write memcpy_{to,from}io() in terms of
writel/readl.

Oh, and I just noticed that on x86 we expressly use our old "safe and
sane" functions: see __inline_memcpy(), and its use in
__memcpy_{from,to}io().

So the "falls back to memcpy" was always a red herring. We don't
actually do that.

Which explains why things work.

	Linus
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight @ 2018-11-23 9:34 UTC (permalink / raw)
To: 'Linus Torvalds', Andrew Lutomirski
Cc: dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar,
    bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni

From: Linus Torvalds
> Sent: 22 November 2018 18:58
...
> Oh, and I just noticed that on x86 we expressly use our old "safe and
> sane" functions: see __inline_memcpy(), and its use in
> __memcpy_{from,to}io().
>
> So the "falls back to memcpy" was always a red herring. We don't
> actually do that.
>
> Which explains why things work.

It doesn't explain why I've seen single byte PCIe TLP generated
by memcpy_to/fromio().

I've had to change code to use readq/writeq loops because the
byte accesses are so slow - even when PIO performance should
be 'good enough'.

It might have been changed since last time I tested it.
But I don't remember seeing a commit go by.

	David
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight @ 2018-11-23 10:12 UTC (permalink / raw)
To: David Laight, 'Linus Torvalds', 'Andrew Lutomirski'
Cc: 'dvlasenk@redhat.com', 'Jens Axboe', 'Ingo Molnar',
    'Thomas Gleixner', 'Ingo Molnar', 'bp@alien8.de', 'Peter Anvin',
    'the arch/x86 maintainers', 'Andrew Morton', 'Peter Zijlstra',
    'brgerst@gmail.com', 'Linux List Kernel Mailing',
    'pabeni@redhat.com'

From: David Laight
> Sent: 23 November 2018 09:35
> From: Linus Torvalds
> > Sent: 22 November 2018 18:58
> ...
> > Oh, and I just noticed that on x86 we expressly use our old "safe and
> > sane" functions: see __inline_memcpy(), and its use in
> > __memcpy_{from,to}io().
> >
> > So the "falls back to memcpy" was always a red herring. We don't
> > actually do that.
> >
> > Which explains why things work.
>
> It doesn't explain why I've seen single byte PCIe TLP generated
> by memcpy_to/fromio().
>
> I've had to change code to use readq/writeq loops because the
> byte accesses are so slow - even when PIO performance should
> be 'good enough'.
>
> It might have been changed since last time I tested it.
> But I don't remember seeing a commit go by.

I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
generates a lot of single byte TLP.

What the code normally does is 64bit aligned PCIe reads with multiple
writes and shifts to avoid writing beyond the end of the kernel buffer
for 'odd' length transfers.

Most of our PIO copies are actually direct to/from userspace.
While copy_to/from_user() will work on PCIe memory, it is 'rep movsb'.

We also mmap() the PCIe space into process memory - and have to be
careful not to use memcpy() in userspace.
On suitable systems userspace can use the AVX256 instructions to get
wide reads.
Much harder and more expensive in the kernel.

In practise most of the bulk data transfers are requested by the PCIe
slave. But there are times when PIO ones are needed, and 64 bit
transfers are 8 times faster than 8 bit ones.
This is all made more significant because it takes our fpga about
500ns to complete a single word PCIe read.

	David
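[Editorial note: David's workaround (aligned 64-bit reads, with the tail write trimmed so the kernel buffer is never overrun) can be sketched in plain C. This is a userspace model under stated assumptions: `model_readq` is a hypothetical stand-in for the kernel's `readq()`, plain volatile loads substitute for real MMIO, and the sketch assumes the device window tolerates a full 8-byte read at the tail.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for readq() on a memory-mapped window: one 64-bit
 * volatile load. On real hardware this is what becomes a single
 * wide PCIe read TLP instead of eight byte reads. */
static uint64_t model_readq(const volatile void *addr)
{
        return *(const volatile uint64_t *)addr;
}

/* Copy `len` bytes from an 8-byte-aligned "device" window using only
 * 64-bit reads. The final read is still full width (assumed safe on
 * the device side); only the store into the kernel-side buffer is
 * trimmed, so we never write past `to + len`. */
static void copy_fromio_q(void *to, const volatile void *from, size_t len)
{
        unsigned char *d = to;
        const volatile unsigned char *s = from;

        while (len >= 8) {
                uint64_t v = model_readq(s);
                memcpy(d, &v, 8);       /* ordinary memory store */
                d += 8;
                s += 8;
                len -= 8;
        }
        if (len) {
                uint64_t v = model_readq(s);
                memcpy(d, &v, len);     /* keep only the low `len` bytes */
        }
}
```

The same shape with `writeq()` covers the outbound direction; the point is that access width is chosen explicitly rather than left to `rep movsb`.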
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds @ 2018-11-23 16:36 UTC (permalink / raw)
To: David.Laight
Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar,
    Thomas Gleixner, Ingo Molnar, bp, Peter Anvin,
    the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst,
    Linux List Kernel Mailing, pabeni

On Fri, Nov 23, 2018 at 2:12 AM David Laight <David.Laight@aculab.com> wrote:
>
> I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
> Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
> generates a lot of single byte TLP.

I just tested it too - it turns out that the __inline_memcpy() code
never triggers, and "memcpy_toio()" just generates a memcpy. So that
code seems entirely dead.

And, in fact, the codebase I looked at was the historical one, because
I had been going back and looking at the history. The modern tree
*does* have the "__inline_memcpy()" function I pointed at, but it's
not actually hooked up to anything!

This actually has been broken for _ages_. The breakage goes back to
2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and
it seems nobody really ever noticed - or thought that it was ok.

That commit claims that iomem has no special significance on x86, but
that really really isn't true, exactly because the access size does
matter. And as mentioned, the generic memory copy routines are not at
all appropriate, and that has nothing to do with ERMS. Our "do it by
hand" memory copy routine does things like this:

.Lless_16bytes:
        cmpl $8,        %edx
        jb   .Lless_8bytes
        /*
         * Move data from 8 bytes to 15 bytes.
         */
        movq 0*8(%rsi),         %r8
        movq -1*8(%rsi, %rdx),  %r9
        movq %r8,       0*8(%rdi)
        movq %r9,       -1*8(%rdi, %rdx)
        retq

and note how for an 8-byte copy, it will do *two* reads of the same 8
bytes, and *two* writes of the same 8 byte destination. That's
perfectly ok for regular memory, and it means that the code can handle
an arbitrary 8-15 byte copy without any conditionals or loop counts,
but it is *not* ok for iomem.

Of course, in practice it all just happens to work in almost all
situations (because a lot of iomem devices simply won't care), and
manual access to iomem is basically extremely rare these days anyway,
but it's definitely entirely and utterly broken.

End result: we *used* to do this right. For the last eight years our
"memcpy_{to,from}io()" has been entirely broken, and apparently even
the people who noticed oddities like David, never reported it as
breakage but instead just worked around it in drivers.

Ho humm.

Let me write a generic routine in lib/iomap_copy.c (which already does
the "user specifies chunk size" cases), and hook it up for x86.

David, are you using a bus analyzer or something to do your testing?
I'll have a trial patch for you asap.

	Linus
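[Editorial note: the overlapping-store trick in the `.Lless_16bytes` sequence can be written in C, which makes Linus's point concrete: it is correct for normal memory but reads and rewrites some of the same bytes twice, which is exactly what iomem cannot tolerate. The function name is invented; the `memcpy(&tmp, p, 8)` idiom is just the portable C spelling of a single 8-byte load/store.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy 8..15 bytes with exactly two 8-byte loads and two 8-byte
 * stores, no loop and no branch on the exact length: when len < 16
 * the second pair overlaps the first, so len - 8 bytes in the middle
 * are read and written twice. */
static void copy_8_to_15(void *to, const void *from, size_t len)
{
        uint64_t head, tail;

        memcpy(&head, from, 8);                          /* movq 0*8(%rsi), %r8      */
        memcpy(&tail, (const char *)from + len - 8, 8);  /* movq -1*8(%rsi,%rdx), %r9 */
        memcpy(to, &head, 8);                            /* movq %r8, 0*8(%rdi)      */
        memcpy((char *)to + len - 8, &tail, 8);          /* movq %r9, -1*8(%rdi,%rdx) */
}
```

On a device with read or write side effects, the doubled accesses (and the misaligned second pair) are visible on the bus, which is why this pattern is fine for RAM and wrong for MMIO.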
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 16:36 ` Linus Torvalds @ 2018-11-23 17:42 ` Linus Torvalds 2018-11-23 18:39 ` Andy Lutomirski 2018-11-26 10:01 ` David Laight 2018-11-26 10:26 ` David Laight 2 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-23 17:42 UTC (permalink / raw) To: David.Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni [-- Attachment #1: Type: text/plain, Size: 1291 bytes --] On Fri, Nov 23, 2018 at 8:36 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Let me write a generic routine in lib/iomap_copy.c (which already does > the "user specifies chunk size" cases), and hook it up for x86. Something like this? ENTIRELY UNTESTED! It might not compile. Seriously. And if it does compile, it might not work. And this doesn't actually do the memset_io() function at all, just the memcpy ones. Finally, it's worth noting that on x86, we have this: /* * override generic version in lib/iomap_copy.c */ ENTRY(__iowrite32_copy) movl %edx,%ecx rep movsd ret ENDPROC(__iowrite32_copy) because back in 2006, we did this: [PATCH] Add faster __iowrite32_copy routine for x86_64 This assembly version is measurably faster than the generic version in lib/iomap_copy.c. which actually implies that "rep movsd" is faster than doing __raw_writel() by hand. So it is possible that this should all be arch-specific code rather than that butt-ugly "generic" code I wrote in this patch. End result: I'm not really all that happy about this patch, but it's perhaps worth testing, and it's definitely worth discussing. Because our current memcpy_{to,from}io() is truly broken garbage. 
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 5129 bytes --]

 arch/x86/include/asm/io.h |   6 ++
 include/linux/io.h        |   2 +
 lib/iomap_copy.c          | 153 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 161 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 832da8229cc7..3b9206ee25b8 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -92,6 +92,12 @@ build_mmio_write(__writel, "l", unsigned int, "r", )

 #define mmiowb() barrier()

+void __iowrite_copy(void __iomem *to, const void *from, size_t count);
+void __ioread_copy(void *to, const void __iomem *from, size_t count);
+
+#define memcpy_toio __iowrite_copy
+#define memcpy_fromio __ioread_copy
+
 #ifdef CONFIG_X86_64

 build_mmio_read(readq, "q", u64, "=r", :"memory")
diff --git a/include/linux/io.h b/include/linux/io.h
index 32e30e8fb9db..642f78970018 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -28,6 +28,8 @@
 struct device;
 struct resource;

+void __ioread_copy(void *to, const void __iomem *from, size_t count);
+void __iowrite_copy(void __iomem *to, const void *from, size_t count);
 __visible void __iowrite32_copy(void __iomem *to, const void *from, size_t count);
 void __ioread32_copy(void *to, const void __iomem *from, size_t count);
 void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
diff --git a/lib/iomap_copy.c b/lib/iomap_copy.c
index b8f1d6cbb200..8edc359dda62 100644
--- a/lib/iomap_copy.c
+++ b/lib/iomap_copy.c
@@ -17,6 +17,159 @@
 #include <linux/export.h>
 #include <linux/io.h>

+#include <asm/unaligned.h>
+
+static inline bool iomem_align(const void __iomem *ptr, int size, int count)
+{
+	return count >= size && (__force unsigned long)ptr & size;
+}
+
+
+/**
+ * __iowrite_copy - copy data to MMIO space
+ * @to: destination, in MMIO space
+ * @from: source
+ * @count: number of bytes to copy.
+ *
+ * Copy arbitrarily aligned data from kernel space to MMIO space,
+ * using reasonable chunking.
+ */
+void __attribute__((weak)) __iowrite_copy(void __iomem *to,
+	const void *from,
+	size_t count)
+{
+	if (iomem_align(to, 1, count)) {
+		unsigned char data = *(unsigned char *)from;
+		__raw_writeb(data, to);
+		from++;
+		to++;
+		count--;
+	}
+	if (iomem_align(to, 2, count)) {
+		unsigned short data = get_unaligned((unsigned short *)from);
+		__raw_writew(data, to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+#ifdef CONFIG_64BIT
+	if (iomem_align(to, 4, count)) {
+		unsigned int data = get_unaligned((unsigned int *)from);
+		__raw_writel(data, to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+	while (count >= sizeof(unsigned long)) {
+		unsigned long data = get_unaligned((unsigned long *)from);
+#ifdef CONFIG_64BIT
+		__raw_writeq(data, to);
+#else
+		__raw_writel(data, to);
+#endif
+		from += sizeof(unsigned long);
+		to += sizeof(unsigned long);
+		count -= sizeof(unsigned long);
+	}
+
+#ifdef CONFIG_64BIT
+	if (count >= 4) {
+		unsigned int data = get_unaligned((unsigned int *)from);
+		__raw_writel(data, to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+
+	if (count >= 2) {
+		unsigned short data = get_unaligned((unsigned short *)from);
+		__raw_writew(data, to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+
+	if (count) {
+		unsigned char data = *(unsigned char *)from;
+		__raw_writeb(data, to);
+	}
+}
+EXPORT_SYMBOL_GPL(__iowrite_copy);
+
+/**
+ * __ioread_copy - copy data from MMIO space
+ * @to: destination
+ * @from: source, in MMIO space
+ * @count: number of bytes to copy.
+ *
+ * Copy arbitrarily aligned data from MMIO space to kernel space,
+ * using reasonable chunking.
+ */
+void __attribute__((weak)) __ioread_copy(void *to,
+	const void __iomem *from,
+	size_t count)
+{
+	if (iomem_align(from, 1, count)) {
+		unsigned char data = __raw_readb(from);
+		put_unaligned(data, (unsigned char *) to);
+		from++;
+		to++;
+		count--;
+	}
+	if (iomem_align(to, 2, count)) {
+		unsigned short data = __raw_readw(from);
+		put_unaligned(data, (unsigned short *) to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+#ifdef CONFIG_64BIT
+	if (iomem_align(to, 4, count)) {
+		unsigned int data = __raw_readl(from);
+		put_unaligned(data, (unsigned int *) to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+	while (count >= sizeof(unsigned long)) {
+#ifdef CONFIG_64BIT
+		unsigned long data = __raw_readq(from);
+#else
+		unsigned long data = __raw_readl(from);
+#endif
+		put_unaligned(data, (unsigned long *) to);
+		from += sizeof(unsigned long);
+		to += sizeof(unsigned long);
+		count -= sizeof(unsigned long);
+	}
+
+#ifdef CONFIG_64BIT
+	if (count >= 4) {
+		unsigned int data = __raw_readl(from);
+		put_unaligned(data, (unsigned int *) to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+
+	if (count >= 2) {
+		unsigned short data = __raw_readw(from);
+		put_unaligned(data, (unsigned short *) to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+
+	if (count) {
+		unsigned char data = __raw_readb(from);
+		put_unaligned(data, (unsigned char *) to);
+	}
+}
+EXPORT_SYMBOL_GPL(__ioread_copy);

 /**
  * __iowrite32_copy - copy data to MMIO space, in 32-bit units

^ permalink raw reply related	[flat|nested] 39+ messages in thread
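The chunking idea in the patch above can be rendered as userspace C, with plain memory accesses standing in for the `__raw_read*()` MMIO accessors (helper name hypothetical, and the 1/2/4-byte head ladder collapsed into a byte loop for brevity). One thing worth flagging: in the posted `__ioread_copy()`, the 2- and 4-byte head steps test `iomem_align(to, ...)`, while it is presumably the MMIO-side pointer `from` whose alignment matters:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Align on the (stand-in) MMIO pointer, copy native words, drain the
 * tail -- every byte is accessed exactly once, unlike the overlapping
 * trick used by the regular memcpy. */
static void ioread_copy_sketch(void *to, const void *from, size_t count)
{
	unsigned char *dst = to;
	const unsigned char *src = from;

	/* head: byte accesses until the MMIO side is word-aligned */
	while (count && ((uintptr_t)src & (sizeof(unsigned long) - 1))) {
		*dst++ = *src++;	/* __raw_readb() stand-in */
		count--;
	}
	/* body: aligned word-at-a-time accesses */
	while (count >= sizeof(unsigned long)) {
		unsigned long data;

		memcpy(&data, src, sizeof(data));	/* __raw_readq()/__raw_readl() stand-in */
		memcpy(dst, &data, sizeof(data));	/* put_unaligned() stand-in */
		src += sizeof(data);
		dst += sizeof(data);
		count -= sizeof(data);
	}
	/* tail: byte accesses for whatever is left */
	while (count--)
		*dst++ = *src++;
}
```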
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 17:42 ` Linus Torvalds @ 2018-11-23 18:39 ` Andy Lutomirski 2018-11-23 18:44 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Andy Lutomirski @ 2018-11-23 18:39 UTC (permalink / raw) To: Linus Torvalds Cc: David.Laight, Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni > On Nov 23, 2018, at 10:42 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Nov 23, 2018 at 8:36 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Let me write a generic routine in lib/iomap_copy.c (which already does >> the "user specifies chunk size" cases), and hook it up for x86. > > Something like this? > > ENTIRELY UNTESTED! It might not compile. Seriously. And if it does > compile, it might not work. > > And this doesn't actually do the memset_io() function at all, just the > memcpy ones. > > Finally, it's worth noting that on x86, we have this: > > /* > * override generic version in lib/iomap_copy.c > */ > ENTRY(__iowrite32_copy) > movl %edx,%ecx > rep movsd > ret > ENDPROC(__iowrite32_copy) > > because back in 2006, we did this: > > [PATCH] Add faster __iowrite32_copy routine for x86_64 > > This assembly version is measurably faster than the generic version in > lib/iomap_copy.c. > > which actually implies that "rep movsd" is faster than doing > __raw_writel() by hand. > > So it is possible that this should all be arch-specific code rather > than that butt-ugly "generic" code I wrote in this patch. > > End result: I'm not really all that happy about this patch, but it's > perhaps worth testing, and it's definitely worth discussing. Because > our current memcpy_{to,from}io() is truly broken garbage. > > What is memcpy_to_io even supposed to do? 
I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful. I could see a function that writes to aligned memory in specified-sized chunks. And I can see a use for a function to just write it in whatever size chunks the architecture thinks is fastest, and *that* should probably use MOVDIR64B. Or is there some subtlety I’m missing? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 18:39 ` Andy Lutomirski @ 2018-11-23 18:44 ` Linus Torvalds 2018-11-23 19:11 ` Andy Lutomirski 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-23 18:44 UTC (permalink / raw) To: Andy Lutomirski Cc: David.Laight, Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <luto@amacapital.net> wrote: > > What is memcpy_to_io even supposed to do? I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful. We've got hundreds of users of it, so it's fairly common.. > I could see a function that writes to aligned memory in specified-sized chunks. We have that. It's called "__iowrite{32,64}_copy()". It has very few users. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
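For reference, the `__iowrite{32,64}_copy()` family Linus points at is a fixed-width loop; roughly this, with a plain store standing in for `__raw_writel()` (note that `count` is in 32-bit units, not bytes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 'count' 32-bit words: exactly one aligned 32-bit store per word,
 * in order, each word written exactly once -- the well-defined access
 * pattern that plain memcpy_toio() does not guarantee. */
static void iowrite32_copy_sketch(uint32_t *to, const uint32_t *from, size_t count)
{
	while (count--)
		*to++ = *from++;	/* __raw_writel(*from++, to++) in the kernel */
}
```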
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 18:44 ` Linus Torvalds @ 2018-11-23 19:11 ` Andy Lutomirski 2018-11-26 10:12 ` David Laight 0 siblings, 1 reply; 39+ messages in thread From: Andy Lutomirski @ 2018-11-23 19:11 UTC (permalink / raw) To: Linus Torvalds Cc: David.Laight, Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni > On Nov 23, 2018, at 11:44 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <luto@amacapital.net> wrote: >> >> What is memcpy_to_io even supposed to do? I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful. > > We've got hundreds of users of it, so it's fairly common.. > I’m wondering if the “at most long-sizes” restriction matters, especially given that we’re apparently accessing some of the same bytes more than once. I would believe that trying to encourage 16-byte writes (with AVX, ugh) or 64-byte writes (with MOVDIR64B) would be safe and could meaningfully speed up some workloads. >> I could see a function that writes to aligned memory in specified-sized chunks. > > We have that. It's called "__iowrite{32,64}_copy()". It has very few users. > > Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
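Andy's wider-store idea, sketched in userspace with unaligned 16-byte SSE copies (function name hypothetical; in the kernel any SSE/AVX use would also need `kernel_fpu_begin()`/`kernel_fpu_end()` — part of the "ugh" — and MOVDIR64B additionally needs hardware support):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Copy in 16-byte chunks where possible, single bytes for the remainder. */
static void copy_16byte_chunks(void *to, const void *from, size_t count)
{
	unsigned char *dst = to;
	const unsigned char *src = from;

#if defined(__x86_64__)
	while (count >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)src);
		_mm_storeu_si128((__m128i *)dst, v);
		src += 16;
		dst += 16;
		count -= 16;
	}
#endif
	while (count--)
		*dst++ = *src++;
}
```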
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 19:11 ` Andy Lutomirski @ 2018-11-26 10:12 ` David Laight 0 siblings, 0 replies; 39+ messages in thread From: David Laight @ 2018-11-26 10:12 UTC (permalink / raw) To: 'Andy Lutomirski', Linus Torvalds Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Andy Lutomirski > Sent: 23 November 2018 19:11 > > On Nov 23, 2018, at 11:44 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > >> On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <luto@amacapital.net> wrote: > >> > >> What is memcpy_to_io even supposed to do? I’m guessing it’s defined as > >> something like “copy this data to IO space using at most long-sized writes, > >> all aligned, and writing each byte exactly once, in order.” > >> That sounds... dubiously useful. > > > > We've got hundreds of users of it, so it's fairly common.. > > I’m wondering if the “at most long-sizes” restriction matters, especially > given that we’re apparently accessing some of the same bytes more than once. > I would believe that trying to encourage 16-byte writes (with AVX, ugh) or > 64-byte writes (with MOVDIR64B) would be safe and could meaningfully speed > up some workloads. The real gains come from increasing the width of IO reads, not IO writes. None of the x86 cpus I've got issue multiple concurrent PCIe reads (the PCIe completion tag seems to match the core number). PCIe writes are all 'posted' so there aren't big gaps between them. > >> I could see a function that writes to aligned memory in specified-sized chunks. > > > > We have that. It's called "__iowrite{32,64}_copy()". It has very few users. For x86 you want separate entry points for the 'rep movq' copy and one using an instruction loop. (Perhaps with guidance to the cutover length.) 
In most places the driver will know whether the size is above or below the cutover - which might be 256. Certainly transfers below 64 bytes are 'short'. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 16:36 ` Linus Torvalds 2018-11-23 17:42 ` Linus Torvalds @ 2018-11-26 10:01 ` David Laight 2018-11-26 10:26 ` David Laight 2 siblings, 0 replies; 39+ messages in thread From: David Laight @ 2018-11-26 10:01 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 23 November 2018 16:36 > > On Fri, Nov 23, 2018 at 2:12 AM David Laight <David.Laight@aculab.com> wrote: > > > > I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel. > > Calling memcpy_fromio(kernel_buffer, PCIe_address, length) > > generates a lot of single byte TLP. > > I just tested it too - it turns out that the __inline_memcpy() code > never triggers, and "memcpy_toio()" just generates a memcpy. > > So that code seems entirely dead. > > And, in fact, the codebase I looked at was the historical one, because > I had been going back and looking at the history. The modern tree > *does* have the "__inline_memcpy()" function I pointed at, but it's > not actually hooked up to anything! > > This actually has been broken for _ages_. The breakage goes back to > 2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and > it seems nobody really ever noticed - or thought that it was ok. It probably was ok in 2010 - that predates ERMS copy. > That commit claims that iomem has no special significance on x86, but > that really really isn't true, exactly because the access size does > matter. I suspect that memcpy_to/fromio() should only be used for IO space that has 'memory-like' semantics. So it shouldn't really matter what size accesses are done. OTOH the 'io' side is likely to be slow so you want to limit the number of io cycles (reads in particular). 
With memory-like semantics it is ok to read full words at both ends of
the buffer to avoid extra transfers. Indeed, on PCIe, the misaligned
transfer for the last 8 bytes is probably optimal.

> And as mentioned, the generic memory copy routines are not at all
> appropriate, and that has nothing to do with ERMS. Our "do it by hand"
> memory copy routine does things like this:
>
> .Lless_16bytes:
>	cmpl $8, %edx
>	jb .Lless_8bytes
>	/*
>	 * Move data from 8 bytes to 15 bytes.
>	 */
>	movq 0*8(%rsi), %r8
>	movq -1*8(%rsi, %rdx), %r9
>	movq %r8, 0*8(%rdi)
>	movq %r9, -1*8(%rdi, %rdx)
>	retq
>
> and note how for a 8-byte copy, it will do *two* reads of the same 8
> bytes, and *two* writes of the same 8 byte destination. That's
> perfectly ok for regular memory, and it means that the code can handle
> an arbitrary 8-15 byte copy without any conditionals or loop counts,
> but it is *not* ok for iomem.

I'd say it is ok for memcpy_to/fromio() since that should only really be
used for targets with memory-like semantics - and could be documented as
such. But doing the same transfers twice is definitely sub-optimal.

> Of course, in practice it all just happens to work in almost all
> situations (because a lot of iomem devices simply won't care), and
> manual access to iomem is basically extremely rare these days anyway,
> but it's definitely entirely and utterly broken.
>
> End result: we *used* to do this right. For the last eight years our
> "memcpy_{to,from}io()" has been entirely broken, and apparently even
> the people who noticed oddities like David, never reported it as
> breakage but instead just worked around it in drivers.
>
> Ho humm.
>
> Let me write a generic routine in lib/iomap_copy.c (which already does
> the "user specifies chunk size" cases), and hook it up for x86.
>
> David, are you using a bus analyzer or something to do your testing?
> I'll have a trial patch for you asap.
We've a TLP monitor built into our fpga image so we can look at the last
few TLPs (IIRC a 32kB buffer).

Testing patches is a bit harder. The test system isn't one I build
kernels on. Is there a 'sensible' amd64 kernel config that contains most
of the drivers a modern system might need? It is a PITA trying to build
kernels that will load on all my test systems (since I tend to move the
disks between systems). Rebuilding the ubuntu config just takes too long
and generates a ramdisk that doesn't fit in /boot.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 16:36 ` Linus Torvalds 2018-11-23 17:42 ` Linus Torvalds 2018-11-26 10:01 ` David Laight @ 2018-11-26 10:26 ` David Laight 2019-01-05 2:38 ` Linus Torvalds 2 siblings, 1 reply; 39+ messages in thread From: David Laight @ 2018-11-26 10:26 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 23 November 2018 16:36 ... > End result: we *used* to do this right. For the last eight years our > "memcpy_{to,from}io()" has been entirely broken, and apparently even > the people who noticed oddities like David, never reported it as > breakage but instead just worked around it in drivers. I've mentioned it several times... Probably no one else noticed lots of single byte transfers while testing a TLP monitor he was writing for an FPGA :-) They are far too expensive to buy, and would never be connected to the right system at the right time - so we (I) wrote one. Unfortunately we don't really get to see what happens when the link comes up (or rather doesn't come up). We only get the LTSSM state transitions. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-26 10:26 ` David Laight @ 2019-01-05 2:38 ` Linus Torvalds 2019-01-07 9:55 ` David Laight 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2019-01-05 2:38 UTC (permalink / raw) To: David Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni Coming back to this old thread, because I've spent most of the day resurrecting some of my old core x86 patches, and one of them was for the issue David Laight complained about: horrible memcpy_toio() performance. Yes, I should have done this before the merge window instead of at the end of it, but I didn't want to delay things for yet another release just because it fell through some cracks. Anyway, it would be lovely to hear whether memcpy_toio() now works reasonably. I just picked our very old legacy function for this, so it will do things in 32-bit chunks (even on x86-64), and I'm certainly open to somebody doing something smarter, but considering that nobody else seemed to show any interest in this at all, I just went "whatever, good enough". I tried to make it easy to improve on things if people want to. The other ancient patch I resurrected was the old "use asm goto for put_user" which I've had in a private branch for the last almost three years. I've tried to test this (it turns out I had completely screwed up the 32-bit case for put_user, for example), but I only have 64-bit user space, so the i386 build ended up being just about building and looking superficially at the code generation in a couple of places. More testing and comments appreciated. Now I have no ancient patches in any branches, or any known pending issue. Except for all the pull requests that are piling up because I didn't do them today since I was spending time on my own patches. Oh well. 
There's always tomorrow. Linus On Mon, Nov 26, 2018 at 2:26 AM David Laight <David.Laight@aculab.com> wrote: > > From: Linus Torvalds > > Sent: 23 November 2018 16:36 > ... > > End result: we *used* to do this right. For the last eight years our > > "memcpy_{to,from}io()" has been entirely broken, and apparently even > > the people who noticed oddities like David, never reported it as > > breakage but instead just worked around it in drivers. > > I've mentioned it several times... > > Probably no one else noticed lots of single byte transfers while > testing a TLP monitor he was writing for an FPGA :-) > They are far too expensive to buy, and would never be connected > to the right system at the right time - so we (I) wrote one. > > Unfortunately we don't really get to see what happens when the > link comes up (or rather doesn't come up). We only get the > LTSSM state transitions. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-05 2:38 ` Linus Torvalds @ 2019-01-07 9:55 ` David Laight 2019-01-07 17:43 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: David Laight @ 2019-01-07 9:55 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 05 January 2019 02:39 ... > Anyway, it would be lovely to hear whether memcpy_toio() now works > reasonably. I just picked our very old legacy function for this, so it > will do things in 32-bit chunks (even on x86-64), and I'm certainly > open to somebody doing something smarter, but considering that nobody > else seemed to show any interest in this at all, I just went > "whatever, good enough". > > I tried to make it easy to improve on things if people want to. I'll do some tests once the merge has had time to settle. I needed to open-code one part because it wants to do copy_to_user() from a PCIe address buffer (which has to work). Using 64bit chunks for reads is probably worth while on x86-64. I might cook up a patch. Actually, if the AVX registers are available without an fpu save using larger chunks would be worthwhile - especially for io reads. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-07 9:55 ` David Laight @ 2019-01-07 17:43 ` Linus Torvalds 2019-01-08 9:10 ` David Laight 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2019-01-07 17:43 UTC (permalink / raw) To: David Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni On Mon, Jan 7, 2019 at 1:55 AM David Laight <David.Laight@aculab.com> wrote: > > I needed to open-code one part because it wants to do copy_to_user() > from a PCIe address buffer (which has to work). It will never work for memcpy_fromio(). Any driver that thinks it will copy from io space to user space absolutely *has* to do it by hand. No questions, and no exceptions. Some loop like for (..) put_user(readl(iomem++), uaddr++); because neither copy_to_user() nor memcpy_fromio() will *ever* handle that correctly. They might randomly happen to work on x86, but absolutely nowhere else. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-07 17:43 ` Linus Torvalds @ 2019-01-08 9:10 ` David Laight 2019-01-08 18:01 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: David Laight @ 2019-01-08 9:10 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 07 January 2019 17:44 > On Mon, Jan 7, 2019 at 1:55 AM David Laight <David.Laight@aculab.com> wrote: > > > > I needed to open-code one part because it wants to do copy_to_user() > > from a PCIe address buffer (which has to work). > > It will never work for memcpy_fromio(). Any driver that thinks it will > copy from io space to user space absolutely *has* to do it by hand. No > questions, and no exceptions. Some loop like > > for (..) > put_user(readl(iomem++), uaddr++); > > because neither copy_to_user() nor memcpy_fromio() will *ever* handle > that correctly. > > They might randomly happen to work on x86, but absolutely nowhere else. Actually they tend to handle it on a lot of systems. (But I don't do it.) Probably most of those where vm_iomap_memory() (to map IO memory space directly into user pages) works. It might be 'interesting' to build an amd64 kernel where all the IO memory addresses (eg returned by pci_iomap()) are offset by a large constant so direct accesses all fault and all the readl() macros (etc) add it back in. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-08 9:10 ` David Laight @ 2019-01-08 18:01 ` Linus Torvalds 0 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2019-01-08 18:01 UTC (permalink / raw) To: David Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni On Tue, Jan 8, 2019 at 1:10 AM David Laight <David.Laight@aculab.com> wrote: > > > > It will never work for memcpy_fromio(). Any driver that thinks it will > > copy from io space to user space absolutely *has* to do it by hand. No > > questions, and no exceptions. Some loop like > > > > for (..) > > put_user(readl(iomem++), uaddr++); > > > > because neither copy_to_user() nor memcpy_fromio() will *ever* handle > > that correctly. > > > > They might randomly happen to work on x86, but absolutely nowhere else. > > Actually they tend to handle it on a lot of systems. Not with memcpy_fromio(), at least. That doesn't work even on x86. Try it. If the user space page is swapped out (or not mapped), you'd get a kernel page fault. And if you do "copy_to_user()" from a mmio region, you get what you get. If somebody complains about it doing a byte-at-a-time copy, I'll laugh in their face and tell them to fix their broken driver. It might work on about half the architectures out there, but it's still complete garbage, and it's not a bug in copy_to_user(). Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 13:32 ` Jens Axboe 2018-11-21 13:44 ` Denys Vlasenko @ 2018-11-21 13:45 ` Paolo Abeni 2018-11-21 17:27 ` Linus Torvalds 1 sibling, 1 reply; 39+ messages in thread From: Paolo Abeni @ 2018-11-21 13:45 UTC (permalink / raw) To: Jens Axboe, Ingo Molnar Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin, the arch/x86 maintainers, Linus Torvalds, Andrew Morton, Andy Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst, linux-kernel On Wed, 2018-11-21 at 06:32 -0700, Jens Axboe wrote: > I did some more investigation yesterday, and found this: > > commit 236222d39347e0e486010f10c1493e83dbbdfba8 > Author: Paolo Abeni <pabeni@redhat.com> > Date: Thu Jun 29 15:55:58 2017 +0200 > > x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings > > which does attempt to rectify it, but only using ERMS for >= 64 byte copies. > At least for me, looks like the break even point is higher than that, which > would mean that something like the below would be more appropriate. Back then I used a custom kernel module and the tsc to micro-benchmark the patched function and the original one with different buffer sizes. I'm sorry, the relevant code has been lost. In my experiments 64 bytes was the break even point for all the CPUs I had handy, but I guess that may change with other models. Cheers, Paolo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 13:45 ` Paolo Abeni @ 2018-11-21 17:27 ` Linus Torvalds 2018-11-21 18:04 ` Jens Axboe 2018-11-21 18:16 ` Linus Torvalds 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 17:27 UTC (permalink / raw) To: pabeni Cc: Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@redhat.com> wrote: > > In my experiments 64 bytes was the break even point for all the CPUs I > had handy, but I guess that may change with other models. Note that experiments with memcpy speed are almost invariably broken. microbenchmarks don't show the impact of I$, but they also don't show the impact of _behavior_. For example, there might be things like "repeat strings do cacheline optimizations" that end up meaning that cachelines stay in L2, for example, and are never brought into L1. That can be a really good thing, but it can also mean that now the result isn't as close to the CPU, and the subsequent use of the cacheline can be costlier. I say "go for upping the limit to 128 bytes". That said, if the aio user copy is _so_ critical that it's this noticeable, there may be other issues. Sometimes _real_ cost of small user copies is often the STAC/CLAC, more so than the "rep movs". It would be interesting to know exactly which copy it is that matters so much... *inlining* the erms case might show that nicely in profiles. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 17:27 ` Linus Torvalds @ 2018-11-21 18:04 ` Jens Axboe 2018-11-21 18:26 ` Andy Lutomirski 2018-11-21 18:16 ` Linus Torvalds 1 sibling, 1 reply; 39+ messages in thread From: Jens Axboe @ 2018-11-21 18:04 UTC (permalink / raw) To: Linus Torvalds, pabeni Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On 11/21/18 10:27 AM, Linus Torvalds wrote: > On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@redhat.com> wrote: >> >> In my experiments 64 bytes was the break even point for all the CPUs I >> had handy, but I guess that may change with other models. > > Note that experiments with memcpy speed are almost invariably broken. > microbenchmarks don't show the impact of I$, but they also don't show > the impact of _behavior_. > > For example, there might be things like "repeat strings do cacheline > optimizations" that end up meaning that cachelines stay in L2, for > example, and are never brought into L1. That can be a really good > thing, but it can also mean that now the result isn't as close to the > CPU, and the subsequent use of the cacheline can be costlier. Totally agree, which is why all my testing was NOT microbenchmarking. > I say "go for upping the limit to 128 bytes". See below... > That said, if the aio user copy is _so_ critical that it's this > noticeable, there may be other issues. Sometimes _real_ cost of small > user copies is often the STAC/CLAC, more so than the "rep movs". > > It would be interesting to know exactly which copy it is that matters > so much... *inlining* the erms case might show that nicely in > profiles. Oh I totally agree, which is why I since went a different route. The copy that matters is the copy_from_user() of the iocb, which is 64 bytes. Even for 4k IOs, copying 64b per IO is somewhat counter productive for O_DIRECT. 
Playing around with this: http://git.kernel.dk/cgit/linux-block/commit/?h=aio-poll&id=ed0a0a445c0af4cfd18b0682511981eaf352d483 since we're doing a new sys_io_setup2() for polled aio anyway. This completely avoids the iocb copy, but that's just for my initial particular gripe. diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index db4e5aa0858b..21c4d68c5fac 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string) */ ENTRY(copy_user_enhanced_fast_string) ASM_STAC - cmpl $64,%edx - jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */ + cmpl $128,%edx + jb .L_copy_short_string /* less than 128 bytes, avoid costly 'rep' */ movl %edx,%ecx 1: rep movsb -- Jens Axboe ^ permalink raw reply related [flat|nested] 39+ messages in thread
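[Editor's note: in C terms, the one-instruction threshold change in the patch above amounts to the following dispatch. This is a sketch only; both helpers are illustrative stand-ins for the kernel's asm routines, not real symbols, and plain memcpy() is used so the sketch is self-contained.]

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins for the two asm paths: the unrolled
 * short-string copy and the ERMS 'rep movsb' loop.  Both just
 * memcpy() here; only which path gets taken is the point. */
static void copy_short_string(void *to, const void *from, size_t len)
{
	memcpy(to, from, len);
}

static void copy_rep_movsb(void *to, const void *from, size_t len)
{
	memcpy(to, from, len);
}

/* What 'cmpl $128,%edx; jb .L_copy_short_string' does: copies below
 * 128 bytes skip 'rep movsb', whose fixed startup cost dominates
 * on short copies. */
void copy_user_sketch(void *to, const void *from, size_t len)
{
	if (len < 128)
		copy_short_string(to, from, len);
	else
		copy_rep_movsb(to, from, len);
}
```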
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:04 ` Jens Axboe @ 2018-11-21 18:26 ` Andy Lutomirski 2018-11-21 18:43 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Andy Lutomirski @ 2018-11-21 18:26 UTC (permalink / raw) To: Jens Axboe, dave.hansen Cc: Linus Torvalds, pabeni, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing > On Nov 21, 2018, at 11:04 AM, Jens Axboe <axboe@kernel.dk> wrote: > >> On 11/21/18 10:27 AM, Linus Torvalds wrote: >>> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@redhat.com> wrote: >>> >>> In my experiments 64 bytes was the break even point for all the CPUs I >>> had handy, but I guess that may change with other models. >> >> Note that experiments with memcpy speed are almost invariably broken. >> microbenchmarks don't show the impact of I$, but they also don't show >> the impact of _behavior_. >> >> For example, there might be things like "repeat strings do cacheline >> optimizations" that end up meaning that cachelines stay in L2, for >> example, and are never brought into L1. That can be a really good >> thing, but it can also mean that now the result isn't as close to the >> CPU, and the subsequent use of the cacheline can be costlier. > > Totally agree, which is why all my testing was NOT microbenchmarking. > >> I say "go for upping the limit to 128 bytes". > > See below... > >> That said, if the aio user copy is _so_ critical that it's this >> noticeable, there may be other issues. Sometimes _real_ cost of small >> user copies is often the STAC/CLAC, more so than the "rep movs". >> >> It would be interesting to know exactly which copy it is that matters >> so much... *inlining* the erms case might show that nicely in >> profiles. > > Oh I totally agree, which is why I since went a different route. 
The > copy that matters is the copy_from_user() of the iocb, which is 64 > bytes. Even for 4k IOs, copying 64b per IO is somewhat counter > productive for O_DIRECT. Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory? Intel already did all the dirty work of giving something resembling sane semantics for the kernel doing a user-privileged access with WRUSS. How about WRUSER, RDUSER, and maybe even the REP variants? And I suppose LOCK CMPXCHGUSER. Or Intel could try to make STAC and CLAC be genuinely fast (0 or 1 cycles and no stalls *ought* to be possible if it were handled in the front end, as long as there aren’t any PUSHF or POPF instructions in the pipeline). As it stands, I assume that both instructions prevent any following memory accesses from starting until they retire, and they might even be nastily microcoded to handle the overloading of AC. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:26 ` Andy Lutomirski @ 2018-11-21 18:43 ` Linus Torvalds 2018-11-21 22:38 ` Andy Lutomirski 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 18:43 UTC (permalink / raw) To: Andy Lutomirski Cc: Jens Axboe, dave.hansen, pabeni, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On Wed, Nov 21, 2018 at 10:26 AM Andy Lutomirski <luto@amacapital.net> wrote: > > Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory? I did that long ago. It's why we have CLAC/STAC today. I was told that what I actually asked for (get an instruction to access user space - I suggested using a segment override prefix) was not viable. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:43 ` Linus Torvalds @ 2018-11-21 22:38 ` Andy Lutomirski 0 siblings, 0 replies; 39+ messages in thread From: Andy Lutomirski @ 2018-11-21 22:38 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Dave Hansen, pabeni, Ingo Molnar, Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin, X86 ML, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst, LKML On Wed, Nov 21, 2018 at 10:44 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Nov 21, 2018 at 10:26 AM Andy Lutomirski <luto@amacapital.net> wrote: > > > > Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory? > > I did that long ago. It's why we have CLAC/STAC today. I was told that > what I actually asked for (get an instruction to access user space - I > suggested using a segment override prefix) was not viable. > Maybe it wasn't viable *then*, but WRUSS really does write to userspace with user privilege. Surely the same mechanism with some of the weirdness removed would do what we want. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 17:27 ` Linus Torvalds 2018-11-21 18:04 ` Jens Axboe @ 2018-11-21 18:16 ` Linus Torvalds 2018-11-21 19:01 ` Linus Torvalds 2018-11-24 6:09 ` Jens Axboe 1 sibling, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 18:16 UTC (permalink / raw) To: pabeni Cc: Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On Wed, Nov 21, 2018 at 9:27 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It would be interesting to know exactly which copy it is that matters > so much... *inlining* the erms case might show that nicely in > profiles. Side note: the fact that Jens' patch (which I don't like in that form) allegedly shrunk the resulting kernel binary would seem to indicate that there's a *lot* of compile-time constant-sized memcpy calls that we are missing, and that fall back to copy_user_generic(). It might be interesting to just change raw_copy_to/from_user() to handle a lot more cases (in particular, handle cases where 'size' is 8-byte aligned). The special cases we *do* have may not be the right ones (the 10-byte case in particular looks odd). For example, instead of having a "if constant size is 8 bytes, do one get/put_user()" case, we might have a "if constant size is < 64 just unroll it into get/put_user()" calls. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
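[Editor's note: the "unroll small constant sizes into get/put_user() calls" idea above can be sketched in plain userspace C as follows. The word-sized memcpy() calls stand in for the kernel's get_user()/put_user() pairs and their fault handling, and the function name is illustrative, not a kernel symbol.]

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the unrolling idea: copy in descending granule sizes
 * (long, int, short, byte) instead of calling the generic copy
 * routine.  In the kernel, each word move below would be a single
 * user access with an exception-table entry. */
size_t copy_small_sketch(void *dst, const void *src, size_t size)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (size >= sizeof(unsigned long)) {
		memcpy(d, s, sizeof(unsigned long));	/* get_user + put_user */
		d += sizeof(unsigned long);
		s += sizeof(unsigned long);
		size -= sizeof(unsigned long);
	}
	while (size >= sizeof(unsigned int)) {
		memcpy(d, s, sizeof(unsigned int));
		d += sizeof(unsigned int);
		s += sizeof(unsigned int);
		size -= sizeof(unsigned int);
	}
	while (size >= sizeof(unsigned short)) {
		memcpy(d, s, sizeof(unsigned short));
		d += sizeof(unsigned short);
		s += sizeof(unsigned short);
		size -= sizeof(unsigned short);
	}
	while (size) {
		*d++ = *s++;
		size--;
	}
	return 0;	/* the kernel version returns bytes left uncopied */
}
```

With a compile-time-constant size, the compiler can evaluate every loop bound and emit a straight-line sequence of word moves, which is what makes this competitive with the hand-coded special cases.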
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:16 ` Linus Torvalds @ 2018-11-21 19:01 ` Linus Torvalds 2018-11-22 10:32 ` Ingo Molnar 2018-11-24 6:09 ` Jens Axboe 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 19:01 UTC (permalink / raw) To: pabeni Cc: Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing [-- Attachment #1: Type: text/plain, Size: 1893 bytes --] On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It might be interesting to just change raw_copy_to/from_user() to > handle a lot more cases (in particular, handle cases where 'size' is > 8-byte aligned). The special cases we *do* have may not be the right > ones (the 10-byte case in particular looks odd). > > For example, instead of having a "if constant size is 8 bytes, do one > get/put_user()" case, we might have a "if constant size is < 64 just > unroll it into get/put_user()" calls. Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think the constant size cases ever trigger at all the way they are set up now. I do have a random patch that makes "unsafe_put_user()" actually use "asm goto" for the error case, and that, together with the attached patch seems to generate fairly nice code, but even then it would depend on gcc actually unrolling things (which we do *not* want in general). 
But for a 32-byte user copy (cp_old_stat), and that INLINE_COPY_TO_USER, it generates this: stac movl $32, %edx #, size movq %rsp, %rax #, src .L201: movq (%rax), %rcx # MEM[base: src_155, offset: 0B], MEM[base: src_155, offset: 0B] 1: movq %rcx,0(%rbp) # MEM[base: src_155, offset: 0B], MEM[(struct __large_struct *)dst_156] ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess" # addq $8, %rax #, src addq $8, %rbp #, statbuf subq $8, %rdx #, size jne .L201 #, clac which is actually fairly close to "optimal". Random patch (with my "asm goto" hack included) attached, in case people want to play with it. Impressively, it actually removes more lines of code than it adds. But I didn't actually check whether the end result *works*, so hey.. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 13387 bytes --] arch/x86/include/asm/uaccess.h | 96 +++++++++---------- arch/x86/include/asm/uaccess_64.h | 191 ++++++++++++++++++-------------------- fs/readdir.c | 22 +++-- 3 files changed, 149 insertions(+), 160 deletions(-) diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index b5e58cc0c5e7..3f4c89deb7a1 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -12,6 +12,9 @@ #include <asm/smap.h> #include <asm/extable.h> +#define INLINE_COPY_TO_USER +#define INLINE_COPY_FROM_USER + /* * The fs value determines whether argument validity checking should be * performed or not. 
If get_fs() == USER_DS, checking is performed, with @@ -189,19 +192,14 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) #ifdef CONFIG_X86_32 -#define __put_user_asm_u64(x, addr, err, errret) \ - asm volatile("\n" \ - "1: movl %%eax,0(%2)\n" \ - "2: movl %%edx,4(%2)\n" \ - "3:" \ - ".section .fixup,\"ax\"\n" \ - "4: movl %3,%0\n" \ - " jmp 3b\n" \ - ".previous\n" \ - _ASM_EXTABLE_UA(1b, 4b) \ - _ASM_EXTABLE_UA(2b, 4b) \ - : "=r" (err) \ - : "A" (x), "r" (addr), "i" (errret), "0" (err)) +#define __put_user_goto_u64(x, addr, label) \ + asm volatile("\n" \ + "1: movl %%eax,0(%2)\n" \ + "2: movl %%edx,4(%2)\n" \ + _ASM_EXTABLE_UA(1b, %2l) \ + _ASM_EXTABLE_UA(2b, %2l) \ + : : "A" (x), "r" (addr) \ + : : label) #define __put_user_asm_ex_u64(x, addr) \ asm volatile("\n" \ @@ -216,8 +214,8 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) asm volatile("call __put_user_8" : "=a" (__ret_pu) \ : "A" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx") #else -#define __put_user_asm_u64(x, ptr, retval, errret) \ - __put_user_asm(x, ptr, retval, "q", "", "er", errret) +#define __put_user_goto_u64(x, ptr, label) \ + __put_user_goto(x, ptr, "q", "", "er", label) #define __put_user_asm_ex_u64(x, addr) \ __put_user_asm_ex(x, addr, "q", "", "er") #define __put_user_x8(x, ptr, __ret_pu) __put_user_x(8, x, ptr, __ret_pu) @@ -278,23 +276,21 @@ extern void __put_user_8(void); __builtin_expect(__ret_pu, 0); \ }) -#define __put_user_size(x, ptr, size, retval, errret) \ +#define __put_user_size(x, ptr, size, label) \ do { \ - retval = 0; \ __chk_user_ptr(ptr); \ switch (size) { \ case 1: \ - __put_user_asm(x, ptr, retval, "b", "b", "iq", errret); \ + __put_user_goto(x, ptr, "b", "b", "iq", label); \ break; \ case 2: \ - __put_user_asm(x, ptr, retval, "w", "w", "ir", errret); \ + __put_user_goto(x, ptr, "w", "w", "ir", label); \ break; \ case 4: \ - __put_user_asm(x, ptr, retval, "l", "k", "ir", errret); \ + __put_user_goto(x, ptr, "l", "k", "ir", label); \ 
break; \ case 8: \ - __put_user_asm_u64((__typeof__(*ptr))(x), ptr, retval, \ - errret); \ + __put_user_goto_u64((__typeof__(*ptr))(x), ptr, label); \ break; \ default: \ __put_user_bad(); \ @@ -439,9 +435,12 @@ do { \ #define __put_user_nocheck(x, ptr, size) \ ({ \ - int __pu_err; \ + __label__ __pu_label; \ + int __pu_err = -EFAULT; \ __uaccess_begin(); \ - __put_user_size((x), (ptr), (size), __pu_err, -EFAULT); \ + __put_user_size((x), (ptr), (size), __pu_label); \ + __pu_err = 0; \ +__pu_label: \ __uaccess_end(); \ __builtin_expect(__pu_err, 0); \ }) @@ -466,17 +465,23 @@ struct __large_struct { unsigned long buf[100]; }; * we do not write to any memory gcc knows about, so there are no * aliasing issues. */ -#define __put_user_asm(x, addr, err, itype, rtype, ltype, errret) \ - asm volatile("\n" \ - "1: mov"itype" %"rtype"1,%2\n" \ - "2:\n" \ - ".section .fixup,\"ax\"\n" \ - "3: mov %3,%0\n" \ - " jmp 2b\n" \ - ".previous\n" \ - _ASM_EXTABLE_UA(1b, 3b) \ - : "=r"(err) \ - : ltype(x), "m" (__m(addr)), "i" (errret), "0" (err)) +#define __put_user_goto(x, addr, itype, rtype, ltype, label) \ + asm volatile goto("\n" \ + "1: mov"itype" %"rtype"0,%1\n" \ + _ASM_EXTABLE_UA(1b, %l2) \ + : : ltype(x), "m" (__m(addr)) \ + : : label) + +#define __put_user_failed(x, addr, itype, rtype, ltype, errret) \ + ({ __label__ __puflab; \ + int __pufret = errret; \ + __put_user_goto(x,addr,itype,rtype,ltype,__puflab); \ + __pufret = 0; \ + __puflab: __pufret; }) + +#define __put_user_asm(x, addr, retval, itype, rtype, ltype, errret) do { \ + retval = __put_user_failed(x, addr, itype, rtype, ltype, errret); \ +} while (0) #define __put_user_asm_ex(x, addr, itype, rtype, ltype) \ asm volatile("1: mov"itype" %"rtype"0,%1\n" \ @@ -687,12 +692,6 @@ extern struct movsl_mask { #define ARCH_HAS_NOCACHE_UACCESS 1 -#ifdef CONFIG_X86_32 -# include <asm/uaccess_32.h> -#else -# include <asm/uaccess_64.h> -#endif - /* * We rely on the nested NMI work to allow atomic faults from the NMI path; the * 
nested NMI paths are careful to preserve CR2. @@ -711,13 +710,8 @@ extern struct movsl_mask { #define user_access_begin() __uaccess_begin() #define user_access_end() __uaccess_end() -#define unsafe_put_user(x, ptr, err_label) \ -do { \ - int __pu_err; \ - __typeof__(*(ptr)) __pu_val = (x); \ - __put_user_size(__pu_val, (ptr), sizeof(*(ptr)), __pu_err, -EFAULT); \ - if (unlikely(__pu_err)) goto err_label; \ -} while (0) +#define unsafe_put_user(x, ptr, label) \ + __put_user_size((x), (ptr), sizeof(*(ptr)), label) #define unsafe_get_user(x, ptr, err_label) \ do { \ @@ -728,5 +722,11 @@ do { \ if (unlikely(__gu_err)) goto err_label; \ } while (0) +#ifdef CONFIG_X86_32 +# include <asm/uaccess_32.h> +#else +# include <asm/uaccess_64.h> +#endif + #endif /* _ASM_X86_UACCESS_H */ diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index a9d637bc301d..8dd85d2fce28 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -65,117 +65,104 @@ copy_to_user_mcsafe(void *to, const void *from, unsigned len) static __always_inline __must_check unsigned long raw_copy_from_user(void *dst, const void __user *src, unsigned long size) { - int ret = 0; - - if (!__builtin_constant_p(size)) - return copy_user_generic(dst, (__force void *)src, size); - switch (size) { - case 1: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u8 *)dst, (u8 __user *)src, - ret, "b", "b", "=q", 1); - __uaccess_end(); - return ret; - case 2: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u16 *)dst, (u16 __user *)src, - ret, "w", "w", "=r", 2); - __uaccess_end(); - return ret; - case 4: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u32 *)dst, (u32 __user *)src, - ret, "l", "k", "=r", 4); - __uaccess_end(); - return ret; - case 8: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 8); - __uaccess_end(); - return ret; - case 10: - __uaccess_begin_nospec(); - 
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 10); - if (likely(!ret)) - __get_user_asm_nozero(*(u16 *)(8 + (char *)dst), - (u16 __user *)(8 + (char __user *)src), - ret, "w", "w", "=r", 2); - __uaccess_end(); - return ret; - case 16: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 16); - if (likely(!ret)) - __get_user_asm_nozero(*(u64 *)(8 + (char *)dst), - (u64 __user *)(8 + (char __user *)src), - ret, "q", "", "=r", 8); - __uaccess_end(); - return ret; - default: + if (!__builtin_constant_p(size) || size > 128) return copy_user_generic(dst, (__force void *)src, size); + + /* Small constant size: unroll */ + user_access_begin(); + while (size >= sizeof(unsigned long)) { + unsigned long val; + unsafe_get_user(val, (const unsigned long __user *) src, out); + *(unsigned long *)dst = val; + size -= sizeof(unsigned long); + src += sizeof(unsigned long); + dst += sizeof(unsigned long); } + + while (size >= sizeof(unsigned int)) { + unsigned int val; + unsafe_get_user(val, (const unsigned int __user *) src, out); + *(unsigned int *)dst = val; + size -= sizeof(unsigned int); + src += sizeof(unsigned int); + dst += sizeof(unsigned int); + } + + while (size >= sizeof(unsigned short)) { + unsigned short val; + unsafe_get_user(val, (const unsigned short __user *) src, out); + *(unsigned short *)dst = val; + size -= sizeof(unsigned short); + src += sizeof(unsigned short); + dst += sizeof(unsigned short); + } + + while (size >= sizeof(unsigned char)) { + unsigned char val; + unsafe_get_user(val, (const unsigned char __user *) src, out); + *(unsigned char *)dst = val; + size -= sizeof(unsigned char); + src += sizeof(unsigned char); + dst += sizeof(unsigned char); + } + user_access_end(); + return 0; + +out: + user_access_end(); + return size; } static __always_inline __must_check unsigned long raw_copy_to_user(void __user *dst, const void *src, unsigned long size) { - int ret = 0; - - if 
(!__builtin_constant_p(size)) - return copy_user_generic((__force void *)dst, src, size); - switch (size) { - case 1: - __uaccess_begin(); - __put_user_asm(*(u8 *)src, (u8 __user *)dst, - ret, "b", "b", "iq", 1); - __uaccess_end(); - return ret; - case 2: - __uaccess_begin(); - __put_user_asm(*(u16 *)src, (u16 __user *)dst, - ret, "w", "w", "ir", 2); - __uaccess_end(); - return ret; - case 4: - __uaccess_begin(); - __put_user_asm(*(u32 *)src, (u32 __user *)dst, - ret, "l", "k", "ir", 4); - __uaccess_end(); - return ret; - case 8: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 8); - __uaccess_end(); - return ret; - case 10: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 10); - if (likely(!ret)) { - asm("":::"memory"); - __put_user_asm(4[(u16 *)src], 4 + (u16 __user *)dst, - ret, "w", "w", "ir", 2); - } - __uaccess_end(); - return ret; - case 16: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 16); - if (likely(!ret)) { - asm("":::"memory"); - __put_user_asm(1[(u64 *)src], 1 + (u64 __user *)dst, - ret, "q", "", "er", 8); - } - __uaccess_end(); - return ret; - default: + if (!__builtin_constant_p(size) || size > 128) return copy_user_generic((__force void *)dst, src, size); + + /* Small constant size: unroll */ + user_access_begin(); + while (size >= sizeof(unsigned long)) { + unsigned long val; + val = *(const unsigned long *)src; + unsafe_put_user(val, (unsigned long __user *) dst, out); + size -= sizeof(unsigned long); + src += sizeof(unsigned long); + dst += sizeof(unsigned long); + } + + while (size >= sizeof(unsigned int)) { + unsigned int val; + val = *(const unsigned int *)src; + unsafe_put_user(val, (unsigned int __user *) dst, out); + size -= sizeof(unsigned int); + src += sizeof(unsigned int); + dst += sizeof(unsigned int); + } + + while (size >= sizeof(unsigned short)) { + unsigned short val; + val = *(const unsigned short 
*)src; + unsafe_put_user(val, (unsigned short __user *) dst, out); + size -= sizeof(unsigned short); + src += sizeof(unsigned short); + dst += sizeof(unsigned short); + } + + while (size >= sizeof(unsigned char)) { + unsigned char val; + val = *(const unsigned char *)src; + unsafe_put_user(val, (unsigned char __user *) dst, out); + size -= sizeof(unsigned char); + src += sizeof(unsigned char); + dst += sizeof(unsigned char); } + + user_access_end(); + return 0; + +out: + user_access_end(); + return size; } static __always_inline __must_check diff --git a/fs/readdir.c b/fs/readdir.c index d97f548e6323..f1e159e17125 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -185,25 +185,27 @@ static int filldir(struct dir_context *ctx, const char *name, int namlen, if (dirent) { if (signal_pending(current)) return -EINTR; - if (__put_user(offset, &dirent->d_off)) - goto efault; } + + user_access_begin(); + if (dirent) + unsafe_put_user(offset, &dirent->d_off, efault_end); dirent = buf->current_dir; - if (__put_user(d_ino, &dirent->d_ino)) - goto efault; - if (__put_user(reclen, &dirent->d_reclen)) - goto efault; + unsafe_put_user(d_ino, &dirent->d_ino, efault_end); + unsafe_put_user(reclen, &dirent->d_reclen, efault_end); + unsafe_put_user(0, dirent->d_name + namlen, efault_end); + unsafe_put_user(d_type, (char __user *) dirent + reclen - 1, efault_end); + user_access_end(); + if (copy_to_user(dirent->d_name, name, namlen)) goto efault; - if (__put_user(0, dirent->d_name + namlen)) - goto efault; - if (__put_user(d_type, (char __user *) dirent + reclen - 1)) - goto efault; buf->previous = dirent; dirent = (void __user *)dirent + reclen; buf->current_dir = dirent; buf->count -= reclen; return 0; +efault_end: + user_access_end(); efault: buf->error = -EFAULT; return -EFAULT; ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 19:01 ` Linus Torvalds @ 2018-11-22 10:32 ` Ingo Molnar 2018-11-22 11:13 ` Ingo Molnar 2018-11-22 16:55 ` Linus Torvalds 0 siblings, 2 replies; 39+ messages in thread From: Ingo Molnar @ 2018-11-22 10:32 UTC (permalink / raw) To: Linus Torvalds Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It might be interesting to just change raw_copy_to/from_user() to > > handle a lot more cases (in particular, handle cases where 'size' is > > 8-byte aligned). The special cases we *do* have may not be the right > > ones (the 10-byte case in particular looks odd). > > > > For example, instead of having a "if constant size is 8 bytes, do one > > get/put_user()" case, we might have a "if constant size is < 64 just > > unroll it into get/put_user()" calls. > > Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think > the constant size cases ever trigger at all the way they are set up > now. Side note, there's one artifact the patch exposes: some of the __builtin_constant_p() checks are imprecise and don't trigger at the early stage where GCC checks them, but the length is actually known to the compiler at later optimization stages. This means that with Jens's patch some of the length checks go away. I checked x86-64 defconfig and a distro config, and the numbers were ~7% and 10%, so not a big effect. 
The kernel text size reduction with Jens's patch is small but real: text data bss dec hex filename 19572694 11516934 19873888 50963516 309a43c vmlinux.before 19572468 11516934 19873888 50963290 309a35a vmlinux.after But I checked the disassembly, and it's not a real win, the new code is actually more complex than the old one, as expected, but GCC (7.3.0) does some particularly stupid things which bloat the generated code. > I do have a random patch that makes "unsafe_put_user()" actually use > "asm goto" for the error case, and that, together with the attached > patch seems to generate fairly nice code, but even then it would > depend on gcc actually unrolling things (which we do *not* want in > general). > > But for a 32-byte user copy (cp_old_stat), and that > INLINE_COPY_TO_USER, it generates this: > > stac > movl $32, %edx #, size > movq %rsp, %rax #, src > .L201: > movq (%rax), %rcx # MEM[base: src_155, offset: 0B], > MEM[base: src_155, offset: 0B] > 1: movq %rcx,0(%rbp) # MEM[base: src_155, offset: 0B], > MEM[(struct __large_struct *)dst_156] > ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess" # > > addq $8, %rax #, src > addq $8, %rbp #, statbuf > subq $8, %rdx #, size > jne .L201 #, > clac > > which is actually fairly close to "optimal". Yeah, that looks pretty sweet! > Random patch (with my "asm goto" hack included) attached, in case > people want to play with it. Doesn't even look all that hacky to me. Any hack in it that I didn't notice? :-) The only question is the inlining overhead - will try to measure that. > Impressively, it actually removes more lines of code than it adds. But > I didn't actually check whether the end result *works*, so hey.. 
Most of the linecount reduction appears to come from the simplification of the unroll loop and moving it into C, from a 6-way hard-coded copy routine: > - switch (size) { > - case 1: > - case 2: > - case 4: > - case 8: > - case 10: > - case 16: to a more flexible 4-way loop unrolling: > + while (size >= sizeof(unsigned long)) { > + while (size >= sizeof(unsigned int)) { > + while (size >= sizeof(unsigned short)) { > + while (size >= sizeof(unsigned char)) { Which is a nice improvement in itself. > + user_access_begin(); > + if (dirent) > + unsafe_put_user(offset, &dirent->d_off, efault_end); > dirent = buf->current_dir; > + unsafe_put_user(d_ino, &dirent->d_ino, efault_end); > + unsafe_put_user(reclen, &dirent->d_reclen, efault_end); > + unsafe_put_user(0, dirent->d_name + namlen, efault_end); > + unsafe_put_user(d_type, (char __user *) dirent + reclen - 1, efault_end); > + user_access_end(); > + > if (copy_to_user(dirent->d_name, name, namlen)) > goto efault; > buf->previous = dirent; > dirent = (void __user *)dirent + reclen; > buf->current_dir = dirent; > buf->count -= reclen; > return 0; > +efault_end: > + user_access_end(); > efault: > buf->error = -EFAULT; > return -EFAULT; In terms of high-level APIs, could we perhaps use the opportunity to introduce unsafe_write_user() instead, which would allow us to write it as: unsafe_write_user(&dirent->d_ino, d_ino, efault_end); unsafe_write_user(&dirent->d_reclen, reclen, efault_end); unsafe_write_user(dirent->d_name + namlen, 0, efault_end); unsafe_write_user((char __user *)dirent + reclen - 1, d_type, efault_end); if (copy_to_user(dirent->d_name, name, namlen)) goto efault; This gives it the regular 'VAR = VAL;' notation of C assignments, instead of the weird historical reverse notation that put_user()/get_user() use. 
Note how this newfangled ordering now matches the 'copy_to_user()' natural C-assignment parameter order that comes straight afterwards and makes it obvious that the d->name+namelen was writing the delimiter at the end. I think we even had bugs from put_user() ordering mixups? Or is it too late to try to fix this particular mistake? Thanks, Ingo ^ permalink raw reply [flat|nested] 39+ messages in thread
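[Editor's note: the unsafe_write_user() proposed above is just a thin argument-order wrapper over the existing primitive. A userspace mock of the idea follows; the goto-on-fault behavior is simulated with a NULL-pointer check, and none of these names are merged kernel API.]

```c
/* Userspace mock of unsafe_put_user(): the real kernel macro is an
 * asm-goto store whose exception-table entry jumps to 'label' on a
 * fault.  Here a NULL pointer stands in for a faulting user address. */
#define unsafe_put_user(x, ptr, label) \
	do { if (!(ptr)) goto label; *(ptr) = (x); } while (0)

/* The proposed wrapper: same primitive, but destination-first argument
 * order, so call sites read like a C assignment 'VAR = VAL;'. */
#define unsafe_write_user(ptr, x, label) \
	unsafe_put_user(x, ptr, label)

/* A filldir()-style user (field names mirror the example above). */
int fill_dirent_sketch(long *d_ino, long *d_off)
{
	unsafe_write_user(d_ino, 1234L, efault);
	unsafe_write_user(d_off, 42L, efault);
	return 0;
efault:
	return -14;	/* -EFAULT */
}
```

The call sites then read destination-first, like ordinary C assignments, which is exactly the ordering mix-up the proposal is trying to make impossible.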
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-22 10:32 ` Ingo Molnar @ 2018-11-22 11:13 ` Ingo Molnar 2018-11-22 11:21 ` Ingo Molnar 2018-11-23 16:40 ` Josh Poimboeuf 2018-11-22 16:55 ` Linus Torvalds 1 sibling, 2 replies; 39+ messages in thread From: Ingo Molnar @ 2018-11-22 11:13 UTC (permalink / raw) To: Linus Torvalds Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing * Ingo Molnar <mingo@kernel.org> wrote: > The kernel text size reduction with Jens's patch is small but real: > > text data bss dec hex filename > 19572694 11516934 19873888 50963516 309a43c vmlinux.before > 19572468 11516934 19873888 50963290 309a35a vmlinux.after > > But I checked the disassembly, and it's not a real win, the new code is > actually more complex than the old one, as expected, but GCC (7.3.0) does > some particularly stupid things which bloat the generated code. So I dug into this some more: 1) Firstly I tracked down GCC bloating the might_fault() checks and the related out-of-line code exception handling which bloats the full generated function. 2) But with even that complication eliminated, there's a size reduction when Jens's patch is applied, which is puzzling: 19563640 11516790 19882080 50962510 309a04e vmlinux.before 19563274 11516790 19882080 50962144 3099ee0 vmlinux.after but this is entirely due to the .altinstructions section being counted as 'text' part of the vmlinux - while in reality it's not: 3) The _real_ part of the vmlinux gets bloated by Jens's patch: ffffffff81000000 <_stext>: before: ffffffff81b0e5e0 <__clear_user> after: ffffffff81b0e670 <__clear_user>: I.e. we get a e5e0 => e670 bloat, as expected. 
In the config I tested a later section of the kernel image first aligns away the bloat: before: ffffffff82fa6321 <.altinstr_aux>: after: ffffffff82fa6321 <.altinstr_aux>: and then artificially debloats the modified kernel via the altinstructions section: before: Disassembly of section .exit.text: ffffffff83160798 <intel_uncore_exit> after: Disassembly of section .exit.text: ffffffff83160608 <intel_uncore_exit> Note that there's a third level of obfuscation here: Jens's patch actually *adds* a new altinstructions statement: + /* + * For smaller copies, don't use ERMS as it's slower. + */ + if (len < 128) { + alternative_call(copy_user_generic_unrolled, + copy_user_generic_string, X86_FEATURE_REP_GOOD, + ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from), + "=d" (len)), + "1" (to), "2" (from), "3" (len) + : "memory", "rcx", "r8", "r9", "r10", "r11"); + return ret; + } + /* * If CPU has ERMS feature, use copy_user_enhanced_fast_string. * Otherwise, if CPU has rep_good feature, use copy_user_generic_string. * Otherwise, use copy_user_generic_unrolled. */ alternative_call_2(copy_user_generic_unrolled, - copy_user_generic_string, - X86_FEATURE_REP_GOOD, - copy_user_enhanced_fast_string, - X86_FEATURE_ERMS, + copy_user_generic_string, X86_FEATURE_REP_GOOD, + copy_user_enhanced_fast_string, X86_FEATURE_ERMS, ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from), "=d" (len)), "1" (to), "2" (from), "3" (len) So how can this change possibly result in a *smaller* altinstructions section? 4) The reason is GCC's somewhat broken __builtin_constant_p() logic, which leaves ~10% of the constant call sites actually active, but which are then optimized by GCC's later stages, and the alternative_call_2() gets optimized out and replaced with the alternative_call() call. This is where Jens's patch 'debloats' the vmlinux and confuses the 'size' utility and gains its code reduction street cred. 
Note to self: watch out for patches that change altinstructions and don't make premature vmlinux size impact assumptions. :-) Thanks, Ingo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-22 11:13 ` Ingo Molnar @ 2018-11-22 11:21 ` Ingo Molnar 2018-11-23 16:40 ` Josh Poimboeuf 1 sibling, 0 replies; 39+ messages in thread From: Ingo Molnar @ 2018-11-22 11:21 UTC (permalink / raw) To: Linus Torvalds Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing * Ingo Molnar <mingo@kernel.org> wrote: > So I dug into this some more: > > 1) > > Firstly I tracked down GCC bloating the might_fault() checks and the > related out-of-line code exception handling which bloats the full > generated function. Sorry, I mis-remembered that detail when I wrote the email: it was CONFIG_HARDENED_USERCOPY=y and its object size checks that distros enable - and I disabled that option to simplify the size analysis. (might_fault() doesn't have inline conditionals so shouldn't have any effect on the generated code.) Thanks, Ingo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 11:13 ` Ingo Molnar
  2018-11-22 11:21   ` Ingo Molnar
@ 2018-11-23 16:40   ` Josh Poimboeuf
  1 sibling, 0 replies; 39+ messages in thread
From: Josh Poimboeuf @ 2018-11-23 16:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar,
    bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst,
    Linux List Kernel Mailing

On Thu, Nov 22, 2018 at 12:13:41PM +0100, Ingo Molnar wrote:
> Note to self: watch out for patches that change altinstructions and
> don't make premature vmlinux size impact assumptions. :-)

I noticed a similar problem with ORC data.  As it turns out, size's
"text" calculation also includes read-only sections.  That includes
.rodata and anything else not writable.

Maybe we need a more sensible "size" script for the kernel.  It would
be trivial to implement based on the output of "readelf -S vmlinux".

-- 
Josh
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 10:32 ` Ingo Molnar
  2018-11-22 11:13   ` Ingo Molnar
@ 2018-11-22 16:55   ` Linus Torvalds
  2018-11-22 17:26     ` Andy Lutomirski
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2018-11-22 16:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin,
    the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski,
    Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing

On Thu, Nov 22, 2018 at 2:32 AM Ingo Molnar <mingo@kernel.org> wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > Random patch (with my "asm goto" hack included) attached, in case
> > people want to play with it.
>
> Doesn't even look all that hacky to me. Any hack in it that I didn't
> notice? :-)

The code to use asm goto sadly doesn't have any fallback at all for
the "no asm goto available" case.

I guess we're getting close to "we require asm goto support", but I
don't think we're there yet.

Also, while "unsafe_put_user()" has been converted to use asm goto
(and yes, it really does generate much nicer code), the same is not
true of "unsafe_get_user()".  Because sadly, gcc does not support asm
goto with output values.

So, realistically, my patch is not _technically_ hacky, but it's
simply not viable as things stand, and it's more of a tech
demonstration than anything else.

               Linus
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 16:55 ` Linus Torvalds
@ 2018-11-22 17:26   ` Andy Lutomirski
  2018-11-22 17:35     ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2018-11-22 17:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, H. Peter Anvin, X86 ML, Andrew Morton,
    Andrew Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst, LKML

On Thu, Nov 22, 2018 at 8:56 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Nov 22, 2018 at 2:32 AM Ingo Molnar <mingo@kernel.org> wrote:
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > Random patch (with my "asm goto" hack included) attached, in case
> > > people want to play with it.
> >
> > Doesn't even look all that hacky to me. Any hack in it that I didn't
> > notice? :-)
>
> The code to use asm goto sadly doesn't have any fallback at all for
> the "no asm goto available".
>
> I guess we're getting close to "we require asm goto support", but I
> don't think we're there yet.

commit e501ce957a786ecd076ea0cfb10b114e6e4d0f40
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jan 17 11:42:07 2018 +0100

    x86: Force asm-goto

    We want to start using asm-goto to guarantee the absence of dynamic
    branches (and thus speculation).

    A primary prerequisite for this is of course that the compiler
    supports asm-goto.  This effectively lifts the minimum GCC version
    to build an x86 kernel to gcc-4.5.

This is basically the only good outcome from the speculation crap as
far as I'm concerned :)

So I think your patch is viable.  Also, with that patch applied,
put_user_ex() should become worse than worthless -- if gcc is any
good, plain old:

    if (unsafe_put_user(...) != 0) goto err;
    if (unsafe_put_user(...) != 0) goto err;

etc. will generate *better* code than a series of put_user_ex() calls.
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 17:26 ` Andy Lutomirski
@ 2018-11-22 17:35   ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2018-11-22 17:35 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Ingo Molnar, pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp,
    Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing

On Thu, Nov 22, 2018 at 9:26 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> So I think your patch is viable. Also, with that patch applied,
> put_user_ex() should become worse than worthless

Yes.  I hate those special-case _ex variants.

I guess I should just properly forward-port my patch series where the
different steps are separated out (not jumbled up like that patch I
actually posted).

               Linus
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-21 18:16 ` Linus Torvalds
  2018-11-21 19:01   ` Linus Torvalds
@ 2018-11-24  6:09   ` Jens Axboe
  1 sibling, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2018-11-24 6:09 UTC (permalink / raw)
To: Linus Torvalds, pabeni
Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin,
    the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski,
    Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing

On 11/21/18 11:16 AM, Linus Torvalds wrote:
> On Wed, Nov 21, 2018 at 9:27 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> It would be interesting to know exactly which copy it is that matters
>> so much... *inlining* the erms case might show that nicely in
>> profiles.
>
> Side note: the fact that Jens' patch (which I don't like in that form)
> allegedly shrunk the resulting kernel binary would seem to indicate
> that there's a *lot* of compile-time constant-sized memcpy calls that
> we are missing, and that fall back to copy_user_generic().

Other kind of side note... This also affects memset(), which does rep
stosb for a memset of any size if we have ERMS.  I noticed this from
sg_init_table(), which does a memset of the table.  For my kind of
testing, the entry size is small.  The below, too, reduces memset()
overhead by 50% here for me.

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9bc861c71e75..bad0fdb9ddcd 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -60,6 +60,8 @@ EXPORT_SYMBOL(__memset)
  * rax	original destination
  */
 ENTRY(memset_erms)
+	cmpl $128,%edx
+	jb memset_orig
 	movq %rdi,%r9
 	movb %sil,%al
 	movq %rdx,%rcx

-- 
Jens Axboe
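The dispatch Jens's memset_64.S hunk adds can be sketched in portable user-space C: fills below the cutoff skip the ERMS path (rep stosb), whose fixed startup cost dominates small operations, and take an open-coded routine instead. Everything here is a stand-in — the 128-byte cutoff mirrors the `cmpl $128,%edx` in the patch, the byte loop stands in for memset_orig, and libc memset() stands in for the rep stosb path:

```c
#include <assert.h>
#include <string.h>

#define ERMS_CUTOFF 128			/* mirrors "cmpl $128,%edx" */

static void *memset_orig_sketch(void *s, int c, size_t n)
{
	unsigned char *p = s;

	while (n--)			/* byte loop: stand-in for the
					   open-coded memset_orig */
		*p++ = (unsigned char)c;
	return s;
}

static void *memset_erms_sketch(void *s, int c, size_t n)
{
	return memset(s, c, n);		/* stand-in for rep stosb */
}

static void *memset_dispatch(void *s, int c, size_t n)
{
	if (n < ERMS_CUTOFF)		/* "jb memset_orig" */
		return memset_orig_sketch(s, c, n);
	return memset_erms_sketch(s, c, n);
}
```

The interesting part of the real patch is only the two added instructions: the threshold check runs before rep stosb is ever issued, so small fills never pay its startup latency.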
end of thread, other threads:[~2019-01-08 18:01 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <02bfc577-32a5-66be-64bf-d476b7d447d2@kernel.dk>
2018-11-20 20:24 ` [PATCH] x86: only use ERMS for user copies for larger sizes Jens Axboe
2018-11-21  6:36 ` Ingo Molnar
2018-11-21 13:32 ` Jens Axboe
2018-11-21 13:44 ` Denys Vlasenko
2018-11-22 17:36 ` David Laight
2018-11-22 17:52 ` Linus Torvalds
2018-11-22 18:06 ` Andy Lutomirski
2018-11-22 18:58 ` Linus Torvalds
2018-11-23  9:34 ` David Laight
2018-11-23 10:12 ` David Laight
2018-11-23 16:36 ` Linus Torvalds
2018-11-23 17:42 ` Linus Torvalds
2018-11-23 18:39 ` Andy Lutomirski
2018-11-23 18:44 ` Linus Torvalds
2018-11-23 19:11 ` Andy Lutomirski
2018-11-26 10:12 ` David Laight
2018-11-26 10:01 ` David Laight
2018-11-26 10:26 ` David Laight
2019-01-05  2:38 ` Linus Torvalds
2019-01-07  9:55 ` David Laight
2019-01-07 17:43 ` Linus Torvalds
2019-01-08  9:10 ` David Laight
2019-01-08 18:01 ` Linus Torvalds
2018-11-21 13:45 ` Paolo Abeni
2018-11-21 17:27 ` Linus Torvalds
2018-11-21 18:04 ` Jens Axboe
2018-11-21 18:26 ` Andy Lutomirski
2018-11-21 18:43 ` Linus Torvalds
2018-11-21 22:38 ` Andy Lutomirski
2018-11-21 18:16 ` Linus Torvalds
2018-11-21 19:01 ` Linus Torvalds
2018-11-22 10:32 ` Ingo Molnar
2018-11-22 11:13 ` Ingo Molnar
2018-11-22 11:21 ` Ingo Molnar
2018-11-23 16:40 ` Josh Poimboeuf
2018-11-22 16:55 ` Linus Torvalds
2018-11-22 17:26 ` Andy Lutomirski
2018-11-22 17:35 ` Linus Torvalds
2018-11-24  6:09 ` Jens Axboe