linux-kernel.vger.kernel.org archive mirror
* [PATCH] x86/uaccess: use unrolled string copy for short strings
@ 2017-06-21 11:09 Paolo Abeni
  2017-06-21 17:38 ` Kees Cook
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Paolo Abeni @ 2017-06-21 11:09 UTC (permalink / raw)
  To: x86
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Al Viro, Kees Cook,
	Hannes Frederic Sowa, linux-kernel

The 'rep' prefix incurs a significant setup cost; as a result, for
short strings, copies using unrolled loops are faster than even the
optimized 'rep'-based string copy variants.

This change updates copy_user_generic() to use the unrolled version
for small copy lengths. The threshold for a short string - 64 bytes -
has been selected empirically as the largest value that still ensures
a measurable gain.

A micro-benchmark of __copy_from_user() with different lengths shows
the following:

string len	vanilla		patched 	delta
bytes		ticks		ticks		tick(%)

0		58		26		32(55%)
1		49		29		20(40%)
2		49		31		18(36%)
3		49		32		17(34%)
4		50		34		16(32%)
5		49		35		14(28%)
6		49		36		13(26%)
7		49		38		11(22%)
8		50		31		19(38%)
9		51		33		18(35%)
10		52		36		16(30%)
11		52		37		15(28%)
12		52		38		14(26%)
13		52		40		12(23%)
14		52		41		11(21%)
15		52		42		10(19%)
16		51		34		17(33%)
17		51		35		16(31%)
18		52		37		15(28%)
19		51		38		13(25%)
20		52		39		13(25%)
21		52		40		12(23%)
22		51		42		9(17%)
23		51		46		5(9%)
24		52		35		17(32%)
25		52		37		15(28%)
26		52		38		14(26%)
27		52		39		13(25%)
28		52		40		12(23%)
29		53		42		11(20%)
30		52		43		9(17%)
31		52		44		8(15%)
32		51		36		15(29%)
33		51		38		13(25%)
34		51		39		12(23%)
35		51		41		10(19%)
36		52		41		11(21%)
37		52		43		9(17%)
38		51		44		7(13%)
39		52		46		6(11%)
40		51		37		14(27%)
41		50		38		12(24%)
42		50		39		11(22%)
43		50		40		10(20%)
44		50		42		8(16%)
45		50		43		7(14%)
46		50		43		7(14%)
47		50		45		5(10%)
48		50		37		13(26%)
49		49		38		11(22%)
50		50		40		10(20%)
51		50		42		8(16%)
52		50		42		8(16%)
53		49		46		3(6%)
54		50		46		4(8%)
55		49		48		1(2%)
56		50		39		11(22%)
57		50		40		10(20%)
58		49		42		7(14%)
59		50		42		8(16%)
60		50		46		4(8%)
61		50		47		3(6%)
62		50		48		2(4%)
63		50		48		2(4%)
64		51		38		13(25%)

Above 64 bytes the gain fades away.

Very similar values are collected for __copy_to_user().
UDP receive performance under a flood of small packets using recvfrom()
increases by ~5%.
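
A minimal userspace sketch of this kind of per-length timing follows; it
times memcpy() as a stand-in, since __copy_from_user() can only be
exercised from kernel context, and it is not the harness that produced
the table above:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <x86intrin.h>          /* __rdtsc() */

  #define ITERS 1000000UL

  int main(void)
  {
          static char src[64], dst[64];

          for (size_t len = 0; len <= 64; len++) {
                  uint64_t start = __rdtsc();

                  for (unsigned long i = 0; i < ITERS; i++) {
                          memcpy(dst, src, len);         /* stand-in for __copy_from_user() */
                          asm volatile("" ::: "memory"); /* keep the copy from being elided */
                  }
                  printf("%zu\t%llu\n", len,
                         (unsigned long long)((__rdtsc() - start) / ITERS));
          }
          return 0;
  }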

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 arch/x86/include/asm/uaccess_64.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9..16a8871 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
 {
 	unsigned ret;
 
+	if (len <= 64)
+		return copy_user_generic_unrolled(to, from, len);
+
 	/*
 	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
 	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
-- 
2.9.4
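
The hunk above simply puts a length check in front of the existing
CPU-feature based selection described in the comment that follows it. A
rough userspace model of the resulting decision logic - using hypothetical
stand-in names, since the real header selects the routine via CPU-feature
alternatives rather than runtime branches - looks like this:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  enum copy_variant { UNROLLED, REP_MOVS_STRING, ERMS_FAST_STRING };

  /* Models copy_user_generic() after the patch: short copies always take the
   * unrolled path; longer ones keep the ERMS / rep_good selection. */
  static enum copy_variant pick_variant(size_t len, bool erms, bool rep_good)
  {
          if (len <= 64)
                  return UNROLLED;        /* avoid the 'rep' setup cost */
          if (erms)
                  return ERMS_FAST_STRING;
          if (rep_good)
                  return REP_MOVS_STRING;
          return UNROLLED;
  }

  int main(void)
  {
          printf("%d %d\n",
                 pick_variant(32, true, true),    /* 0: UNROLLED */
                 pick_variant(256, true, true));  /* 2: ERMS_FAST_STRING */
          return 0;
  }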


* Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
  2017-06-21 11:09 [PATCH] x86/uaccess: use unrolled string copy for short strings Paolo Abeni
@ 2017-06-21 17:38 ` Kees Cook
  2017-06-22 14:55   ` Alan Cox
  2017-06-22  8:47 ` Ingo Molnar
  2017-06-22 17:30 ` Linus Torvalds
  2 siblings, 1 reply; 10+ messages in thread
From: Kees Cook @ 2017-06-21 17:38 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Al Viro,
	Hannes Frederic Sowa, LKML

On Wed, Jun 21, 2017 at 4:09 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> The 'rep' prefix incurs a significant setup cost; as a result, for
> short strings, copies using unrolled loops are faster than even the
> optimized 'rep'-based string copy variants.
>
> This change updates copy_user_generic() to use the unrolled version
> for small copy lengths. The threshold for a short string - 64 bytes -
> has been selected empirically as the largest value that still ensures
> a measurable gain.
>
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
>
> string len      vanilla         patched         delta
> bytes           ticks           ticks           tick(%)
>
> 0               58              26              32(55%)
> 1               49              29              20(40%)
> 2               49              31              18(36%)
> 3               49              32              17(34%)
> 4               50              34              16(32%)
> 5               49              35              14(28%)
> 6               49              36              13(26%)
> 7               49              38              11(22%)
> 8               50              31              19(38%)
> 9               51              33              18(35%)
> 10              52              36              16(30%)
> 11              52              37              15(28%)
> 12              52              38              14(26%)
> 13              52              40              12(23%)
> 14              52              41              11(21%)
> 15              52              42              10(19%)
> 16              51              34              17(33%)
> 17              51              35              16(31%)
> 18              52              37              15(28%)
> 19              51              38              13(25%)
> 20              52              39              13(25%)
> 21              52              40              12(23%)
> 22              51              42              9(17%)
> 23              51              46              5(9%)
> 24              52              35              17(32%)
> 25              52              37              15(28%)
> 26              52              38              14(26%)
> 27              52              39              13(25%)
> 28              52              40              12(23%)
> 29              53              42              11(20%)
> 30              52              43              9(17%)
> 31              52              44              8(15%)
> 32              51              36              15(29%)
> 33              51              38              13(25%)
> 34              51              39              12(23%)
> 35              51              41              10(19%)
> 36              52              41              11(21%)
> 37              52              43              9(17%)
> 38              51              44              7(13%)
> 39              52              46              6(11%)
> 40              51              37              14(27%)
> 41              50              38              12(24%)
> 42              50              39              11(22%)
> 43              50              40              10(20%)
> 44              50              42              8(16%)
> 45              50              43              7(14%)
> 46              50              43              7(14%)
> 47              50              45              5(10%)
> 48              50              37              13(26%)
> 49              49              38              11(22%)
> 50              50              40              10(20%)
> 51              50              42              8(16%)
> 52              50              42              8(16%)
> 53              49              46              3(6%)
> 54              50              46              4(8%)
> 55              49              48              1(2%)
> 56              50              39              11(22%)
> 57              50              40              10(20%)
> 58              49              42              7(14%)
> 59              50              42              8(16%)
> 60              50              46              4(8%)
> 61              50              47              3(6%)
> 62              50              48              2(4%)
> 63              50              48              2(4%)
> 64              51              38              13(25%)
>
> Above 64 bytes the gain fades away.
>
> Very similar values are collected for __copy_to_user().
> UDP receive performance under a flood of small packets using recvfrom()
> increases by ~5%.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Since there are no regressions here, this seems sensible to me. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  arch/x86/include/asm/uaccess_64.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index c5504b9..16a8871 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>         unsigned ret;
>
> +       if (len <= 64)
> +               return copy_user_generic_unrolled(to, from, len);
> +
>         /*
>          * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>          * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
> --
> 2.9.4
>



-- 
Kees Cook
Pixel Security


* Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
  2017-06-21 11:09 [PATCH] x86/uaccess: use unrolled string copy for short strings Paolo Abeni
  2017-06-21 17:38 ` Kees Cook
@ 2017-06-22  8:47 ` Ingo Molnar
  2017-06-22 17:02   ` Paolo Abeni
  2017-06-22 17:30 ` Linus Torvalds
  2 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2017-06-22  8:47 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Al Viro,
	Kees Cook, Hannes Frederic Sowa, linux-kernel, Linus Torvalds,
	Andy Lutomirski


* Paolo Abeni <pabeni@redhat.com> wrote:

> The 'rep' prefix incurs a significant setup cost; as a result, for
> short strings, copies using unrolled loops are faster than even the
> optimized 'rep'-based string copy variants.
> 
> This change updates copy_user_generic() to use the unrolled version
> for small copy lengths. The threshold for a short string - 64 bytes -
> has been selected empirically as the largest value that still ensures
> a measurable gain.
> 
> A micro-benchmark of __copy_from_user() with different lengths shows
> the following:
> 
> string len	vanilla		patched 	delta
> bytes		ticks		ticks		tick(%)
> 
> 0		58		26		32(55%)
> 1		49		29		20(40%)
> 2		49		31		18(36%)
> 3		49		32		17(34%)
> 4		50		34		16(32%)
> 5		49		35		14(28%)
> 6		49		36		13(26%)
> 7		49		38		11(22%)
> 8		50		31		19(38%)
> 9		51		33		18(35%)
> 10		52		36		16(30%)
> 11		52		37		15(28%)
> 12		52		38		14(26%)
> 13		52		40		12(23%)
> 14		52		41		11(21%)
> 15		52		42		10(19%)
> 16		51		34		17(33%)
> 17		51		35		16(31%)
> 18		52		37		15(28%)
> 19		51		38		13(25%)
> 20		52		39		13(25%)
> 21		52		40		12(23%)
> 22		51		42		9(17%)
> 23		51		46		5(9%)
> 24		52		35		17(32%)
> 25		52		37		15(28%)
> 26		52		38		14(26%)
> 27		52		39		13(25%)
> 28		52		40		12(23%)
> 29		53		42		11(20%)
> 30		52		43		9(17%)
> 31		52		44		8(15%)
> 32		51		36		15(29%)
> 33		51		38		13(25%)
> 34		51		39		12(23%)
> 35		51		41		10(19%)
> 36		52		41		11(21%)
> 37		52		43		9(17%)
> 38		51		44		7(13%)
> 39		52		46		6(11%)
> 40		51		37		14(27%)
> 41		50		38		12(24%)
> 42		50		39		11(22%)
> 43		50		40		10(20%)
> 44		50		42		8(16%)
> 45		50		43		7(14%)
> 46		50		43		7(14%)
> 47		50		45		5(10%)
> 48		50		37		13(26%)
> 49		49		38		11(22%)
> 50		50		40		10(20%)
> 51		50		42		8(16%)
> 52		50		42		8(16%)
> 53		49		46		3(6%)
> 54		50		46		4(8%)
> 55		49		48		1(2%)
> 56		50		39		11(22%)
> 57		50		40		10(20%)
> 58		49		42		7(14%)
> 59		50		42		8(16%)
> 60		50		46		4(8%)
> 61		50		47		3(6%)
> 62		50		48		2(4%)
> 63		50		48		2(4%)
> 64		51		38		13(25%)
> 
> Above 64 bytes the gain fades away.
> 
> Very similar values are collected for __copy_to_user().
> UDP receive performance under a flood of small packets using recvfrom()
> increases by ~5%.

What CPU model(s) were used for the performance testing and was it performance 
tested on several different types of CPUs?

Please add a comment here:

+       if (len <= 64)
+               return copy_user_generic_unrolled(to, from, len);
+

... because it's not obvious at all that this is a performance optimization, not a 
correctness issue. Also explain that '64' is a number that we got from performance 
measurements.
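
One possible shape for such a comment - illustrative only, the exact v2
wording being up to the author:

  	/*
  	 * Performance only, not correctness: short copies do not amortize
  	 * the 'rep' setup cost, and 64 bytes was measured as the largest
  	 * length that still shows a gain with the unrolled loop.
  	 */
  	if (len <= 64)
  		return copy_user_generic_unrolled(to, from, len);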

But in general I like the change - as long as it was measured on reasonably modern 
x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thanks,

	Ingo


* Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
  2017-06-21 17:38 ` Kees Cook
@ 2017-06-22 14:55   ` Alan Cox
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Cox @ 2017-06-22 14:55 UTC (permalink / raw)
  To: Kees Cook
  Cc: Paolo Abeni, x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Al Viro, Hannes Frederic Sowa, LKML

> > diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> > index c5504b9..16a8871 100644
> > --- a/arch/x86/include/asm/uaccess_64.h
> > +++ b/arch/x86/include/asm/uaccess_64.h
> > @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
> >  {
> >         unsigned ret;
> >
> > +       if (len <= 64)
> > +               return copy_user_generic_unrolled(to, from, len);
> > +

Given this gets inlined, surely that should be bracketed with an ifdef
against optimizing for space.
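
A sketch of that bracketing, assuming CONFIG_CC_OPTIMIZE_FOR_SIZE is the
guard meant here (no specific symbol is named above):

  #ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE
  	if (len <= 64)
  		return copy_user_generic_unrolled(to, from, len);
  #endif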

Also, you give one set of benchmarks, but how do they look on different
processors - AMD vs Intel, Atom vs Core, Ryzen vs older AMD, etc.?

Alan


* Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
  2017-06-22  8:47 ` Ingo Molnar
@ 2017-06-22 17:02   ` Paolo Abeni
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2017-06-22 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: x86, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Al Viro,
	Kees Cook, Hannes Frederic Sowa, linux-kernel, Linus Torvalds,
	Andy Lutomirski

On Thu, 2017-06-22 at 10:47 +0200, Ingo Molnar wrote:
> * Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > The 'rep' prefix incurs a significant setup cost; as a result, for
> > short strings, copies using unrolled loops are faster than even the
> > optimized 'rep'-based string copy variants.
> > 
> > This change updates copy_user_generic() to use the unrolled version
> > for small copy lengths. The threshold for a short string - 64 bytes -
> > has been selected empirically as the largest value that still ensures
> > a measurable gain.
> > 
> > A micro-benchmark of __copy_from_user() with different lengths shows
> > the following:
> > 
> > string len	vanilla		patched 	delta
> > bytes		ticks		ticks		tick(%)
> > 
> > 0		58		26		32(55%)
> > 1		49		29		20(40%)
> > 2		49		31		18(36%)
> > 3		49		32		17(34%)
> > 4		50		34		16(32%)
> > 5		49		35		14(28%)
> > 6		49		36		13(26%)
> > 7		49		38		11(22%)
> > 8		50		31		19(38%)
> > 9		51		33		18(35%)
> > 10		52		36		16(30%)
> > 11		52		37		15(28%)
> > 12		52		38		14(26%)
> > 13		52		40		12(23%)
> > 14		52		41		11(21%)
> > 15		52		42		10(19%)
> > 16		51		34		17(33%)
> > 17		51		35		16(31%)
> > 18		52		37		15(28%)
> > 19		51		38		13(25%)
> > 20		52		39		13(25%)
> > 21		52		40		12(23%)
> > 22		51		42		9(17%)
> > 23		51		46		5(9%)
> > 24		52		35		17(32%)
> > 25		52		37		15(28%)
> > 26		52		38		14(26%)
> > 27		52		39		13(25%)
> > 28		52		40		12(23%)
> > 29		53		42		11(20%)
> > 30		52		43		9(17%)
> > 31		52		44		8(15%)
> > 32		51		36		15(29%)
> > 33		51		38		13(25%)
> > 34		51		39		12(23%)
> > 35		51		41		10(19%)
> > 36		52		41		11(21%)
> > 37		52		43		9(17%)
> > 38		51		44		7(13%)
> > 39		52		46		6(11%)
> > 40		51		37		14(27%)
> > 41		50		38		12(24%)
> > 42		50		39		11(22%)
> > 43		50		40		10(20%)
> > 44		50		42		8(16%)
> > 45		50		43		7(14%)
> > 46		50		43		7(14%)
> > 47		50		45		5(10%)
> > 48		50		37		13(26%)
> > 49		49		38		11(22%)
> > 50		50		40		10(20%)
> > 51		50		42		8(16%)
> > 52		50		42		8(16%)
> > 53		49		46		3(6%)
> > 54		50		46		4(8%)
> > 55		49		48		1(2%)
> > 56		50		39		11(22%)
> > 57		50		40		10(20%)
> > 58		49		42		7(14%)
> > 59		50		42		8(16%)
> > 60		50		46		4(8%)
> > 61		50		47		3(6%)
> > 62		50		48		2(4%)
> > 63		50		48		2(4%)
> > 64		51		38		13(25%)
> > 
> > Above 64 bytes the gain fades away.
> > 
> > Very similar values are collected for __copy_to_user().
> > UDP receive performance under a flood of small packets using recvfrom()
> > increases by ~5%.
> 
> What CPU model(s) were used for the performance testing and was it performance 
> tested on several different types of CPUs?
> 
> Please add a comment here:
> 
> +       if (len <= 64)
> +               return copy_user_generic_unrolled(to, from, len);
> +
> 
> ... because it's not obvious at all that this is a performance optimization, not a 
> correctness issue. Also explain that '64' is a number that we got from performance 
> measurements.
> 
> But in general I like the change - as long as it was measured on reasonably modern 
> x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thank you for reviewing this.

I'll add a hopefully descriptive comment in v2.

The above figures are for an Intel Xeon E5-2690 v4.

I see similar data points with an i7-6500U CPU, while an i7-4810MQ
shows slightly better improvements. 

I'm in the process of collecting more figures for AMD processors, which
I don't have at hand - it may take some time.

Thanks,

Paolo


* Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
  2017-06-21 11:09 [PATCH] x86/uaccess: use unrolled string copy for short strings Paolo Abeni
  2017-06-21 17:38 ` Kees Cook
  2017-06-22  8:47 ` Ingo Molnar
@ 2017-06-22 17:30 ` Linus Torvalds
  2017-06-22 17:54   ` Paolo Abeni
  2017-06-29 13:55   ` [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string Paolo Abeni
  2 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2017-06-22 17:30 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Al Viro, Kees Cook, Hannes Frederic Sowa,
	Linux Kernel Mailing List

On Wed, Jun 21, 2017 at 4:09 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>
> +       if (len <= 64)
> +               return copy_user_generic_unrolled(to, from, len);
> +
>         /*
>          * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>          * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.

NAK. Please do *not* do this. It puts the check in completely the
wrong place for several reasons:

 (a) it puts it in the inlined caller case (which could be ok for
constant sizes, but not in general)

 (b) it uses the copy_user_generic_unrolled() function that will then
just test the size *AGAIN* against small cases.

so it's both bigger than necessary, and stupid.
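
To make (a) and (b) concrete - a sketch, with the assembly lines quoted
from copy_user_64.S:

  	/* (a) copy_user_generic() is an inline in uaccess_64.h, so this
  	 *     check gets duplicated at every inlined call site:
  	 */
  	if (len <= 64)
  		return copy_user_generic_unrolled(to, from, len);

  	/* (b) copy_user_generic_unrolled (copy_user_64.S) then immediately
  	 *     re-derives len/64 and len%64 before choosing a loop:
  	 *         movl %edx,%ecx ; andl $63,%edx ; shrl $6,%ecx ; jz 17f
  	 *     i.e. the same length is effectively tested a second time.
  	 */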

So if you want to do this optimization, I'd argue that you should just
do it inside the copy_user_enhanced_fast_string() function itself, the
same way we already handle the really small case specially in
copy_user_generic_string().

And do *not* use the unrolled code, which isn't used for small copies
anyway - rewrite the "copy_user_generic_unrolled" function in that
same asm file to have the non-unrolled cases (label "17" and forward)
accessible, so that you don't bother re-testing the size.

                 Linus


* Re: [PATCH] x86/uaccess: use unrolled string copy for short strings
  2017-06-22 17:30 ` Linus Torvalds
@ 2017-06-22 17:54   ` Paolo Abeni
  2017-06-29 13:55   ` [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string Paolo Abeni
  1 sibling, 0 replies; 10+ messages in thread
From: Paolo Abeni @ 2017-06-22 17:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Al Viro, Kees Cook, Hannes Frederic Sowa,
	Linux Kernel Mailing List

On Thu, 2017-06-22 at 10:30 -0700, Linus Torvalds wrote:
> So if you want to do this optimization, I'd argue that you should just
> do it inside the copy_user_enhanced_fast_string() function itself, the
> same way we already handle the really small case specially in
> copy_user_generic_string().
> 
> And do *not* use the unrolled code, which isn't used for small copies
> anyway - rewrite the "copy_user_generic_unrolled" function in that
> same asm file to have the non-unrolled cases (label "17" and forward)
> accessible, so that you don't bother re-testing the size.

Thank you for the feedback.

I'm quite new to the core x86 land; the 'rep movsb' cost popped up while
messing with the networking code. I'll try to dig into the asm.

Regards,

Paolo


* [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string
  2017-06-22 17:30 ` Linus Torvalds
  2017-06-22 17:54   ` Paolo Abeni
@ 2017-06-29 13:55   ` Paolo Abeni
  2017-06-29 21:40     ` Linus Torvalds
  2017-06-30 13:10     ` [tip:x86/asm] x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings tip-bot for Paolo Abeni
  1 sibling, 2 replies; 10+ messages in thread
From: Paolo Abeni @ 2017-06-29 13:55 UTC (permalink / raw)
  To: x86
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Hannes Frederic Sowa, Linus Torvalds, Alan Cox, linux-kernel

According to the Intel datasheet, the 'rep movsb' instruction
exposes a relevant setup cost - 50 ticks - which badly affects
short string copy operations.

This change tries to avoid that cost by calling the explicit
loop available in the unrolled code for strings shorter
than 64 bytes. That value has been selected with empirical
measures as the largest value that still ensures a measurable
gain.

Micro-benchmarks of the __copy_from_user() function with
lengths in the [0-63] range show this performance gain
(the shorter the string, the larger the gain):

- in the [55%-4%] range on Intel Xeon(R) CPU E5-2690 v4
- in the [72%-9%] range on Intel Core i7-4810MQ

Other tested CPUs - namely Intel Atom S1260 and AMD Opteron
8216 - show no differences, because they do not expose the
ERMS feature bit.
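
To check whether a given machine is affected at all, 'grep erms
/proc/cpuinfo' is enough; the sketch below reads the same bit (CPUID
leaf 7, EBX bit 9) directly via the gcc/clang <cpuid.h> helper, which is
an assumption of the sketch rather than anything the patch relies on:

  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx;

          /* CPUID.(EAX=7,ECX=0):EBX bit 9 is the ERMS flag */
          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                  return 1;
          printf("ERMS: %s\n", (ebx & (1u << 9)) ? "yes" : "no");
          return 0;
  }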

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 arch/x86/lib/copy_user_64.S | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index c595957..020f75c 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -37,7 +37,7 @@ ENTRY(copy_user_generic_unrolled)
 	movl %edx,%ecx
 	andl $63,%edx
 	shrl $6,%ecx
-	jz 17f
+	jz .L_copy_short_string
 1:	movq (%rsi),%r8
 2:	movq 1*8(%rsi),%r9
 3:	movq 2*8(%rsi),%r10
@@ -58,7 +58,8 @@ ENTRY(copy_user_generic_unrolled)
 	leaq 64(%rdi),%rdi
 	decl %ecx
 	jnz 1b
-17:	movl %edx,%ecx
+.L_copy_short_string:
+	movl %edx,%ecx
 	andl $7,%edx
 	shrl $3,%ecx
 	jz 20f
@@ -174,6 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
 	ASM_STAC
+	cmpl $64,%edx
+	jb .L_copy_short_string	/* less then 64 bytes, avoid the costly 'rep' */
 	movl %edx,%ecx
 1:	rep
 	movsb
-- 
2.9.4


* Re: [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string
  2017-06-29 13:55   ` [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string Paolo Abeni
@ 2017-06-29 21:40     ` Linus Torvalds
  2017-06-30 13:10     ` [tip:x86/asm] x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings tip-bot for Paolo Abeni
  1 sibling, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2017-06-29 21:40 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: the arch/x86 maintainers, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Hannes Frederic Sowa, Alan Cox,
	Linux Kernel Mailing List

Looks ok to me, so Ack, fwiw.

I've long hoped that Intel would actually fix their microcode
overhead issue so that we could get rid of all this and just inline
"rep movsb", but that hasn't been happening, so this patch looks fine.

Side note: I suspect copy_user_generic_string() might as well use
the same logic too, rather than having a different cut-off (8 bytes)
before falling back to a worse implementation (rep movsb on architectures
that don't mark themselves with the enhanced string feature).

                  Linus

On Thu, Jun 29, 2017 at 6:55 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> According to the Intel datasheet, the 'rep movsb' instruction
> exposes a relevant setup cost - 50 ticks - which badly affects
> short string copy operations.


* [tip:x86/asm] x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
  2017-06-29 13:55   ` [PATCH] x86/uaccess: optimize copy_user_enhanced_fast_string for short string Paolo Abeni
  2017-06-29 21:40     ` Linus Torvalds
@ 2017-06-30 13:10     ` tip-bot for Paolo Abeni
  1 sibling, 0 replies; 10+ messages in thread
From: tip-bot for Paolo Abeni @ 2017-06-30 13:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, hpa, pabeni, mingo, luto, brgerst, dvlasenk, keescook, tglx,
	hannes, peterz, jpoimboe, linux-kernel, gnomes, torvalds

Commit-ID:  236222d39347e0e486010f10c1493e83dbbdfba8
Gitweb:     http://git.kernel.org/tip/236222d39347e0e486010f10c1493e83dbbdfba8
Author:     Paolo Abeni <pabeni@redhat.com>
AuthorDate: Thu, 29 Jun 2017 15:55:58 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 30 Jun 2017 09:52:51 +0200

x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings

According to the Intel datasheet, the REP MOVSB instruction
exposes a pretty heavy setup cost (50 ticks), which hurts
short string copy operations.

This change tries to avoid this cost by calling the explicit
loop available in the unrolled code for strings shorter
than 64 bytes.

The 64 bytes cutoff value is arbitrary from the code logic
point of view - it has been selected based on measurements,
as the largest value that still ensures a measurable gain.

Micro-benchmarks of the __copy_from_user() function with
lengths in the [0-63] range show this performance gain
(the shorter the string, the larger the gain):

 - in the [55%-4%] range on Intel Xeon(R) CPU E5-2690 v4
 - in the [72%-9%] range on Intel Core i7-4810MQ

Other tested CPUs - namely Intel Atom S1260 and AMD Opteron
8216 - show no difference, because they do not expose the
ERMS feature bit.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4533a1d101fd460f80e21329a34928fad521c1d4.1498744345.git.pabeni@redhat.com
[ Clarified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/lib/copy_user_64.S | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index c595957..020f75c 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -37,7 +37,7 @@ ENTRY(copy_user_generic_unrolled)
 	movl %edx,%ecx
 	andl $63,%edx
 	shrl $6,%ecx
-	jz 17f
+	jz .L_copy_short_string
 1:	movq (%rsi),%r8
 2:	movq 1*8(%rsi),%r9
 3:	movq 2*8(%rsi),%r10
@@ -58,7 +58,8 @@ ENTRY(copy_user_generic_unrolled)
 	leaq 64(%rdi),%rdi
 	decl %ecx
 	jnz 1b
-17:	movl %edx,%ecx
+.L_copy_short_string:
+	movl %edx,%ecx
 	andl $7,%edx
 	shrl $3,%ecx
 	jz 20f
@@ -174,6 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
 	ASM_STAC
+	cmpl $64,%edx
+	jb .L_copy_short_string	/* less then 64 bytes, avoid the costly 'rep' */
 	movl %edx,%ecx
 1:	rep
 	movsb

