* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Jens Axboe @ 2018-11-20 20:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
Cc: the arch/x86 maintainers, linux-kernel

Forgot to CC the mailing list...

On 11/20/18 1:18 PM, Jens Axboe wrote:
> Hi,
>
> So this is a fun one... While I was doing the aio polled work, I noticed
> that the submitting process spent a substantial amount of time copying
> data to/from userspace. For aio, that's iocb and io_event, which are 64
> and 32 bytes respectively. Looking closer at this, it seems that
> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
> cost.
>
> I came up with this hack to test it out, and lo and behold, we now cut
> the time spent in copying in half. 50% less.
>
> Since these kinds of patches tend to lend themselves to bike shedding, I
> also ran a string of kernel compilations out of RAM. Results are as
> follows:
>
> Patched : 62.86s avg, stddev 0.65s
> Stock   : 63.73s avg, stddev 0.67s
>
> which would also seem to indicate that we're faster punting smaller
> (< 128 byte) copies.
>
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>
> Interestingly, text size is smaller with the patch as well?!
>
> I'm sure there are smarter ways to do this, but results look fairly
> conclusive. FWIW, the behavioral change was introduced by:
>
> commit 954e482bde20b0e208fd4d34ef26e10afd194600
> Author: Fenghua Yu <fenghua.yu@intel.com>
> Date:   Thu May 24 18:19:45 2012 -0700
>
>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>
> which contains nothing in terms of benchmarking or results, just claims
> that the new hotness is better.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index a9d637bc301d..7dbb78827e64 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>  	unsigned ret;
>  
> +	/*
> +	 * For smaller copies, don't use ERMS as it's slower.
> +	 */
> +	if (len < 128) {
> +		alternative_call(copy_user_generic_unrolled,
> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
> +					    "=d" (len)),
> +				"1" (to), "2" (from), "3" (len)
> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
> +		return ret;
> +	}
> +
>  	/*
>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>  	 * Otherwise, use copy_user_generic_unrolled.
>  	 */
>  	alternative_call_2(copy_user_generic_unrolled,
> -			copy_user_generic_string,
> -			X86_FEATURE_REP_GOOD,
> -			copy_user_enhanced_fast_string,
> -			X86_FEATURE_ERMS,
> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>  				    "=d" (len)),
>  			"1" (to), "2" (from), "3" (len)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Ingo Molnar @ 2018-11-21 6:36 UTC (permalink / raw)
To: Jens Axboe
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst,
    linux-kernel

[ Cc:-ed a few other gents and lkml. ]

* Jens Axboe <axboe@kernel.dk> wrote:

> Hi,
>
> So this is a fun one... While I was doing the aio polled work, I noticed
> that the submitting process spent a substantial amount of time copying
> data to/from userspace. For aio, that's iocb and io_event, which are 64
> and 32 bytes respectively. Looking closer at this, it seems that
> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
> cost.
>
> I came up with this hack to test it out, and lo and behold, we now cut
> the time spent in copying in half. 50% less.
>
> Since these kinds of patches tend to lend themselves to bike shedding, I
> also ran a string of kernel compilations out of RAM. Results are as
> follows:
>
> Patched : 62.86s avg, stddev 0.65s
> Stock   : 63.73s avg, stddev 0.67s
>
> which would also seem to indicate that we're faster punting smaller
> (< 128 byte) copies.
>
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>
> Interestingly, text size is smaller with the patch as well?!
>
> I'm sure there are smarter ways to do this, but results look fairly
> conclusive. FWIW, the behavioral change was introduced by:
>
> commit 954e482bde20b0e208fd4d34ef26e10afd194600
> Author: Fenghua Yu <fenghua.yu@intel.com>
> Date:   Thu May 24 18:19:45 2012 -0700
>
>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>
> which contains nothing in terms of benchmarking or results, just claims
> that the new hotness is better.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index a9d637bc301d..7dbb78827e64 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>  	unsigned ret;
>  
> +	/*
> +	 * For smaller copies, don't use ERMS as it's slower.
> +	 */
> +	if (len < 128) {
> +		alternative_call(copy_user_generic_unrolled,
> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
> +					    "=d" (len)),
> +				"1" (to), "2" (from), "3" (len)
> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
> +		return ret;
> +	}
> +
>  	/*
>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>  	 * Otherwise, use copy_user_generic_unrolled.
>  	 */
>  	alternative_call_2(copy_user_generic_unrolled,
> -			copy_user_generic_string,
> -			X86_FEATURE_REP_GOOD,
> -			copy_user_enhanced_fast_string,
> -			X86_FEATURE_ERMS,
> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>  				    "=d" (len)),
>  			"1" (to), "2" (from), "3" (len)

So I'm inclined to do something like yours, because clearly the changelog
of 954e482bde20 was at least partly false: Intel can say whatever they
want, it's a fact that ERMS has high setup costs for low buffer sizes -
ERMS is optimized for large-size, cache-aligned copies mainly.

But the result is counter-intuitive in terms of kernel text footprint,
plus the '128' is pretty arbitrary - we should at least try to come up
with a break-even point where the manual copy is about as fast as ERMS - on
at least a single CPU ...

Thanks,

	Ingo
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Jens Axboe @ 2018-11-21 13:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst,
    linux-kernel, pabeni

On 11/20/18 11:36 PM, Ingo Molnar wrote:
>
> [ Cc:-ed a few other gents and lkml. ]
>
> * Jens Axboe <axboe@kernel.dk> wrote:
>
>> Hi,
>>
>> So this is a fun one... While I was doing the aio polled work, I noticed
>> that the submitting process spent a substantial amount of time copying
>> data to/from userspace. For aio, that's iocb and io_event, which are 64
>> and 32 bytes respectively. Looking closer at this, it seems that
>> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
>> cost.
>>
>> I came up with this hack to test it out, and lo and behold, we now cut
>> the time spent in copying in half. 50% less.
>>
>> Since these kinds of patches tend to lend themselves to bike shedding, I
>> also ran a string of kernel compilations out of RAM. Results are as
>> follows:
>>
>> Patched : 62.86s avg, stddev 0.65s
>> Stock   : 63.73s avg, stddev 0.67s
>>
>> which would also seem to indicate that we're faster punting smaller
>> (< 128 byte) copies.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>>
>> Interestingly, text size is smaller with the patch as well?!
>>
>> I'm sure there are smarter ways to do this, but results look fairly
>> conclusive. FWIW, the behavioral change was introduced by:
>>
>> commit 954e482bde20b0e208fd4d34ef26e10afd194600
>> Author: Fenghua Yu <fenghua.yu@intel.com>
>> Date:   Thu May 24 18:19:45 2012 -0700
>>
>>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>>
>> which contains nothing in terms of benchmarking or results, just claims
>> that the new hotness is better.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>
>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>> index a9d637bc301d..7dbb78827e64 100644
>> --- a/arch/x86/include/asm/uaccess_64.h
>> +++ b/arch/x86/include/asm/uaccess_64.h
>> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>>  {
>>  	unsigned ret;
>>  
>> +	/*
>> +	 * For smaller copies, don't use ERMS as it's slower.
>> +	 */
>> +	if (len < 128) {
>> +		alternative_call(copy_user_generic_unrolled,
>> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>> +					    "=d" (len)),
>> +				"1" (to), "2" (from), "3" (len)
>> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
>> +		return ret;
>> +	}
>> +
>>  	/*
>>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>>  	 * Otherwise, use copy_user_generic_unrolled.
>>  	 */
>>  	alternative_call_2(copy_user_generic_unrolled,
>> -			copy_user_generic_string,
>> -			X86_FEATURE_REP_GOOD,
>> -			copy_user_enhanced_fast_string,
>> -			X86_FEATURE_ERMS,
>> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>  				    "=d" (len)),
>>  			"1" (to), "2" (from), "3" (len)
>
> So I'm inclined to do something like yours, because clearly the changelog
> of 954e482bde20 was at least partly false: Intel can say whatever they
> want, it's a fact that ERMS has high setup costs for low buffer sizes -
> ERMS is optimized for large-size, cache-aligned copies mainly.

I'm actually surprised that something like that was accepted, I guess
2012 was a simpler time :-)

> But the result is counter-intuitive in terms of kernel text footprint,
> plus the '128' is pretty arbitrary - we should at least try to come up
> with a break-even point where the manual copy is about as fast as ERMS - on
> at least a single CPU ...

I did some more investigation yesterday, and found this:

commit 236222d39347e0e486010f10c1493e83dbbdfba8
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Thu Jun 29 15:55:58 2017 +0200

    x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings

which does attempt to rectify it, but only using ERMS for >= 64 byte copies.
At least for me, it looks like the break-even point is higher than that, which
would mean that something like the below would be more appropriate. Adding
Paolo, in case he actually wrote a test app for this. In terms of test apps,
I'm always wary of using microbenchmarking only for this type of thing; real
world testing (if stable enough) is much more useful.

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
 	ASM_STAC
-	cmpl $64,%edx
-	jb .L_copy_short_string	/* less then 64 bytes, avoid the costly 'rep' */
+	cmpl $128,%edx
+	jb .L_copy_short_string	/* less than 128 bytes, avoid costly 'rep' */
 	movl %edx,%ecx
 1:	rep
 	movsb

-- 
Jens Axboe
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Denys Vlasenko @ 2018-11-21 13:44 UTC (permalink / raw)
To: Jens Axboe, Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Brian Gerst, linux-kernel, pabeni

On 11/21/2018 02:32 PM, Jens Axboe wrote:
> On 11/20/18 11:36 PM, Ingo Molnar wrote:
>> * Jens Axboe <axboe@kernel.dk> wrote:
>>> So this is a fun one... While I was doing the aio polled work, I noticed
>>> that the submitting process spent a substantial amount of time copying
>>> data to/from userspace. For aio, that's iocb and io_event, which are 64
>>> and 32 bytes respectively. Looking closer at this, it seems that
>>> ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
>>> cost.
>>>
>>> I came up with this hack to test it out, and lo and behold, we now cut
>>> the time spent in copying in half. 50% less.
>>>
>>> Since these kinds of patches tend to lend themselves to bike shedding, I
>>> also ran a string of kernel compilations out of RAM. Results are as
>>> follows:
>>>
>>> Patched : 62.86s avg, stddev 0.65s
>>> Stock   : 63.73s avg, stddev 0.67s
>>>
>>> which would also seem to indicate that we're faster punting smaller
>>> (< 128 byte) copies.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>>>
>>> Interestingly, text size is smaller with the patch as well?!
>>>
>>> I'm sure there are smarter ways to do this, but results look fairly
>>> conclusive. FWIW, the behavioral change was introduced by:
>>>
>>> commit 954e482bde20b0e208fd4d34ef26e10afd194600
>>> Author: Fenghua Yu <fenghua.yu@intel.com>
>>> Date:   Thu May 24 18:19:45 2012 -0700
>>>
>>>     x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>>>
>>> which contains nothing in terms of benchmarking or results, just claims
>>> that the new hotness is better.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>
>>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>>> index a9d637bc301d..7dbb78827e64 100644
>>> --- a/arch/x86/include/asm/uaccess_64.h
>>> +++ b/arch/x86/include/asm/uaccess_64.h
>>> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>>>  {
>>>  	unsigned ret;
>>>  
>>> +	/*
>>> +	 * For smaller copies, don't use ERMS as it's slower.
>>> +	 */
>>> +	if (len < 128) {
>>> +		alternative_call(copy_user_generic_unrolled,
>>> +				copy_user_generic_string, X86_FEATURE_REP_GOOD,
>>> +				ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>> +					    "=d" (len)),
>>> +				"1" (to), "2" (from), "3" (len)
>>> +				: "memory", "rcx", "r8", "r9", "r10", "r11");
>>> +		return ret;
>>> +	}
>>> +
>>>  	/*
>>>  	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>>>  	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>>>  	 * Otherwise, use copy_user_generic_unrolled.
>>>  	 */
>>>  	alternative_call_2(copy_user_generic_unrolled,
>>> -			copy_user_generic_string,
>>> -			X86_FEATURE_REP_GOOD,
>>> -			copy_user_enhanced_fast_string,
>>> -			X86_FEATURE_ERMS,
>>> +			copy_user_generic_string, X86_FEATURE_REP_GOOD,
>>> +			copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>>>  			ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>>>  				    "=d" (len)),
>>>  			"1" (to), "2" (from), "3" (len)
>>
>> So I'm inclined to do something like yours, because clearly the changelog
>> of 954e482bde20 was at least partly false: Intel can say whatever they
>> want, it's a fact that ERMS has high setup costs for low buffer sizes -
>> ERMS is optimized for large-size, cache-aligned copies mainly.
>
> I'm actually surprised that something like that was accepted, I guess
> 2012 was a simpler time :-)
>
>> But the result is counter-intuitive in terms of kernel text footprint,
>> plus the '128' is pretty arbitrary - we should at least try to come up
>> with a break-even point where the manual copy is about as fast as ERMS - on
>> at least a single CPU ...
>
> I did some more investigation yesterday, and found this:
>
> commit 236222d39347e0e486010f10c1493e83dbbdfba8
> Author: Paolo Abeni <pabeni@redhat.com>
> Date:   Thu Jun 29 15:55:58 2017 +0200
>
>     x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
>
> which does attempt to rectify it, but only using ERMS for >= 64 byte copies.
> At least for me, it looks like the break-even point is higher than that, which
> would mean that something like the below would be more appropriate.

I also tested this while working on string ops code in musl.

I think at least 128 bytes would be the minimum where "REP insn"
are more efficient. In my testing, it's more like 256 bytes...
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight @ 2018-11-22 17:36 UTC (permalink / raw)
To: 'Denys Vlasenko', Jens Axboe, Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    the arch/x86 maintainers, Linus Torvalds, Andrew Morton,
    Andy Lutomirski, Peter Zijlstra, Brian Gerst, linux-kernel, pabeni

From: Denys Vlasenko
> Sent: 21 November 2018 13:44
...
> I also tested this while working on string ops code in musl.
>
> I think at least 128 bytes would be the minimum where "REP insn"
> are more efficient. In my testing, it's more like 256 bytes...

What happens for misaligned copies?
I had a feeling that the ERMS 'rep movsb' code used some kind of
barrel shifter in that case.

The other problem with the ERMS copy is that it gets used for
copy_to/from_io() - and the 'rep movsb' on uncached locations
has to do byte copies.
Byte reads on PCIe are really horrid.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds @ 2018-11-22 17:52 UTC (permalink / raw)
To: David.Laight
Cc: dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar,
    bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Andrew Lutomirski, Peter Zijlstra, brgerst,
    Linux List Kernel Mailing, pabeni

On Thu, Nov 22, 2018 at 9:36 AM David Laight <David.Laight@aculab.com> wrote:
>
> The other problem with the ERMS copy is that it gets used
> for copy_to/from_io() - and the 'rep movsb' on uncached
> locations has to do byte copies.

Ugh. I thought we changed that *long* ago, because even our non-ERMS
copy is broken for PCI (it does overlapping stores for the small tail
cases).

But looking at "memcpy_{from,to}io()", I don't see x86 overriding it
with anything better.

I suspect nobody uses those functions for anything critical any more.
The fbcon people have their own copy functions, iirc.

But we definitely should fix this. *NONE* of the regular memcpy
functions actually work right for PCI space any more, and haven't for
a long time.

	Linus
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Andy Lutomirski @ 2018-11-22 18:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Laight, Denys Vlasenko, Jens Axboe, Ingo Molnar,
    Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
    X86 ML, Andrew Morton, Andrew Lutomirski, Peter Zijlstra,
    Brian Gerst, LKML, pabeni

On Thu, Nov 22, 2018 at 9:53 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Nov 22, 2018 at 9:36 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > The other problem with the ERMS copy is that it gets used
> > for copy_to/from_io() - and the 'rep movsb' on uncached
> > locations has to do byte copies.
>
> Ugh. I thought we changed that *long* ago, because even our non-ERMS
> copy is broken for PCI (it does overlapping stores for the small tail
> cases).
>
> But looking at "memcpy_{from,to}io()", I don't see x86 overriding it
> with anything better.
>
> I suspect nobody uses those functions for anything critical any more.
> The fbcon people have their own copy functions, iirc.
>
> But we definitely should fix this. *NONE* of the regular memcpy
> functions actually work right for PCI space any more, and haven't for
> a long time.

I'm not personally volunteering, but I suspect we can do much better
than we do now:

 - The new MOVDIRI and MOVDIR64B instructions can do big writes to WC
and UC memory. I assume those would be safe to use in ...toio()
functions, unless there are quirky devices out there that blow up if
their MMIO space is written in 64-byte chunks.

 - MOVNTDQA can, I think, do 64-byte loads, but only from WC memory.
For sufficiently large copies, it could plausibly be faster to create
a WC alias and use MOVNTDQA than it is to copy in 8- or 16-byte
chunks.

The i915 driver has a copy implementation using MOVNTDQA -- maybe this
should get promoted to something in arch/x86 called memcpy_from_wc().

--Andy
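[Editorial note: the MOVNTDQA idea Andy describes can be sketched in userspace with SSE4.1 intrinsics. This is an assumption-heavy illustration, not the i915 helper: it runs on ordinary cacheable memory (the instruction permits that; the WC-alias setup is omitted), the function names are invented, and a real `memcpy_from_wc()` would need fencing and tail handling.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <immintrin.h>

/* 16-byte MOVNTDQA loads plus ordinary stores. Assumes `from` is
 * 16-byte aligned and len is a multiple of 16; on real WC memory
 * the streaming load is what avoids the byte-at-a-time behavior. */
__attribute__((target("sse4.1")))
static void copy_stream_16(void *to, const void *from, size_t len)
{
        char *d = to;
        const char *s = from;

        for (; len >= 16; len -= 16, s += 16, d += 16) {
                __m128i v = _mm_stream_load_si128((__m128i *)(uintptr_t)s);
                _mm_storeu_si128((__m128i *)d, v);
        }
}
#endif

/* Dispatcher: take the streaming path only when the CPU supports
 * SSE4.1 and the alignment/length preconditions hold; otherwise
 * fall back to memcpy(). */
static void copy_streaming(void *to, const void *from, size_t len)
{
#if defined(__x86_64__)
        if (__builtin_cpu_supports("sse4.1") &&
            ((uintptr_t)from & 15) == 0 && (len & 15) == 0) {
                copy_stream_16(to, from, len);
                return;
        }
#endif
        memcpy(to, from, len);
}
```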
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds @ 2018-11-22 18:58 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: David.Laight, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner,
    Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers,
    Andrew Morton, Peter Zijlstra, brgerst,
    Linux List Kernel Mailing, pabeni

On Thu, Nov 22, 2018 at 10:07 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> I'm not personally volunteering, but I suspect we can do much better
> than we do now:
>
>  - The new MOVDIRI and MOVDIR64B instructions can do big writes to WC
> and UC memory.
>
>  - MOVNTDQA can, I think, do 64-byte loads, but only from WC memory.

No, performance isn't the _primary_ issue. Nobody uses MMIO and
expects high performance from the generic functions (but people may
then tweak individual drivers to do tricks).

And we've historically had various broken hardware that cares deeply
about access size. Trying to be clever and do big accesses could
easily break something.

The fact that nobody has complained about the generic memcpy routines
probably means that the broken hardware isn't in use any more, or it
just works anyway. And nobody has complained about performance either,
so it's clearly not a huge issue.

"rep movs" probably works ok on WC memory writes anyway, it's the UC
case that is bad, but I don't think anybody uses UC and then does the
"memcpy_to/fromio()" things. If you have UC memory, you tend to do the
accesses properly.

So I suspect we should just write memcpy_{to,from}io() in terms of
writel/readl.

Oh, and I just noticed that on x86 we expressly use our old "safe and
sane" functions: see __inline_memcpy(), and its use in
__memcpy_{from,to}io().

So the "falls back to memcpy" was always a red herring. We don't
actually do that.

Which explains why things work.

	Linus
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight @ 2018-11-23 9:34 UTC (permalink / raw)
To: 'Linus Torvalds', Andrew Lutomirski
Cc: dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar,
    bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni

From: Linus Torvalds
> Sent: 22 November 2018 18:58
...
> Oh, and I just noticed that on x86 we expressly use our old "safe and
> sane" functions: see __inline_memcpy(), and its use in
> __memcpy_{from,to}io().
>
> So the "falls back to memcpy" was always a red herring. We don't
> actually do that.
>
> Which explains why things work.

It doesn't explain why I've seen single byte PCIe TLP generated
by memcpy_to/fromio().

I've had to change code to use readq/writeq loops because the
byte accesses are so slow - even when PIO performance should
be 'good enough'.

It might have been changed since last time I tested it.
But I don't remember seeing a commit go by.

	David
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight @ 2018-11-23 10:12 UTC (permalink / raw)
To: David Laight, 'Linus Torvalds', 'Andrew Lutomirski'
Cc: 'dvlasenk@redhat.com', 'Jens Axboe', 'Ingo Molnar',
    'Thomas Gleixner', 'Ingo Molnar', 'bp@alien8.de', 'Peter Anvin',
    'the arch/x86 maintainers', 'Andrew Morton', 'Peter Zijlstra',
    'brgerst@gmail.com', 'Linux List Kernel Mailing',
    'pabeni@redhat.com'

From: David Laight
> Sent: 23 November 2018 09:35
> From: Linus Torvalds
> > Sent: 22 November 2018 18:58
> ...
> > Oh, and I just noticed that on x86 we expressly use our old "safe and
> > sane" functions: see __inline_memcpy(), and its use in
> > __memcpy_{from,to}io().
> >
> > So the "falls back to memcpy" was always a red herring. We don't
> > actually do that.
> >
> > Which explains why things work.
>
> It doesn't explain why I've seen single byte PCIe TLP generated
> by memcpy_to/fromio().
>
> I've had to change code to use readq/writeq loops because the
> byte accesses are so slow - even when PIO performance should
> be 'good enough'.
>
> It might have been changed since last time I tested it.
> But I don't remember seeing a commit go by.

I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
generates a lot of single byte TLP.

What the code normally does is 64bit aligned PCIe reads with multiple
writes and shifts to avoid writing beyond the end of the kernel buffer
for 'odd' length transfers.

Most of our PIO copies are actually direct to/from userspace.
While copy_to/from_user() will work on PCIe memory, it is 'rep movsb'.

We also mmap() the PCIe space into process memory - and have to be
careful not to use memcpy() in userspace.
On suitable systems userspace can use the AVX256 instructions to get
wide reads.
Much harder and more expensive in the kernel.

In practise most of the bulk data transfers are requested by the PCIe
slave. But there are times when PIO ones are needed, and 64 bit
transfers are 8 times faster than 8 bit ones.
This is all made more significant because it takes our fpga about
500ns to complete a single word PCIe read.

	David
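[Editorial note: David's workaround (aligned 64-bit reads, with the tail write trimmed so the kernel buffer is never overrun) can be sketched in plain C. This is a userspace model under stated assumptions: `model_readq` is a hypothetical stand-in for the kernel's `readq()`, plain volatile loads substitute for real MMIO, and the sketch assumes the device window tolerates a full 8-byte read at the tail.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for readq() on a memory-mapped window: one 64-bit
 * volatile load. On real hardware this is what becomes a single
 * wide PCIe read TLP instead of eight byte reads. */
static uint64_t model_readq(const volatile void *addr)
{
        return *(const volatile uint64_t *)addr;
}

/* Copy `len` bytes from an 8-byte-aligned "device" window using only
 * 64-bit reads. The final read is still full width (assumed safe on
 * the device side); only the store into the kernel-side buffer is
 * trimmed, so we never write past `to + len`. */
static void copy_fromio_q(void *to, const volatile void *from, size_t len)
{
        unsigned char *d = to;
        const volatile unsigned char *s = from;

        while (len >= 8) {
                uint64_t v = model_readq(s);
                memcpy(d, &v, 8);       /* ordinary memory store */
                d += 8;
                s += 8;
                len -= 8;
        }
        if (len) {
                uint64_t v = model_readq(s);
                memcpy(d, &v, len);     /* keep only the low `len` bytes */
        }
}
```

The same shape with `writeq()` covers the outbound direction; the point is that access width is chosen explicitly rather than left to `rep movsb`.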
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Linus Torvalds @ 2018-11-23 16:36 UTC (permalink / raw)
To: David.Laight
Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar,
    Thomas Gleixner, Ingo Molnar, bp, Peter Anvin,
    the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst,
    Linux List Kernel Mailing, pabeni

On Fri, Nov 23, 2018 at 2:12 AM David Laight <David.Laight@aculab.com> wrote:
>
> I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
> Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
> generates a lot of single byte TLP.

I just tested it too - it turns out that the __inline_memcpy() code
never triggers, and "memcpy_toio()" just generates a memcpy. So that
code seems entirely dead.

And, in fact, the codebase I looked at was the historical one, because
I had been going back and looking at the history. The modern tree
*does* have the "__inline_memcpy()" function I pointed at, but it's
not actually hooked up to anything!

This actually has been broken for _ages_. The breakage goes back to
2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and
it seems nobody really ever noticed - or thought that it was ok.

That commit claims that iomem has no special significance on x86, but
that really really isn't true, exactly because the access size does
matter. And as mentioned, the generic memory copy routines are not at
all appropriate, and that has nothing to do with ERMS. Our "do it by
hand" memory copy routine does things like this:

.Lless_16bytes:
        cmpl $8,        %edx
        jb   .Lless_8bytes
        /*
         * Move data from 8 bytes to 15 bytes.
         */
        movq 0*8(%rsi),         %r8
        movq -1*8(%rsi, %rdx),  %r9
        movq %r8,       0*8(%rdi)
        movq %r9,       -1*8(%rdi, %rdx)
        retq

and note how for an 8-byte copy, it will do *two* reads of the same 8
bytes, and *two* writes of the same 8 byte destination. That's
perfectly ok for regular memory, and it means that the code can handle
an arbitrary 8-15 byte copy without any conditionals or loop counts,
but it is *not* ok for iomem.

Of course, in practice it all just happens to work in almost all
situations (because a lot of iomem devices simply won't care), and
manual access to iomem is basically extremely rare these days anyway,
but it's definitely entirely and utterly broken.

End result: we *used* to do this right. For the last eight years our
"memcpy_{to,from}io()" has been entirely broken, and apparently even
the people who noticed oddities like David, never reported it as
breakage but instead just worked around it in drivers.

Ho humm.

Let me write a generic routine in lib/iomap_copy.c (which already does
the "user specifies chunk size" cases), and hook it up for x86.

David, are you using a bus analyzer or something to do your testing?
I'll have a trial patch for you asap.

	Linus
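[Editorial note: the overlapping-store trick in the `.Lless_16bytes` sequence can be written in C, which makes Linus's point concrete: it is correct for normal memory but reads and rewrites some of the same bytes twice, which is exactly what iomem cannot tolerate. The function name is invented; the `memcpy(&tmp, p, 8)` idiom is just the portable C spelling of a single 8-byte load/store.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy 8..15 bytes with exactly two 8-byte loads and two 8-byte
 * stores, no loop and no branch on the exact length: when len < 16
 * the second pair overlaps the first, so len - 8 bytes in the middle
 * are read and written twice. */
static void copy_8_to_15(void *to, const void *from, size_t len)
{
        uint64_t head, tail;

        memcpy(&head, from, 8);                          /* movq 0*8(%rsi), %r8      */
        memcpy(&tail, (const char *)from + len - 8, 8);  /* movq -1*8(%rsi,%rdx), %r9 */
        memcpy(to, &head, 8);                            /* movq %r8, 0*8(%rdi)      */
        memcpy((char *)to + len - 8, &tail, 8);          /* movq %r9, -1*8(%rdi,%rdx) */
}
```

On a device with read or write side effects, the doubled accesses (and the misaligned second pair) are visible on the bus, which is why this pattern is fine for RAM and wrong for MMIO.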
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 16:36 ` Linus Torvalds @ 2018-11-23 17:42 ` Linus Torvalds 2018-11-23 18:39 ` Andy Lutomirski 2018-11-26 10:01 ` David Laight 2018-11-26 10:26 ` David Laight 2 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-23 17:42 UTC (permalink / raw) To: David.Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni [-- Attachment #1: Type: text/plain, Size: 1291 bytes --] On Fri, Nov 23, 2018 at 8:36 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Let me write a generic routine in lib/iomap_copy.c (which already does > the "user specifies chunk size" cases), and hook it up for x86. Something like this? ENTIRELY UNTESTED! It might not compile. Seriously. And if it does compile, it might not work. And this doesn't actually do the memset_io() function at all, just the memcpy ones. Finally, it's worth noting that on x86, we have this: /* * override generic version in lib/iomap_copy.c */ ENTRY(__iowrite32_copy) movl %edx,%ecx rep movsd ret ENDPROC(__iowrite32_copy) because back in 2006, we did this: [PATCH] Add faster __iowrite32_copy routine for x86_64 This assembly version is measurably faster than the generic version in lib/iomap_copy.c. which actually implies that "rep movsd" is faster than doing __raw_writel() by hand. So it is possible that this should all be arch-specific code rather than that butt-ugly "generic" code I wrote in this patch. End result: I'm not really all that happy about this patch, but it's perhaps worth testing, and it's definitely worth discussing. Because our current memcpy_{to,from}io() is truly broken garbage. 
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 5129 bytes --]

 arch/x86/include/asm/io.h |   6 ++
 include/linux/io.h        |   2 +
 lib/iomap_copy.c          | 153 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 161 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 832da8229cc7..3b9206ee25b8 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -92,6 +92,12 @@ build_mmio_write(__writel, "l", unsigned int, "r", )

 #define mmiowb() barrier()

+void __iowrite_copy(void __iomem *to, const void *from, size_t count);
+void __ioread_copy(void *to, const void __iomem *from, size_t count);
+
+#define memcpy_toio __iowrite_copy
+#define memcpy_fromio __ioread_copy
+
 #ifdef CONFIG_X86_64

 build_mmio_read(readq, "q", u64, "=r", :"memory")
diff --git a/include/linux/io.h b/include/linux/io.h
index 32e30e8fb9db..642f78970018 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -28,6 +28,8 @@
 struct device;
 struct resource;

+void __ioread_copy(void *to, const void __iomem *from, size_t count);
+void __iowrite_copy(void __iomem *to, const void *from, size_t count);
 __visible void __iowrite32_copy(void __iomem *to, const void *from, size_t count);
 void __ioread32_copy(void *to, const void __iomem *from, size_t count);
 void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
diff --git a/lib/iomap_copy.c b/lib/iomap_copy.c
index b8f1d6cbb200..8edc359dda62 100644
--- a/lib/iomap_copy.c
+++ b/lib/iomap_copy.c
@@ -17,6 +17,159 @@
 #include <linux/export.h>
 #include <linux/io.h>

+#include <asm/unaligned.h>
+
+static inline bool iomem_align(const void __iomem *ptr, int size, int count)
+{
+	return count >= size && (__force unsigned long)ptr & size;
+}
+
+
+/**
+ * __iowrite_copy - copy data to MMIO space
+ * @to: destination, in MMIO space
+ * @from: source
+ * @count: number of bytes to copy.
+ *
+ * Copy arbitrarily aligned data from kernel space to MMIO space,
+ * using reasonable chunking.
+ */
+void __attribute__((weak)) __iowrite_copy(void __iomem *to,
+	const void *from,
+	size_t count)
+{
+	if (iomem_align(to, 1, count)) {
+		unsigned char data = *(unsigned char *)from;
+		__raw_writeb(data, to);
+		from++;
+		to++;
+		count--;
+	}
+	if (iomem_align(to, 2, count)) {
+		unsigned short data = get_unaligned((unsigned short *)from);
+		__raw_writew(data, to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+#ifdef CONFIG_64BIT
+	if (iomem_align(to, 4, count)) {
+		unsigned int data = get_unaligned((unsigned int *)from);
+		__raw_writel(data, to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+	while (count >= sizeof(unsigned long)) {
+		unsigned long data = get_unaligned((unsigned long *)from);
+#ifdef CONFIG_64BIT
+		__raw_writeq(data, to);
+#else
+		__raw_writel(data, to);
+#endif
+		from += sizeof(unsigned long);
+		to += sizeof(unsigned long);
+		count -= sizeof(unsigned long);
+	}
+
+#ifdef CONFIG_64BIT
+	if (count >= 4) {
+		unsigned int data = get_unaligned((unsigned int *)from);
+		__raw_writel(data, to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+
+	if (count >= 2) {
+		unsigned short data = get_unaligned((unsigned short *)from);
+		__raw_writew(data, to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+
+	if (count) {
+		unsigned char data = *(unsigned char *)from;
+		__raw_writeb(data, to);
+	}
+}
+EXPORT_SYMBOL_GPL(__iowrite_copy);
+
+/**
+ * __ioread_copy - copy data from MMIO space
+ * @to: destination
+ * @from: source, in MMIO space
+ * @count: number of bytes to copy.
+ *
+ * Copy arbitrarily aligned data from MMIO space to kernel space,
+ * using reasonable chunking.
+ */
+void __attribute__((weak)) __ioread_copy(void *to,
+	const void __iomem *from,
+	size_t count)
+{
+	if (iomem_align(from, 1, count)) {
+		unsigned char data = __raw_readb(from);
+		put_unaligned(data, (unsigned char *) to);
+		from++;
+		to++;
+		count--;
+	}
+	if (iomem_align(to, 2, count)) {
+		unsigned short data = __raw_readw(from);
+		put_unaligned(data, (unsigned short *) to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+#ifdef CONFIG_64BIT
+	if (iomem_align(to, 4, count)) {
+		unsigned int data = __raw_readl(from);
+		put_unaligned(data, (unsigned int *) to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+	while (count >= sizeof(unsigned long)) {
+#ifdef CONFIG_64BIT
+		unsigned long data = __raw_readq(from);
+#else
+		unsigned long data = __raw_readl(from);
+#endif
+		put_unaligned(data, (unsigned long *) to);
+		from += sizeof(unsigned long);
+		to += sizeof(unsigned long);
+		count -= sizeof(unsigned long);
+	}
+
+#ifdef CONFIG_64BIT
+	if (count >= 4) {
+		unsigned int data = __raw_readl(from);
+		put_unaligned(data, (unsigned int *) to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+
+	if (count >= 2) {
+		unsigned short data = __raw_readw(from);
+		put_unaligned(data, (unsigned short *) to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+
+	if (count) {
+		unsigned char data = __raw_readb(from);
+		put_unaligned(data, (unsigned char *) to);
+	}
+}
+EXPORT_SYMBOL_GPL(__ioread_copy);

 /**
  * __iowrite32_copy - copy data to MMIO space, in 32-bit units

^ permalink raw reply related	[flat|nested] 39+ messages in thread
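The chunking idea in the patch above can be rendered as userspace C, with plain memory accesses standing in for the `__raw_read*()` MMIO accessors (helper name hypothetical, and the 1/2/4-byte head ladder collapsed into a byte loop for brevity). One thing worth flagging: in the posted `__ioread_copy()`, the 2- and 4-byte head steps test `iomem_align(to, ...)`, while it is presumably the MMIO-side pointer `from` whose alignment matters:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Align on the (stand-in) MMIO pointer, copy native words, drain the
 * tail -- every byte is accessed exactly once, unlike the overlapping
 * trick used by the regular memcpy. */
static void ioread_copy_sketch(void *to, const void *from, size_t count)
{
	unsigned char *dst = to;
	const unsigned char *src = from;

	/* head: byte accesses until the MMIO side is word-aligned */
	while (count && ((uintptr_t)src & (sizeof(unsigned long) - 1))) {
		*dst++ = *src++;	/* __raw_readb() stand-in */
		count--;
	}
	/* body: aligned word-at-a-time accesses */
	while (count >= sizeof(unsigned long)) {
		unsigned long data;

		memcpy(&data, src, sizeof(data));	/* __raw_readq()/__raw_readl() stand-in */
		memcpy(dst, &data, sizeof(data));	/* put_unaligned() stand-in */
		src += sizeof(data);
		dst += sizeof(data);
		count -= sizeof(data);
	}
	/* tail: byte accesses for whatever is left */
	while (count--)
		*dst++ = *src++;
}
```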
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 17:42 ` Linus Torvalds @ 2018-11-23 18:39 ` Andy Lutomirski 2018-11-23 18:44 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Andy Lutomirski @ 2018-11-23 18:39 UTC (permalink / raw) To: Linus Torvalds Cc: David.Laight, Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni > On Nov 23, 2018, at 10:42 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Nov 23, 2018 at 8:36 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Let me write a generic routine in lib/iomap_copy.c (which already does >> the "user specifies chunk size" cases), and hook it up for x86. > > Something like this? > > ENTIRELY UNTESTED! It might not compile. Seriously. And if it does > compile, it might not work. > > And this doesn't actually do the memset_io() function at all, just the > memcpy ones. > > Finally, it's worth noting that on x86, we have this: > > /* > * override generic version in lib/iomap_copy.c > */ > ENTRY(__iowrite32_copy) > movl %edx,%ecx > rep movsd > ret > ENDPROC(__iowrite32_copy) > > because back in 2006, we did this: > > [PATCH] Add faster __iowrite32_copy routine for x86_64 > > This assembly version is measurably faster than the generic version in > lib/iomap_copy.c. > > which actually implies that "rep movsd" is faster than doing > __raw_writel() by hand. > > So it is possible that this should all be arch-specific code rather > than that butt-ugly "generic" code I wrote in this patch. > > End result: I'm not really all that happy about this patch, but it's > perhaps worth testing, and it's definitely worth discussing. Because > our current memcpy_{to,from}io() is truly broken garbage. > > What is memcpy_to_io even supposed to do? 
I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful. I could see a function that writes to aligned memory in specified-sized chunks. And I can see a use for a function to just write it in whatever size chunks the architecture thinks is fastest, and *that* should probably use MOVDIR64B. Or is there some subtlety I’m missing? ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 18:39 ` Andy Lutomirski @ 2018-11-23 18:44 ` Linus Torvalds 2018-11-23 19:11 ` Andy Lutomirski 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-23 18:44 UTC (permalink / raw) To: Andy Lutomirski Cc: David.Laight, Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <luto@amacapital.net> wrote: > > What is memcpy_to_io even supposed to do? I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful. We've got hundreds of users of it, so it's fairly common.. > I could see a function that writes to aligned memory in specified-sized chunks. We have that. It's called "__iowrite{32,64}_copy()". It has very few users. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
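For reference, the `__iowrite{32,64}_copy()` family Linus points at is a fixed-width loop; roughly this, with a plain store standing in for `__raw_writel()` (note that `count` is in 32-bit units, not bytes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 'count' 32-bit words: exactly one aligned 32-bit store per word,
 * in order, each word written exactly once -- the well-defined access
 * pattern that plain memcpy_toio() does not guarantee. */
static void iowrite32_copy_sketch(uint32_t *to, const uint32_t *from, size_t count)
{
	while (count--)
		*to++ = *from++;	/* __raw_writel(*from++, to++) in the kernel */
}
```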
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 18:44 ` Linus Torvalds @ 2018-11-23 19:11 ` Andy Lutomirski 2018-11-26 10:12 ` David Laight 0 siblings, 1 reply; 39+ messages in thread From: Andy Lutomirski @ 2018-11-23 19:11 UTC (permalink / raw) To: Linus Torvalds Cc: David.Laight, Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni > On Nov 23, 2018, at 11:44 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <luto@amacapital.net> wrote: >> >> What is memcpy_to_io even supposed to do? I’m guessing it’s defined as something like “copy this data to IO space using at most long-sized writes, all aligned, and writing each byte exactly once, in order.” That sounds... dubiously useful. > > We've got hundreds of users of it, so it's fairly common.. > I’m wondering if the “at most long-sizes” restriction matters, especially given that we’re apparently accessing some of the same bytes more than once. I would believe that trying to encourage 16-byte writes (with AVX, ugh) or 64-byte writes (with MOVDIR64B) would be safe and could meaningfully speed up some workloads. >> I could see a function that writes to aligned memory in specified-sized chunks. > > We have that. It's called "__iowrite{32,64}_copy()". It has very few users. > > Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
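Andy's wider-store idea, sketched in userspace with unaligned 16-byte SSE copies (function name hypothetical; in the kernel any SSE/AVX use would also need `kernel_fpu_begin()`/`kernel_fpu_end()` — part of the "ugh" — and MOVDIR64B additionally needs hardware support):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Copy in 16-byte chunks where possible, single bytes for the remainder. */
static void copy_16byte_chunks(void *to, const void *from, size_t count)
{
	unsigned char *dst = to;
	const unsigned char *src = from;

#if defined(__x86_64__)
	while (count >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)src);
		_mm_storeu_si128((__m128i *)dst, v);
		src += 16;
		dst += 16;
		count -= 16;
	}
#endif
	while (count--)
		*dst++ = *src++;
}
```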
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 19:11 ` Andy Lutomirski @ 2018-11-26 10:12 ` David Laight 0 siblings, 0 replies; 39+ messages in thread From: David Laight @ 2018-11-26 10:12 UTC (permalink / raw) To: 'Andy Lutomirski', Linus Torvalds Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Andy Lutomirski > Sent: 23 November 2018 19:11 > > On Nov 23, 2018, at 11:44 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > >> On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski <luto@amacapital.net> wrote: > >> > >> What is memcpy_to_io even supposed to do? I’m guessing it’s defined as > >> something like “copy this data to IO space using at most long-sized writes, > >> all aligned, and writing each byte exactly once, in order.” > >> That sounds... dubiously useful. > > > > We've got hundreds of users of it, so it's fairly common.. > > I’m wondering if the “at most long-sizes” restriction matters, especially > given that we’re apparently accessing some of the same bytes more than once. > I would believe that trying to encourage 16-byte writes (with AVX, ugh) or > 64-byte writes (with MOVDIR64B) would be safe and could meaningfully speed > up some workloads. The real gains come from increasing the width of IO reads, not IO writes. None of the x86 cpus I've got issue multiple concurrent PCIe reads (the PCIe completion tag seems to match the core number). PCIe writes are all 'posted' so there aren't big gaps between them. > >> I could see a function that writes to aligned memory in specified-sized chunks. > > > > We have that. It's called "__iowrite{32,64}_copy()". It has very few users. For x86 you want separate entry points for the 'rep movq' copy and one using an instruction loop. (Perhaps with guidance to the cutover length.) 
In most places the driver will know whether the size is above or below the cutover - which might be 256. Certainly transfers below 64 bytes are 'short'. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 16:36 ` Linus Torvalds 2018-11-23 17:42 ` Linus Torvalds @ 2018-11-26 10:01 ` David Laight 2018-11-26 10:26 ` David Laight 2 siblings, 0 replies; 39+ messages in thread From: David Laight @ 2018-11-26 10:01 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 23 November 2018 16:36 > > On Fri, Nov 23, 2018 at 2:12 AM David Laight <David.Laight@aculab.com> wrote: > > > > I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel. > > Calling memcpy_fromio(kernel_buffer, PCIe_address, length) > > generates a lot of single byte TLP. > > I just tested it too - it turns out that the __inline_memcpy() code > never triggers, and "memcpy_toio()" just generates a memcpy. > > So that code seems entirely dead. > > And, in fact, the codebase I looked at was the historical one, because > I had been going back and looking at the history. The modern tree > *does* have the "__inline_memcpy()" function I pointed at, but it's > not actually hooked up to anything! > > This actually has been broken for _ages_. The breakage goes back to > 2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and > it seems nobody really ever noticed - or thought that it was ok. It probably was ok in 2010 - that predates ERMS copy. > That commit claims that iomem has no special significance on x86, but > that really really isn't true, exactly because the access size does > matter. I suspect that memcpy_to/fromio() should only be used for IO space that has 'memory-like' semantics. So it shouldn't really matter what size accesses are done. OTOH the 'io' side is likely to be slow so you want to limit the number of io cycles (reads in particular). 
With memory-like semantics it is ok to read full words at both ends of
the buffer to avoid extra transfers. Indeed, on PCIe, the misaligned
transfer for the last 8 bytes is probably optimal.

> And as mentioned, the generic memory copy routines are not at all
> appropriate, and that has nothing to do with ERMS. Our "do it by hand"
> memory copy routine does things like this:
>
> .Lless_16bytes:
>	cmpl $8, %edx
>	jb .Lless_8bytes
>	/*
>	 * Move data from 8 bytes to 15 bytes.
>	 */
>	movq 0*8(%rsi), %r8
>	movq -1*8(%rsi, %rdx), %r9
>	movq %r8, 0*8(%rdi)
>	movq %r9, -1*8(%rdi, %rdx)
>	retq
>
> and note how for a 8-byte copy, it will do *two* reads of the same 8
> bytes, and *two* writes of the same 8 byte destination. That's
> perfectly ok for regular memory, and it means that the code can handle
> an arbitrary 8-15 byte copy without any conditionals or loop counts,
> but it is *not* ok for iomem.

I'd say it is ok for memcpy_to/fromio() since that should only really be
used for targets with memory-like semantics - and could be documented as
such. But doing the same transfers twice is definitely sub-optimal.

> Of course, in practice it all just happens to work in almost all
> situations (because a lot of iomem devices simply won't care), and
> manual access to iomem is basically extremely rare these days anyway,
> but it's definitely entirely and utterly broken.
>
> End result: we *used* to do this right. For the last eight years our
> "memcpy_{to,from}io()" has been entirely broken, and apparently even
> the people who noticed oddities like David, never reported it as
> breakage but instead just worked around it in drivers.
>
> Ho humm.
>
> Let me write a generic routine in lib/iomap_copy.c (which already does
> the "user specifies chunk size" cases), and hook it up for x86.
>
> David, are you using a bus analyzer or something to do your testing?
> I'll have a trial patch for you asap.
We've a TLP monitor built into our fpga image so we can look at the last
few TLPs (IIRC a 32kB buffer).

Testing patches is a bit harder. The test system isn't one I build
kernels on. Is there a 'sensible' amd64 kernel config that contains most
of the drivers a modern system might need? It is a PITA trying to build
kernels that will load on all my test systems (since I tend to move the
disks between systems). Rebuilding the ubuntu config just takes too long
and generates a ramdisk that doesn't fit in /boot.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-23 16:36 ` Linus Torvalds 2018-11-23 17:42 ` Linus Torvalds 2018-11-26 10:01 ` David Laight @ 2018-11-26 10:26 ` David Laight 2019-01-05 2:38 ` Linus Torvalds 2 siblings, 1 reply; 39+ messages in thread From: David Laight @ 2018-11-26 10:26 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 23 November 2018 16:36 ... > End result: we *used* to do this right. For the last eight years our > "memcpy_{to,from}io()" has been entirely broken, and apparently even > the people who noticed oddities like David, never reported it as > breakage but instead just worked around it in drivers. I've mentioned it several times... Probably no one else noticed lots of single byte transfers while testing a TLP monitor he was writing for an FPGA :-) They are far too expensive to buy, and would never be connected to the right system at the right time - so we (I) wrote one. Unfortunately we don't really get to see what happens when the link comes up (or rather doesn't come up). We only get the LTSSM state transitions. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-26 10:26 ` David Laight @ 2019-01-05 2:38 ` Linus Torvalds 2019-01-07 9:55 ` David Laight 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2019-01-05 2:38 UTC (permalink / raw) To: David Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni Coming back to this old thread, because I've spent most of the day resurrecting some of my old core x86 patches, and one of them was for the issue David Laight complained about: horrible memcpy_toio() performance. Yes, I should have done this before the merge window instead of at the end of it, but I didn't want to delay things for yet another release just because it fell through some cracks. Anyway, it would be lovely to hear whether memcpy_toio() now works reasonably. I just picked our very old legacy function for this, so it will do things in 32-bit chunks (even on x86-64), and I'm certainly open to somebody doing something smarter, but considering that nobody else seemed to show any interest in this at all, I just went "whatever, good enough". I tried to make it easy to improve on things if people want to. The other ancient patch I resurrected was the old "use asm goto for put_user" which I've had in a private branch for the last almost three years. I've tried to test this (it turns out I had completely screwed up the 32-bit case for put_user, for example), but I only have 64-bit user space, so the i386 build ended up being just about building and looking superficially at the code generation in a couple of places. More testing and comments appreciated. Now I have no ancient patches in any branches, or any known pending issue. Except for all the pull requests that are piling up because I didn't do them today since I was spending time on my own patches. Oh well. 
There's always tomorrow. Linus On Mon, Nov 26, 2018 at 2:26 AM David Laight <David.Laight@aculab.com> wrote: > > From: Linus Torvalds > > Sent: 23 November 2018 16:36 > ... > > End result: we *used* to do this right. For the last eight years our > > "memcpy_{to,from}io()" has been entirely broken, and apparently even > > the people who noticed oddities like David, never reported it as > > breakage but instead just worked around it in drivers. > > I've mentioned it several times... > > Probably no one else noticed lots of single byte transfers while > testing a TLP monitor he was writing for an FPGA :-) > They are far too expensive to buy, and would never be connected > to the right system at the right time - so we (I) wrote one. > > Unfortunately we don't really get to see what happens when the > link comes up (or rather doesn't come up). We only get the > LTSSM state transitions. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-05 2:38 ` Linus Torvalds @ 2019-01-07 9:55 ` David Laight 2019-01-07 17:43 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: David Laight @ 2019-01-07 9:55 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 05 January 2019 02:39 ... > Anyway, it would be lovely to hear whether memcpy_toio() now works > reasonably. I just picked our very old legacy function for this, so it > will do things in 32-bit chunks (even on x86-64), and I'm certainly > open to somebody doing something smarter, but considering that nobody > else seemed to show any interest in this at all, I just went > "whatever, good enough". > > I tried to make it easy to improve on things if people want to. I'll do some tests once the merge has had time to settle. I needed to open-code one part because it wants to do copy_to_user() from a PCIe address buffer (which has to work). Using 64bit chunks for reads is probably worth while on x86-64. I might cook up a patch. Actually, if the AVX registers are available without an fpu save using larger chunks would be worthwhile - especially for io reads. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-07 9:55 ` David Laight @ 2019-01-07 17:43 ` Linus Torvalds 2019-01-08 9:10 ` David Laight 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2019-01-07 17:43 UTC (permalink / raw) To: David Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni On Mon, Jan 7, 2019 at 1:55 AM David Laight <David.Laight@aculab.com> wrote: > > I needed to open-code one part because it wants to do copy_to_user() > from a PCIe address buffer (which has to work). It will never work for memcpy_fromio(). Any driver that thinks it will copy from io space to user space absolutely *has* to do it by hand. No questions, and no exceptions. Some loop like for (..) put_user(readl(iomem++), uaddr++); because neither copy_to_user() nor memcpy_fromio() will *ever* handle that correctly. They might randomly happen to work on x86, but absolutely nowhere else. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* RE: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-07 17:43 ` Linus Torvalds @ 2019-01-08 9:10 ` David Laight 2019-01-08 18:01 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: David Laight @ 2019-01-08 9:10 UTC (permalink / raw) To: 'Linus Torvalds' Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni From: Linus Torvalds > Sent: 07 January 2019 17:44 > On Mon, Jan 7, 2019 at 1:55 AM David Laight <David.Laight@aculab.com> wrote: > > > > I needed to open-code one part because it wants to do copy_to_user() > > from a PCIe address buffer (which has to work). > > It will never work for memcpy_fromio(). Any driver that thinks it will > copy from io space to user space absolutely *has* to do it by hand. No > questions, and no exceptions. Some loop like > > for (..) > put_user(readl(iomem++), uaddr++); > > because neither copy_to_user() nor memcpy_fromio() will *ever* handle > that correctly. > > They might randomly happen to work on x86, but absolutely nowhere else. Actually they tend to handle it on a lot of systems. (But I don't do it.) Probably most of those where vm_iomap_memory() (to map IO memory space directly into user pages) works. It might be 'interesting' to build an amd64 kernel where all the IO memory addresses (eg returned by pci_iomap()) are offset by a large constant so direct accesses all fault and all the readl() macros (etc) add it back in. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2019-01-08 9:10 ` David Laight @ 2019-01-08 18:01 ` Linus Torvalds 0 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2019-01-08 18:01 UTC (permalink / raw) To: David Laight Cc: Andrew Lutomirski, dvlasenk, Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Peter Zijlstra, brgerst, Linux List Kernel Mailing, pabeni On Tue, Jan 8, 2019 at 1:10 AM David Laight <David.Laight@aculab.com> wrote: > > > > It will never work for memcpy_fromio(). Any driver that thinks it will > > copy from io space to user space absolutely *has* to do it by hand. No > > questions, and no exceptions. Some loop like > > > > for (..) > > put_user(readl(iomem++), uaddr++); > > > > because neither copy_to_user() nor memcpy_fromio() will *ever* handle > > that correctly. > > > > They might randomly happen to work on x86, but absolutely nowhere else. > > Actually they tend to handle it on a lot of systems. Not with memcpy_fromio(), at least. That doesn't work even on x86. Try it. If the user space page is swapped out (or not mapped), you'd get a kernel page fault. And if you do "copy_to_user()" from a mmio region, you get what you get. If somebody complains about it doing a byte-at-a-time copy, I'll laugh in their face and tell them to fix their broken driver. It might work on about half the architectures out there, but it's still complete garbage, and it's not a bug in copy_to_user(). Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 13:32 ` Jens Axboe 2018-11-21 13:44 ` Denys Vlasenko @ 2018-11-21 13:45 ` Paolo Abeni 2018-11-21 17:27 ` Linus Torvalds 1 sibling, 1 reply; 39+ messages in thread From: Paolo Abeni @ 2018-11-21 13:45 UTC (permalink / raw) To: Jens Axboe, Ingo Molnar Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin, the arch/x86 maintainers, Linus Torvalds, Andrew Morton, Andy Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst, linux-kernel On Wed, 2018-11-21 at 06:32 -0700, Jens Axboe wrote: > I did some more investigation yesterday, and found this: > > commit 236222d39347e0e486010f10c1493e83dbbdfba8 > Author: Paolo Abeni <pabeni@redhat.com> > Date: Thu Jun 29 15:55:58 2017 +0200 > > x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings > > which does attempt to rectify it, but only using ERMS for >= 64 byte copies. > At least for me, looks like the break even point is higher than that, which > would mean that something like the below would be more appropriate. Back then I used a custom kernel module and the tsc to micro-benchmark the patched function and the original one with different buffer sizes. I'm sorry, the relevant code has been lost. In my experiments 64 bytes was the break even point for all the CPUs I had handy, but I guess that may change with other models. Cheers, Paolo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 13:45 ` Paolo Abeni @ 2018-11-21 17:27 ` Linus Torvalds 2018-11-21 18:04 ` Jens Axboe 2018-11-21 18:16 ` Linus Torvalds 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 17:27 UTC (permalink / raw) To: pabeni Cc: Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@redhat.com> wrote: > > In my experiments 64 bytes was the break even point for all the CPUs I > had handy, but I guess that may change with other models. Note that experiments with memcpy speed are almost invariably broken. microbenchmarks don't show the impact of I$, but they also don't show the impact of _behavior_. For example, there might be things like "repeat strings do cacheline optimizations" that end up meaning that cachelines stay in L2, for example, and are never brought into L1. That can be a really good thing, but it can also mean that now the result isn't as close to the CPU, and the subsequent use of the cacheline can be costlier. I say "go for upping the limit to 128 bytes". That said, if the aio user copy is _so_ critical that it's this noticeable, there may be other issues. Sometimes _real_ cost of small user copies is often the STAC/CLAC, more so than the "rep movs". It would be interesting to know exactly which copy it is that matters so much... *inlining* the erms case might show that nicely in profiles. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 17:27 ` Linus Torvalds @ 2018-11-21 18:04 ` Jens Axboe 2018-11-21 18:26 ` Andy Lutomirski 2018-11-21 18:16 ` Linus Torvalds 1 sibling, 1 reply; 39+ messages in thread From: Jens Axboe @ 2018-11-21 18:04 UTC (permalink / raw) To: Linus Torvalds, pabeni Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On 11/21/18 10:27 AM, Linus Torvalds wrote: > On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@redhat.com> wrote: >> >> In my experiments 64 bytes was the break even point for all the CPUs I >> had handy, but I guess that may change with other models. > > Note that experiments with memcpy speed are almost invariably broken. > microbenchmarks don't show the impact of I$, but they also don't show > the impact of _behavior_. > > For example, there might be things like "repeat strings do cacheline > optimizations" that end up meaning that cachelines stay in L2, for > example, and are never brought into L1. That can be a really good > thing, but it can also mean that now the result isn't as close to the > CPU, and the subsequent use of the cacheline can be costlier. Totally agree, which is why all my testing was NOT microbenchmarking. > I say "go for upping the limit to 128 bytes". See below... > That said, if the aio user copy is _so_ critical that it's this > noticeable, there may be other issues. Sometimes _real_ cost of small > user copies is often the STAC/CLAC, more so than the "rep movs". > > It would be interesting to know exactly which copy it is that matters > so much... *inlining* the erms case might show that nicely in > profiles. Oh I totally agree, which is why I since went a different route. The copy that matters is the copy_from_user() of the iocb, which is 64 bytes. Even for 4k IOs, copying 64b per IO is somewhat counter productive for O_DIRECT. 
Playing around with this: http://git.kernel.dk/cgit/linux-block/commit/?h=aio-poll&id=ed0a0a445c0af4cfd18b0682511981eaf352d483 since we're doing a new sys_io_setup2() for polled aio anyway. This completely avoids the iocb copy, but that's just for my initial particular gripe. diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index db4e5aa0858b..21c4d68c5fac 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string) */ ENTRY(copy_user_enhanced_fast_string) ASM_STAC - cmpl $64,%edx - jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */ + cmpl $128,%edx + jb .L_copy_short_string /* less than 128 bytes, avoid costly 'rep' */ movl %edx,%ecx 1: rep movsb -- Jens Axboe ^ permalink raw reply related [flat|nested] 39+ messages in thread
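[Editor's note: in C terms, the one-instruction threshold change in the patch above amounts to the following dispatch. This is a sketch only; both helpers are illustrative stand-ins for the kernel's asm routines, not real symbols, and plain memcpy() is used so the sketch is self-contained.]

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins for the two asm paths: the unrolled
 * short-string copy and the ERMS 'rep movsb' loop.  Both just
 * memcpy() here; only which path gets taken is the point. */
static void copy_short_string(void *to, const void *from, size_t len)
{
	memcpy(to, from, len);
}

static void copy_rep_movsb(void *to, const void *from, size_t len)
{
	memcpy(to, from, len);
}

/* What 'cmpl $128,%edx; jb .L_copy_short_string' does: copies below
 * 128 bytes skip 'rep movsb', whose fixed startup cost dominates
 * on short copies. */
void copy_user_sketch(void *to, const void *from, size_t len)
{
	if (len < 128)
		copy_short_string(to, from, len);
	else
		copy_rep_movsb(to, from, len);
}
```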
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:04 ` Jens Axboe @ 2018-11-21 18:26 ` Andy Lutomirski 2018-11-21 18:43 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Andy Lutomirski @ 2018-11-21 18:26 UTC (permalink / raw) To: Jens Axboe, dave.hansen Cc: Linus Torvalds, pabeni, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing > On Nov 21, 2018, at 11:04 AM, Jens Axboe <axboe@kernel.dk> wrote: > >> On 11/21/18 10:27 AM, Linus Torvalds wrote: >>> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@redhat.com> wrote: >>> >>> In my experiments 64 bytes was the break even point for all the CPUs I >>> had handy, but I guess that may change with other models. >> >> Note that experiments with memcpy speed are almost invariably broken. >> microbenchmarks don't show the impact of I$, but they also don't show >> the impact of _behavior_. >> >> For example, there might be things like "repeat strings do cacheline >> optimizations" that end up meaning that cachelines stay in L2, for >> example, and are never brought into L1. That can be a really good >> thing, but it can also mean that now the result isn't as close to the >> CPU, and the subsequent use of the cacheline can be costlier. > > Totally agree, which is why all my testing was NOT microbenchmarking. > >> I say "go for upping the limit to 128 bytes". > > See below... > >> That said, if the aio user copy is _so_ critical that it's this >> noticeable, there may be other issues. Sometimes _real_ cost of small >> user copies is often the STAC/CLAC, more so than the "rep movs". >> >> It would be interesting to know exactly which copy it is that matters >> so much... *inlining* the erms case might show that nicely in >> profiles. > > Oh I totally agree, which is why I since went a different route. 
The > copy that matters is the copy_from_user() of the iocb, which is 64 > bytes. Even for 4k IOs, copying 64b per IO is somewhat counter > productive for O_DIRECT. Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory? Intel already did all the dirty work of giving something resembling sane semantics for the kernel doing a user-privileged access with WRUSS. How about WRUSER, RDUSER, and maybe even the REP variants? And I suppose LOCK CMPXCHGUSER. Or Intel could try to make STAC and CLAC be genuinely fast (0 or 1 cycles and no stalls *ought* to be possible if it were handled in the front end, as long as there aren’t any PUSHF or POPF instructions in the pipeline). As it stands, I assume that both instructions prevent any following memory accesses from starting until they retire, and they might even be nastily microcoded to handle the overloading of AC. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:26 ` Andy Lutomirski @ 2018-11-21 18:43 ` Linus Torvalds 2018-11-21 22:38 ` Andy Lutomirski 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 18:43 UTC (permalink / raw) To: Andy Lutomirski Cc: Jens Axboe, dave.hansen, pabeni, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On Wed, Nov 21, 2018 at 10:26 AM Andy Lutomirski <luto@amacapital.net> wrote: > > Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory? I did that long ago. It's why we have CLAC/STAC today. I was told that what I actually asked for (get an instruction to access user space - I suggested using a segment override prefix) was not viable. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:43 ` Linus Torvalds @ 2018-11-21 22:38 ` Andy Lutomirski 0 siblings, 0 replies; 39+ messages in thread From: Andy Lutomirski @ 2018-11-21 22:38 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Dave Hansen, pabeni, Ingo Molnar, Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin, X86 ML, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst, LKML On Wed, Nov 21, 2018 at 10:44 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Nov 21, 2018 at 10:26 AM Andy Lutomirski <luto@amacapital.net> wrote: > > > > Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory? > > I did that long ago. It's why we have CLAC/STAC today. I was told that > what I actually asked for (get an instruction to access user space - I > suggested using a segment override prefix) was not viable. > Maybe it wasn't viable *then*, but WRUSS really does write to userspace with user privilege. Surely the same mechanism with some of the weirdness removed would do what we want. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 17:27 ` Linus Torvalds 2018-11-21 18:04 ` Jens Axboe @ 2018-11-21 18:16 ` Linus Torvalds 2018-11-21 19:01 ` Linus Torvalds 2018-11-24 6:09 ` Jens Axboe 1 sibling, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 18:16 UTC (permalink / raw) To: pabeni Cc: Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing On Wed, Nov 21, 2018 at 9:27 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It would be interesting to know exactly which copy it is that matters > so much... *inlining* the erms case might show that nicely in > profiles. Side note: the fact that Jens' patch (which I don't like in that form) allegedly shrunk the resulting kernel binary would seem to indicate that there's a *lot* of compile-time constant-sized memcpy calls that we are missing, and that fall back to copy_user_generic(). It might be interesting to just change raw_copy_to/from_user() to handle a lot more cases (in particular, handle cases where 'size' is 8-byte aligned). The special cases we *do* have may not be the right ones (the 10-byte case in particular looks odd). For example, instead of having a "if constant size is 8 bytes, do one get/put_user()" case, we might have a "if constant size is < 64 just unroll it into get/put_user()" calls. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
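[Editor's note: the "unroll small constant sizes into get/put_user() calls" idea above can be sketched in plain userspace C as follows. The word-sized memcpy() calls stand in for the kernel's get_user()/put_user() pairs and their fault handling, and the function name is illustrative, not a kernel symbol.]

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the unrolling idea: copy in descending granule sizes
 * (long, int, short, byte) instead of calling the generic copy
 * routine.  In the kernel, each word move below would be a single
 * user access with an exception-table entry. */
size_t copy_small_sketch(void *dst, const void *src, size_t size)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (size >= sizeof(unsigned long)) {
		memcpy(d, s, sizeof(unsigned long));	/* get_user + put_user */
		d += sizeof(unsigned long);
		s += sizeof(unsigned long);
		size -= sizeof(unsigned long);
	}
	while (size >= sizeof(unsigned int)) {
		memcpy(d, s, sizeof(unsigned int));
		d += sizeof(unsigned int);
		s += sizeof(unsigned int);
		size -= sizeof(unsigned int);
	}
	while (size >= sizeof(unsigned short)) {
		memcpy(d, s, sizeof(unsigned short));
		d += sizeof(unsigned short);
		s += sizeof(unsigned short);
		size -= sizeof(unsigned short);
	}
	while (size) {
		*d++ = *s++;
		size--;
	}
	return 0;	/* the kernel version returns bytes left uncopied */
}
```

With a compile-time-constant size, the compiler can evaluate every loop bound and emit a straight-line sequence of word moves, which is what makes this competitive with the hand-coded special cases.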
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 18:16 ` Linus Torvalds @ 2018-11-21 19:01 ` Linus Torvalds 2018-11-22 10:32 ` Ingo Molnar 2018-11-24 6:09 ` Jens Axboe 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2018-11-21 19:01 UTC (permalink / raw) To: pabeni Cc: Jens Axboe, Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing [-- Attachment #1: Type: text/plain, Size: 1893 bytes --] On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It might be interesting to just change raw_copy_to/from_user() to > handle a lot more cases (in particular, handle cases where 'size' is > 8-byte aligned). The special cases we *do* have may not be the right > ones (the 10-byte case in particular looks odd). > > For example, instead of having a "if constant size is 8 bytes, do one > get/put_user()" case, we might have a "if constant size is < 64 just > unroll it into get/put_user()" calls. Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think the constant size cases ever trigger at all the way they are set up now. I do have a random patch that makes "unsafe_put_user()" actually use "asm goto" for the error case, and that, together with the attached patch seems to generate fairly nice code, but even then it would depend on gcc actually unrolling things (which we do *not* want in general). 
But for a 32-byte user copy (cp_old_stat), and that INLINE_COPY_TO_USER, it generates this: stac movl $32, %edx #, size movq %rsp, %rax #, src .L201: movq (%rax), %rcx # MEM[base: src_155, offset: 0B], MEM[base: src_155, offset: 0B] 1: movq %rcx,0(%rbp) # MEM[base: src_155, offset: 0B], MEM[(struct __large_struct *)dst_156] ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess" # addq $8, %rax #, src addq $8, %rbp #, statbuf subq $8, %rdx #, size jne .L201 #, clac which is actually fairly close to "optimal". Random patch (with my "asm goto" hack included) attached, in case people want to play with it. Impressively, it actually removes more lines of code than it adds. But I didn't actually check whether the end result *works*, so hey.. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 13387 bytes --] arch/x86/include/asm/uaccess.h | 96 +++++++++---------- arch/x86/include/asm/uaccess_64.h | 191 ++++++++++++++++++-------------------- fs/readdir.c | 22 +++-- 3 files changed, 149 insertions(+), 160 deletions(-) diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index b5e58cc0c5e7..3f4c89deb7a1 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -12,6 +12,9 @@ #include <asm/smap.h> #include <asm/extable.h> +#define INLINE_COPY_TO_USER +#define INLINE_COPY_FROM_USER + /* * The fs value determines whether argument validity checking should be * performed or not. 
If get_fs() == USER_DS, checking is performed, with @@ -189,19 +192,14 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) #ifdef CONFIG_X86_32 -#define __put_user_asm_u64(x, addr, err, errret) \ - asm volatile("\n" \ - "1: movl %%eax,0(%2)\n" \ - "2: movl %%edx,4(%2)\n" \ - "3:" \ - ".section .fixup,\"ax\"\n" \ - "4: movl %3,%0\n" \ - " jmp 3b\n" \ - ".previous\n" \ - _ASM_EXTABLE_UA(1b, 4b) \ - _ASM_EXTABLE_UA(2b, 4b) \ - : "=r" (err) \ - : "A" (x), "r" (addr), "i" (errret), "0" (err)) +#define __put_user_goto_u64(x, addr, label) \ + asm volatile("\n" \ + "1: movl %%eax,0(%2)\n" \ + "2: movl %%edx,4(%2)\n" \ + _ASM_EXTABLE_UA(1b, %2l) \ + _ASM_EXTABLE_UA(2b, %2l) \ + : : "A" (x), "r" (addr) \ + : : label) #define __put_user_asm_ex_u64(x, addr) \ asm volatile("\n" \ @@ -216,8 +214,8 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL)) asm volatile("call __put_user_8" : "=a" (__ret_pu) \ : "A" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx") #else -#define __put_user_asm_u64(x, ptr, retval, errret) \ - __put_user_asm(x, ptr, retval, "q", "", "er", errret) +#define __put_user_goto_u64(x, ptr, label) \ + __put_user_goto(x, ptr, "q", "", "er", label) #define __put_user_asm_ex_u64(x, addr) \ __put_user_asm_ex(x, addr, "q", "", "er") #define __put_user_x8(x, ptr, __ret_pu) __put_user_x(8, x, ptr, __ret_pu) @@ -278,23 +276,21 @@ extern void __put_user_8(void); __builtin_expect(__ret_pu, 0); \ }) -#define __put_user_size(x, ptr, size, retval, errret) \ +#define __put_user_size(x, ptr, size, label) \ do { \ - retval = 0; \ __chk_user_ptr(ptr); \ switch (size) { \ case 1: \ - __put_user_asm(x, ptr, retval, "b", "b", "iq", errret); \ + __put_user_goto(x, ptr, "b", "b", "iq", label); \ break; \ case 2: \ - __put_user_asm(x, ptr, retval, "w", "w", "ir", errret); \ + __put_user_goto(x, ptr, "w", "w", "ir", label); \ break; \ case 4: \ - __put_user_asm(x, ptr, retval, "l", "k", "ir", errret); \ + __put_user_goto(x, ptr, "l", "k", "ir", label); \ 
break; \ case 8: \ - __put_user_asm_u64((__typeof__(*ptr))(x), ptr, retval, \ - errret); \ + __put_user_goto_u64((__typeof__(*ptr))(x), ptr, label); \ break; \ default: \ __put_user_bad(); \ @@ -439,9 +435,12 @@ do { \ #define __put_user_nocheck(x, ptr, size) \ ({ \ - int __pu_err; \ + __label__ __pu_label; \ + int __pu_err = -EFAULT; \ __uaccess_begin(); \ - __put_user_size((x), (ptr), (size), __pu_err, -EFAULT); \ + __put_user_size((x), (ptr), (size), __pu_label); \ + __pu_err = 0; \ +__pu_label: \ __uaccess_end(); \ __builtin_expect(__pu_err, 0); \ }) @@ -466,17 +465,23 @@ struct __large_struct { unsigned long buf[100]; }; * we do not write to any memory gcc knows about, so there are no * aliasing issues. */ -#define __put_user_asm(x, addr, err, itype, rtype, ltype, errret) \ - asm volatile("\n" \ - "1: mov"itype" %"rtype"1,%2\n" \ - "2:\n" \ - ".section .fixup,\"ax\"\n" \ - "3: mov %3,%0\n" \ - " jmp 2b\n" \ - ".previous\n" \ - _ASM_EXTABLE_UA(1b, 3b) \ - : "=r"(err) \ - : ltype(x), "m" (__m(addr)), "i" (errret), "0" (err)) +#define __put_user_goto(x, addr, itype, rtype, ltype, label) \ + asm volatile goto("\n" \ + "1: mov"itype" %"rtype"0,%1\n" \ + _ASM_EXTABLE_UA(1b, %l2) \ + : : ltype(x), "m" (__m(addr)) \ + : : label) + +#define __put_user_failed(x, addr, itype, rtype, ltype, errret) \ + ({ __label__ __puflab; \ + int __pufret = errret; \ + __put_user_goto(x,addr,itype,rtype,ltype,__puflab); \ + __pufret = 0; \ + __puflab: __pufret; }) + +#define __put_user_asm(x, addr, retval, itype, rtype, ltype, errret) do { \ + retval = __put_user_failed(x, addr, itype, rtype, ltype, errret); \ +} while (0) #define __put_user_asm_ex(x, addr, itype, rtype, ltype) \ asm volatile("1: mov"itype" %"rtype"0,%1\n" \ @@ -687,12 +692,6 @@ extern struct movsl_mask { #define ARCH_HAS_NOCACHE_UACCESS 1 -#ifdef CONFIG_X86_32 -# include <asm/uaccess_32.h> -#else -# include <asm/uaccess_64.h> -#endif - /* * We rely on the nested NMI work to allow atomic faults from the NMI path; the * 
nested NMI paths are careful to preserve CR2. @@ -711,13 +710,8 @@ extern struct movsl_mask { #define user_access_begin() __uaccess_begin() #define user_access_end() __uaccess_end() -#define unsafe_put_user(x, ptr, err_label) \ -do { \ - int __pu_err; \ - __typeof__(*(ptr)) __pu_val = (x); \ - __put_user_size(__pu_val, (ptr), sizeof(*(ptr)), __pu_err, -EFAULT); \ - if (unlikely(__pu_err)) goto err_label; \ -} while (0) +#define unsafe_put_user(x, ptr, label) \ + __put_user_size((x), (ptr), sizeof(*(ptr)), label) #define unsafe_get_user(x, ptr, err_label) \ do { \ @@ -728,5 +722,11 @@ do { \ if (unlikely(__gu_err)) goto err_label; \ } while (0) +#ifdef CONFIG_X86_32 +# include <asm/uaccess_32.h> +#else +# include <asm/uaccess_64.h> +#endif + #endif /* _ASM_X86_UACCESS_H */ diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index a9d637bc301d..8dd85d2fce28 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -65,117 +65,104 @@ copy_to_user_mcsafe(void *to, const void *from, unsigned len) static __always_inline __must_check unsigned long raw_copy_from_user(void *dst, const void __user *src, unsigned long size) { - int ret = 0; - - if (!__builtin_constant_p(size)) - return copy_user_generic(dst, (__force void *)src, size); - switch (size) { - case 1: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u8 *)dst, (u8 __user *)src, - ret, "b", "b", "=q", 1); - __uaccess_end(); - return ret; - case 2: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u16 *)dst, (u16 __user *)src, - ret, "w", "w", "=r", 2); - __uaccess_end(); - return ret; - case 4: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u32 *)dst, (u32 __user *)src, - ret, "l", "k", "=r", 4); - __uaccess_end(); - return ret; - case 8: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 8); - __uaccess_end(); - return ret; - case 10: - __uaccess_begin_nospec(); - 
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 10); - if (likely(!ret)) - __get_user_asm_nozero(*(u16 *)(8 + (char *)dst), - (u16 __user *)(8 + (char __user *)src), - ret, "w", "w", "=r", 2); - __uaccess_end(); - return ret; - case 16: - __uaccess_begin_nospec(); - __get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src, - ret, "q", "", "=r", 16); - if (likely(!ret)) - __get_user_asm_nozero(*(u64 *)(8 + (char *)dst), - (u64 __user *)(8 + (char __user *)src), - ret, "q", "", "=r", 8); - __uaccess_end(); - return ret; - default: + if (!__builtin_constant_p(size) || size > 128) return copy_user_generic(dst, (__force void *)src, size); + + /* Small constant size: unroll */ + user_access_begin(); + while (size >= sizeof(unsigned long)) { + unsigned long val; + unsafe_get_user(val, (const unsigned long __user *) src, out); + *(unsigned long *)dst = val; + size -= sizeof(unsigned long); + src += sizeof(unsigned long); + dst += sizeof(unsigned long); } + + while (size >= sizeof(unsigned int)) { + unsigned int val; + unsafe_get_user(val, (const unsigned int __user *) src, out); + *(unsigned int *)dst = val; + size -= sizeof(unsigned int); + src += sizeof(unsigned int); + dst += sizeof(unsigned int); + } + + while (size >= sizeof(unsigned short)) { + unsigned short val; + unsafe_get_user(val, (const unsigned short __user *) src, out); + *(unsigned short *)dst = val; + size -= sizeof(unsigned short); + src += sizeof(unsigned short); + dst += sizeof(unsigned short); + } + + while (size >= sizeof(unsigned char)) { + unsigned char val; + unsafe_get_user(val, (const unsigned char __user *) src, out); + *(unsigned char *)dst = val; + size -= sizeof(unsigned char); + src += sizeof(unsigned char); + dst += sizeof(unsigned char); + } + user_access_end(); + return 0; + +out: + user_access_end(); + return size; } static __always_inline __must_check unsigned long raw_copy_to_user(void __user *dst, const void *src, unsigned long size) { - int ret = 0; - - if 
(!__builtin_constant_p(size)) - return copy_user_generic((__force void *)dst, src, size); - switch (size) { - case 1: - __uaccess_begin(); - __put_user_asm(*(u8 *)src, (u8 __user *)dst, - ret, "b", "b", "iq", 1); - __uaccess_end(); - return ret; - case 2: - __uaccess_begin(); - __put_user_asm(*(u16 *)src, (u16 __user *)dst, - ret, "w", "w", "ir", 2); - __uaccess_end(); - return ret; - case 4: - __uaccess_begin(); - __put_user_asm(*(u32 *)src, (u32 __user *)dst, - ret, "l", "k", "ir", 4); - __uaccess_end(); - return ret; - case 8: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 8); - __uaccess_end(); - return ret; - case 10: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 10); - if (likely(!ret)) { - asm("":::"memory"); - __put_user_asm(4[(u16 *)src], 4 + (u16 __user *)dst, - ret, "w", "w", "ir", 2); - } - __uaccess_end(); - return ret; - case 16: - __uaccess_begin(); - __put_user_asm(*(u64 *)src, (u64 __user *)dst, - ret, "q", "", "er", 16); - if (likely(!ret)) { - asm("":::"memory"); - __put_user_asm(1[(u64 *)src], 1 + (u64 __user *)dst, - ret, "q", "", "er", 8); - } - __uaccess_end(); - return ret; - default: + if (!__builtin_constant_p(size) || size > 128) return copy_user_generic((__force void *)dst, src, size); + + /* Small constant size: unroll */ + user_access_begin(); + while (size >= sizeof(unsigned long)) { + unsigned long val; + val = *(const unsigned long *)src; + unsafe_put_user(val, (unsigned long __user *) dst, out); + size -= sizeof(unsigned long); + src += sizeof(unsigned long); + dst += sizeof(unsigned long); + } + + while (size >= sizeof(unsigned int)) { + unsigned int val; + val = *(const unsigned int *)src; + unsafe_put_user(val, (unsigned int __user *) dst, out); + size -= sizeof(unsigned int); + src += sizeof(unsigned int); + dst += sizeof(unsigned int); + } + + while (size >= sizeof(unsigned short)) { + unsigned short val; + val = *(const unsigned short 
*)src; + unsafe_put_user(val, (unsigned short __user *) dst, out); + size -= sizeof(unsigned short); + src += sizeof(unsigned short); + dst += sizeof(unsigned short); + } + + while (size >= sizeof(unsigned char)) { + unsigned char val; + val = *(const unsigned char *)src; + unsafe_put_user(val, (unsigned char __user *) dst, out); + size -= sizeof(unsigned char); + src += sizeof(unsigned char); + dst += sizeof(unsigned char); } + + user_access_end(); + return 0; + +out: + user_access_end(); + return size; } static __always_inline __must_check diff --git a/fs/readdir.c b/fs/readdir.c index d97f548e6323..f1e159e17125 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -185,25 +185,27 @@ static int filldir(struct dir_context *ctx, const char *name, int namlen, if (dirent) { if (signal_pending(current)) return -EINTR; - if (__put_user(offset, &dirent->d_off)) - goto efault; } + + user_access_begin(); + if (dirent) + unsafe_put_user(offset, &dirent->d_off, efault_end); dirent = buf->current_dir; - if (__put_user(d_ino, &dirent->d_ino)) - goto efault; - if (__put_user(reclen, &dirent->d_reclen)) - goto efault; + unsafe_put_user(d_ino, &dirent->d_ino, efault_end); + unsafe_put_user(reclen, &dirent->d_reclen, efault_end); + unsafe_put_user(0, dirent->d_name + namlen, efault_end); + unsafe_put_user(d_type, (char __user *) dirent + reclen - 1, efault_end); + user_access_end(); + if (copy_to_user(dirent->d_name, name, namlen)) goto efault; - if (__put_user(0, dirent->d_name + namlen)) - goto efault; - if (__put_user(d_type, (char __user *) dirent + reclen - 1)) - goto efault; buf->previous = dirent; dirent = (void __user *)dirent + reclen; buf->current_dir = dirent; buf->count -= reclen; return 0; +efault_end: + user_access_end(); efault: buf->error = -EFAULT; return -EFAULT; ^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-21 19:01 ` Linus Torvalds @ 2018-11-22 10:32 ` Ingo Molnar 2018-11-22 11:13 ` Ingo Molnar 2018-11-22 16:55 ` Linus Torvalds 0 siblings, 2 replies; 39+ messages in thread From: Ingo Molnar @ 2018-11-22 10:32 UTC (permalink / raw) To: Linus Torvalds Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It might be interesting to just change raw_copy_to/from_user() to > > handle a lot more cases (in particular, handle cases where 'size' is > > 8-byte aligned). The special cases we *do* have may not be the right > > ones (the 10-byte case in particular looks odd). > > > > For example, instead of having a "if constant size is 8 bytes, do one > > get/put_user()" case, we might have a "if constant size is < 64 just > > unroll it into get/put_user()" calls. > > Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think > the constant size cases ever trigger at all the way they are set up > now. Side note, there's one artifact the patch exposes: some of the __builtin_constant_p() checks are imprecise and don't trigger at the early stage where GCC checks them, but the length is actually known to the compiler at later optimization stages. This means that with Jens's patch some of the length checks go away. I checked x86-64 defconfig and a distro config, and the numbers were ~7% and 10%, so not a big effect. 
The kernel text size reduction with Jens's patch is small but real: text data bss dec hex filename 19572694 11516934 19873888 50963516 309a43c vmlinux.before 19572468 11516934 19873888 50963290 309a35a vmlinux.after But I checked the disassembly, and it's not a real win, the new code is actually more complex than the old one, as expected, but GCC (7.3.0) does some particularly stupid things which bloat the generated code. > I do have a random patch that makes "unsafe_put_user()" actually use > "asm goto" for the error case, and that, together with the attached > patch seems to generate fairly nice code, but even then it would > depend on gcc actually unrolling things (which we do *not* want in > general). > > But for a 32-byte user copy (cp_old_stat), and that > INLINE_COPY_TO_USER, it generates this: > > stac > movl $32, %edx #, size > movq %rsp, %rax #, src > .L201: > movq (%rax), %rcx # MEM[base: src_155, offset: 0B], > MEM[base: src_155, offset: 0B] > 1: movq %rcx,0(%rbp) # MEM[base: src_155, offset: 0B], > MEM[(struct __large_struct *)dst_156] > ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess" # > > addq $8, %rax #, src > addq $8, %rbp #, statbuf > subq $8, %rdx #, size > jne .L201 #, > clac > > which is actually fairly close to "optimal". Yeah, that looks pretty sweet! > Random patch (with my "asm goto" hack included) attached, in case > people want to play with it. Doesn't even look all that hacky to me. Any hack in it that I didn't notice? :-) The only question is the inlining overhead - will try to measure that. > Impressively, it actually removes more lines of code than it adds. But > I didn't actually check whether the end result *works*, so hey.. 
Most of the linecount reduction appears to come from the simplification of the unroll loop and moving it into C, from a 6-way hard-coded copy routine: > - switch (size) { > - case 1: > - case 2: > - case 4: > - case 8: > - case 10: > - case 16: to a more flexible 4-way loop unrolling: > + while (size >= sizeof(unsigned long)) { > + while (size >= sizeof(unsigned int)) { > + while (size >= sizeof(unsigned short)) { > + while (size >= sizeof(unsigned char)) { Which is a nice improvement in itself. > + user_access_begin(); > + if (dirent) > + unsafe_put_user(offset, &dirent->d_off, efault_end); > dirent = buf->current_dir; > + unsafe_put_user(d_ino, &dirent->d_ino, efault_end); > + unsafe_put_user(reclen, &dirent->d_reclen, efault_end); > + unsafe_put_user(0, dirent->d_name + namlen, efault_end); > + unsafe_put_user(d_type, (char __user *) dirent + reclen - 1, efault_end); > + user_access_end(); > + > if (copy_to_user(dirent->d_name, name, namlen)) > goto efault; > buf->previous = dirent; > dirent = (void __user *)dirent + reclen; > buf->current_dir = dirent; > buf->count -= reclen; > return 0; > +efault_end: > + user_access_end(); > efault: > buf->error = -EFAULT; > return -EFAULT; In terms of high-level APIs, could we perhaps use the opportunity to introduce unsafe_write_user() instead, which would allow us to write it as: unsafe_write_user(&dirent->d_ino, d_ino, efault_end); unsafe_write_user(&dirent->d_reclen, reclen, efault_end); unsafe_write_user(dirent->d_name + namlen, 0, efault_end); unsafe_write_user((char __user *)dirent + reclen - 1, d_type, efault_end); if (copy_to_user(dirent->d_name, name, namlen)) goto efault; This gives it the regular 'VAR = VAL;' notation of C assignments, instead of the weird historical reverse notation that put_user()/get_user() use. 
Note how this newfangled ordering now matches the 'copy_to_user()' natural C-assignment parameter order that comes straight afterwards and makes it obvious that the d->name+namelen was writing the delimiter at the end. I think we even had bugs from put_user() ordering mixups? Or is it too late to try to fix this particular mistake? Thanks, Ingo ^ permalink raw reply [flat|nested] 39+ messages in thread
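[Editor's note: the unsafe_write_user() proposed above is just a thin argument-order wrapper over the existing primitive. A userspace mock of the idea follows; the goto-on-fault behavior is simulated with a NULL-pointer check, and none of these names are merged kernel API.]

```c
/* Userspace mock of unsafe_put_user(): the real kernel macro is an
 * asm-goto store whose exception-table entry jumps to 'label' on a
 * fault.  Here a NULL pointer stands in for a faulting user address. */
#define unsafe_put_user(x, ptr, label) \
	do { if (!(ptr)) goto label; *(ptr) = (x); } while (0)

/* The proposed wrapper: same primitive, but destination-first argument
 * order, so call sites read like a C assignment 'VAR = VAL;'. */
#define unsafe_write_user(ptr, x, label) \
	unsafe_put_user(x, ptr, label)

/* A filldir()-style user (field names mirror the example above). */
int fill_dirent_sketch(long *d_ino, long *d_off)
{
	unsafe_write_user(d_ino, 1234L, efault);
	unsafe_write_user(d_off, 42L, efault);
	return 0;
efault:
	return -14;	/* -EFAULT */
}
```

The call sites then read destination-first, like ordinary C assignments, which is exactly the ordering mix-up the proposal is trying to make impossible.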
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-22 10:32 ` Ingo Molnar @ 2018-11-22 11:13 ` Ingo Molnar 2018-11-22 11:21 ` Ingo Molnar 2018-11-23 16:40 ` Josh Poimboeuf 2018-11-22 16:55 ` Linus Torvalds 1 sibling, 2 replies; 39+ messages in thread From: Ingo Molnar @ 2018-11-22 11:13 UTC (permalink / raw) To: Linus Torvalds Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing * Ingo Molnar <mingo@kernel.org> wrote: > The kernel text size reduction with Jens's patch is small but real: > > text data bss dec hex filename > 19572694 11516934 19873888 50963516 309a43c vmlinux.before > 19572468 11516934 19873888 50963290 309a35a vmlinux.after > > But I checked the disassembly, and it's not a real win, the new code is > actually more complex than the old one, as expected, but GCC (7.3.0) does > some particularly stupid things which bloat the generated code. So I dug into this some more: 1) Firstly I tracked down GCC bloating the might_fault() checks and the related out-of-line code exception handling which bloats the full generated function. 2) But with even that complication eliminated, there's a size reduction when Jens's patch is applied, which is puzzling: 19563640 11516790 19882080 50962510 309a04e vmlinux.before 19563274 11516790 19882080 50962144 3099ee0 vmlinux.after but this is entirely due to the .altinstructions section being counted as 'text' part of the vmlinux - while in reality it's not: 3) The _real_ part of the vmlinux gets bloated by Jens's patch: ffffffff81000000 <_stext>: before: ffffffff81b0e5e0 <__clear_user> after: ffffffff81b0e670 <__clear_user>: I.e. we get a e5e0 => e670 bloat, as expected. 
In the config I tested a later section of the kernel image first aligns away the bloat: before: ffffffff82fa6321 <.altinstr_aux>: after: ffffffff82fa6321 <.altinstr_aux>: and then artificially debloats the modified kernel via the altinstructions section: before: Disassembly of section .exit.text: ffffffff83160798 <intel_uncore_exit> after: Disassembly of section .exit.text: ffffffff83160608 <intel_uncore_exit> Note that there's a third level of obfuscation here: Jens's patch actually *adds* a new altinstructions statement: + /* + * For smaller copies, don't use ERMS as it's slower. + */ + if (len < 128) { + alternative_call(copy_user_generic_unrolled, + copy_user_generic_string, X86_FEATURE_REP_GOOD, + ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from), + "=d" (len)), + "1" (to), "2" (from), "3" (len) + : "memory", "rcx", "r8", "r9", "r10", "r11"); + return ret; + } + /* * If CPU has ERMS feature, use copy_user_enhanced_fast_string. * Otherwise, if CPU has rep_good feature, use copy_user_generic_string. * Otherwise, use copy_user_generic_unrolled. */ alternative_call_2(copy_user_generic_unrolled, - copy_user_generic_string, - X86_FEATURE_REP_GOOD, - copy_user_enhanced_fast_string, - X86_FEATURE_ERMS, + copy_user_generic_string, X86_FEATURE_REP_GOOD, + copy_user_enhanced_fast_string, X86_FEATURE_ERMS, ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from), "=d" (len)), "1" (to), "2" (from), "3" (len) So how can this change possibly result in a *smaller* altinstructions section? 4) The reason is GCC's somewhat broken __builtin_constant_p() logic, which leaves ~10% of the constant call sites actually active, but which are then optimized by GCC's later stages, and the alternative_call_2() gets optimized out and replaced with the alternative_call() call. This is where Jens's patch 'debloats' the vmlinux and confuses the 'size' utility and gains its code reduction street cred. 
Note to self: watch out for patches that change altinstructions and don't make premature vmlinux size impact assumptions. :-) Thanks, Ingo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes 2018-11-22 11:13 ` Ingo Molnar @ 2018-11-22 11:21 ` Ingo Molnar 2018-11-23 16:40 ` Josh Poimboeuf 1 sibling, 0 replies; 39+ messages in thread From: Ingo Molnar @ 2018-11-22 11:21 UTC (permalink / raw) To: Linus Torvalds Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing * Ingo Molnar <mingo@kernel.org> wrote: > So I dug into this some more: > > 1) > > Firstly I tracked down GCC bloating the might_fault() checks and the > related out-of-line code exception handling which bloats the full > generated function. Sorry, I mis-remembered that detail when I wrote the email: it was CONFIG_HARDENED_USERCOPY=y and its object size checks that distros enable - and I disabled that option to simplify the size analysis. (might_fault() doesn't have inline conditionals so shouldn't have any effect on the generated code.) Thanks, Ingo ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 11:13 ` Ingo Molnar
  2018-11-22 11:21   ` Ingo Molnar
@ 2018-11-23 16:40   ` Josh Poimboeuf
  1 sibling, 0 replies; 39+ messages in thread
From: Josh Poimboeuf @ 2018-11-23 16:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar,
    bp, Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Andrew Lutomirski, Peter Zijlstra, dvlasenk, brgerst,
    Linux List Kernel Mailing

On Thu, Nov 22, 2018 at 12:13:41PM +0100, Ingo Molnar wrote:
> Note to self: watch out for patches that change altinstructions and
> don't make premature vmlinux size impact assumptions. :-)

I noticed a similar problem with ORC data.  As it turns out, size's
"text" calculation also includes read-only sections.  That includes
.rodata and anything else not writable.

Maybe we need a more sensible "size" script for the kernel.  It would
be trivial to implement based on the output of "readelf -S vmlinux".

-- 
Josh
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 10:32 ` Ingo Molnar
  2018-11-22 11:13   ` Ingo Molnar
@ 2018-11-22 16:55   ` Linus Torvalds
  2018-11-22 17:26     ` Andy Lutomirski
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2018-11-22 16:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin,
    the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski,
    Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing

On Thu, Nov 22, 2018 at 2:32 AM Ingo Molnar <mingo@kernel.org> wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > Random patch (with my "asm goto" hack included) attached, in case
> > people want to play with it.
>
> Doesn't even look all that hacky to me. Any hack in it that I didn't
> notice? :-)

The code to use asm goto sadly doesn't have any fallback at all for
the "no asm goto available" case.

I guess we're getting close to "we require asm goto support", but I
don't think we're there yet.

Also, while "unsafe_put_user()" has been converted to use asm goto
(and yes, it really does generate much nicer code), the same is not
true of "unsafe_get_user()".  Because sadly, gcc does not support asm
goto with output values.

So, realistically, my patch is not _technically_ hacky, but it's
simply not viable as things stand, and it's more of a tech
demonstration than anything else.

               Linus
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 16:55 ` Linus Torvalds
@ 2018-11-22 17:26   ` Andy Lutomirski
  2018-11-22 17:35     ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2018-11-22 17:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, H. Peter Anvin, X86 ML, Andrew Morton,
    Andrew Lutomirski, Peter Zijlstra, Denys Vlasenko, Brian Gerst, LKML

On Thu, Nov 22, 2018 at 8:56 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Nov 22, 2018 at 2:32 AM Ingo Molnar <mingo@kernel.org> wrote:
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > Random patch (with my "asm goto" hack included) attached, in case
> > > people want to play with it.
> >
> > Doesn't even look all that hacky to me. Any hack in it that I didn't
> > notice? :-)
>
> The code to use asm goto sadly doesn't have any fallback at all for
> the "no asm goto available".
>
> I guess we're getting close to "we require asm goto support", but I
> don't think we're there yet.

commit e501ce957a786ecd076ea0cfb10b114e6e4d0f40
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jan 17 11:42:07 2018 +0100

    x86: Force asm-goto

    We want to start using asm-goto to guarantee the absence of dynamic
    branches (and thus speculation).

    A primary prerequisite for this is of course that the compiler
    supports asm-goto.  This effectively lifts the minimum GCC version
    to build an x86 kernel to gcc-4.5.

This is basically the only good outcome from the speculation crap as
far as I'm concerned :)

So I think your patch is viable.  Also, with that patch applied,
put_user_ex() should become worse than worthless -- if gcc is any
good, plain old:

    if (unsafe_put_user(...) != 0) goto err;
    if (unsafe_put_user(...) != 0) goto err;

etc. will generate *better* code than a series of put_user_ex() calls.
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-22 17:26 ` Andy Lutomirski
@ 2018-11-22 17:35   ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2018-11-22 17:35 UTC (permalink / raw)
To: Andrew Lutomirski
Cc: Ingo Molnar, pabeni, Jens Axboe, Thomas Gleixner, Ingo Molnar, bp,
    Peter Anvin, the arch/x86 maintainers, Andrew Morton,
    Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing

On Thu, Nov 22, 2018 at 9:26 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> So I think your patch is viable. Also, with that patch applied,
> put_user_ex() should become worse than worthless

Yes.  I hate those special-case _ex variants.

I guess I should just properly forward-port my patch series where the
different steps are separated out (not jumbled up like that patch I
actually posted).

               Linus
* Re: [PATCH] x86: only use ERMS for user copies for larger sizes
  2018-11-21 18:16 ` Linus Torvalds
  2018-11-21 19:01   ` Linus Torvalds
@ 2018-11-24  6:09   ` Jens Axboe
  1 sibling, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2018-11-24 6:09 UTC (permalink / raw)
To: Linus Torvalds, pabeni
Cc: Ingo Molnar, Thomas Gleixner, Ingo Molnar, bp, Peter Anvin,
    the arch/x86 maintainers, Andrew Morton, Andrew Lutomirski,
    Peter Zijlstra, dvlasenk, brgerst, Linux List Kernel Mailing

On 11/21/18 11:16 AM, Linus Torvalds wrote:
> On Wed, Nov 21, 2018 at 9:27 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> It would be interesting to know exactly which copy it is that matters
>> so much... *inlining* the erms case might show that nicely in
>> profiles.
>
> Side note: the fact that Jens' patch (which I don't like in that form)
> allegedly shrunk the resulting kernel binary would seem to indicate
> that there's a *lot* of compile-time constant-sized memcpy calls that
> we are missing, and that fall back to copy_user_generic().

Other kind of side note... This also affects memset(), which does rep
stosb for a memset of any size if we have ERMS.  I noticed this from
sg_init_table(), which does a memset of the table.  For my kind of
testing, the entry size is small.  The below, too, reduces memset()
overhead by 50% here for me.

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9bc861c71e75..bad0fdb9ddcd 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -60,6 +60,8 @@ EXPORT_SYMBOL(__memset)
  * rax	original destination
  */
 ENTRY(memset_erms)
+	cmpl $128,%edx
+	jb memset_orig
 	movq %rdi,%r9
 	movb %sil,%al
 	movq %rdx,%rcx

-- 
Jens Axboe
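The dispatch Jens's memset_64.S hunk adds can be sketched in portable user-space C: fills below the cutoff skip the ERMS path (rep stosb), whose fixed startup cost dominates small operations, and take an open-coded routine instead. Everything here is a stand-in — the 128-byte cutoff mirrors the `cmpl $128,%edx` in the patch, the byte loop stands in for memset_orig, and libc memset() stands in for the rep stosb path:

```c
#include <assert.h>
#include <string.h>

#define ERMS_CUTOFF 128			/* mirrors "cmpl $128,%edx" */

static void *memset_orig_sketch(void *s, int c, size_t n)
{
	unsigned char *p = s;

	while (n--)			/* byte loop: stand-in for the
					   open-coded memset_orig */
		*p++ = (unsigned char)c;
	return s;
}

static void *memset_erms_sketch(void *s, int c, size_t n)
{
	return memset(s, c, n);		/* stand-in for rep stosb */
}

static void *memset_dispatch(void *s, int c, size_t n)
{
	if (n < ERMS_CUTOFF)		/* "jb memset_orig" */
		return memset_orig_sketch(s, c, n);
	return memset_erms_sketch(s, c, n);
}
```

The interesting part of the real patch is only the two added instructions: the threshold check runs before rep stosb is ever issued, so small fills never pay its startup latency.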
end of thread, other threads:[~2019-01-08 18:01 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <02bfc577-32a5-66be-64bf-d476b7d447d2@kernel.dk>
2018-11-20 20:24 ` [PATCH] x86: only use ERMS for user copies for larger sizes Jens Axboe
2018-11-21  6:36 ` Ingo Molnar
2018-11-21 13:32 ` Jens Axboe
2018-11-21 13:44 ` Denys Vlasenko
2018-11-22 17:36 ` David Laight
2018-11-22 17:52 ` Linus Torvalds
2018-11-22 18:06 ` Andy Lutomirski
2018-11-22 18:58 ` Linus Torvalds
2018-11-23  9:34 ` David Laight
2018-11-23 10:12 ` David Laight
2018-11-23 16:36 ` Linus Torvalds
2018-11-23 17:42 ` Linus Torvalds
2018-11-23 18:39 ` Andy Lutomirski
2018-11-23 18:44 ` Linus Torvalds
2018-11-23 19:11 ` Andy Lutomirski
2018-11-26 10:12 ` David Laight
2018-11-26 10:01 ` David Laight
2018-11-26 10:26 ` David Laight
2019-01-05  2:38 ` Linus Torvalds
2019-01-07  9:55 ` David Laight
2019-01-07 17:43 ` Linus Torvalds
2019-01-08  9:10 ` David Laight
2019-01-08 18:01 ` Linus Torvalds
2018-11-21 13:45 ` Paolo Abeni
2018-11-21 17:27 ` Linus Torvalds
2018-11-21 18:04 ` Jens Axboe
2018-11-21 18:26 ` Andy Lutomirski
2018-11-21 18:43 ` Linus Torvalds
2018-11-21 22:38 ` Andy Lutomirski
2018-11-21 18:16 ` Linus Torvalds
2018-11-21 19:01 ` Linus Torvalds
2018-11-22 10:32 ` Ingo Molnar
2018-11-22 11:13 ` Ingo Molnar
2018-11-22 11:21 ` Ingo Molnar
2018-11-23 16:40 ` Josh Poimboeuf
2018-11-22 16:55 ` Linus Torvalds
2018-11-22 17:26 ` Andy Lutomirski
2018-11-22 17:35 ` Linus Torvalds
2018-11-24  6:09 ` Jens Axboe