From mboxrd@z Thu Jan 1 00:00:00 1970 From: trong@android.com (Tri Vo) Date: Mon, 6 Aug 2018 14:54:32 -0700 Subject: [PATCH] arm64: build CONFIG_ARM64_LSE_ATOMICS without -ffixed- flags In-Reply-To: References: <20180802231753.86336-1-trong@android.com> Message-ID: To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org On Sat, Aug 4, 2018 at 12:21 AM, Ard Biesheuvel wrote: > On 4 August 2018 at 03:05, Tri Vo wrote: >> On Fri, Aug 3, 2018 at 12:34 PM, Ard Biesheuvel >> wrote: >>> On 3 August 2018 at 21:13, Ard Biesheuvel wrote: >>>> On 3 August 2018 at 21:08, Tri Vo wrote: >>>>> On Fri, Aug 3, 2018 at 10:27 AM, Ard Biesheuvel >>>>> wrote: >>>>>> On 3 August 2018 at 19:17, Nick Desaulniers wrote: >>>>>>> On Fri, Aug 3, 2018 at 1:43 AM Ard Biesheuvel wrote: >>>>>>>> >>>>>>>> On 3 August 2018 at 01:17, Tri Vo wrote: >>>>>>>> > Remove -ffixed- compile flags as they are not required for correct >>>>>>>> > behavior of out-of-line ll/sc implementations of atomic functions. >>>>>>>> > Registers (except x0 which contains return value) only need to be >>>>>>>> > callee-saved, not necessarily fixed. >>>>>>>> > >>>>>>>> > For readability purposes, represent "callee-saves-all-registers" calling >>>>>>>> > convention with a combination of -fcall-used- and -fcall-saved- flags only. >>>>>>>> > >>>>>>>> > Signed-off-by: Tri Vo >>>>>>>> >>>>>>>> Wouldn't this permit the compiler to stack/unstack these registers in >>>>>>>> the implementations of the out of line LL/SC fallbacks. >>>>> >>>>> Yes. However, the number of registers that get stacked/unstacked >>>>> should remain the same since compiler should only preserve registers >>>>> that are used by callee (assuming callee saves all). >>>>>>> >>>>>>> I'm curious to understand more about how this LSE patching works, and >>>>>>> what the constraints exactly are. Are there some docs I should read >>>>>>> that you can point me to? I see >>>>>>> arch/arm64/include/asm/atomic_ll_sc.h, but are those are the in line >>>>>>> implementations? Does that make arch/arm64/include/asm/atomic.h the >>>>>>> out of line definitions? What are the restrictions around preventing >>>>>>> the compiler to push/pop those registers? >>>>>>> >>>>>> >>>>>> The idea is that LSE atomics can each be patched into a bl instruction >>>>>> that calls the appropriate LL/SC fallback, but without the overhead of >>>>>> an ordinary function call. (actually the bl is there by default, but >>>>>> patched into LSE ops on h/w that supports it) >>>>>> >>>>>> So that is why the 'bl' is in an asm() block, and why x30 is used as a >>>>>> temporary/clobber in the LSE sequences: the function call is >>>>>> completely hidden from the compiler. Obviously, for performance, you >>>>>> want to prevent these fallbacks from touching the stack. >>>>> >>>>> The function call is hidden from the compiler on the caller side >>>>> using 'bl' in the asm block. I suspect that is done to make >>>>> alternative instructions the same byte length (constraints of dynamic >>>>> patching). However, on the callee side, LL/SC fallbacks >>>>> (in atomic_ll_sc.h) are implemented as C functions. So the compiler >>>>> still handles the function call details and register allocation >>>>> within those functions. >>>>> >>>>> Since the caller doesn't save any registers, callee stacks/unstacks >>>>> all the registers it uses. So, for example, LL/SC fallback >>>>> implementation of atomic_add preserves 2 registers with or without >>>>> this change. >>>>> >>>> >>>> Did you look at the disassembly? >> >> Yes, I did. > > 0000000000000000 <__ll_sc_atomic_add>: > 0: f9800031 prfm pstl1strm, [x1] > 4: 885f7c31 ldxr w17, [x1] > 8: 0b000231 add w17, w17, w0 > c: 88107c31 stxr w16, w17, [x1] > 10: 35ffffb0 cbnz w16, 4 <__ll_sc_atomic_add+0x4> > 14: d65f03c0 ret > > Now where does it preserve 2 registers? I tested the change on a Hikey 4.9 kernel. The disassembly there looks like this: 0000000000000000 <__ll_sc_atomic_add>: 0: a9be7bfd stp x29, x30, [sp, #-32]! 4: 910003fd mov x29, sp 8: a90127e8 stp x8, x9, [sp, #16] c: 2a0003e9 mov w9, w0 10: aa0103e8 mov x8, x1 14: aa1e03e0 mov x0, x30 18: 94000000 bl 0 <_mcount> 1c: f9800111 prfm pstl1strm, [x8] 20: 885f7d00 ldxr w0, [x8] 24: 0b090000 add w0, w0, w9 28: 88107d00 stxr w16, w0, [x8] 2c: 35ffffb0 cbnz w16, 20 <__ll_sc_atomic_add+0x20> 30: a94127e8 ldp x8, x9, [sp, #16] 34: a8c27bfd ldp x29, x30, [sp], #32 38: d65f03c0 ret 3c: d503201f nop Excluding FP, SP, there are 2 registers x8, x9 preserved. > >>> >>> Actually, I only just spotted that you only modify x1 .. x7 and that >>> the other registers already use -fcall-saved-xNN >>> >>> Will should comment on why the distinction exists in the first place. >>> Note that a strict reading of the GCC man page suggests that using >>> -fcall-saved-x8 is a bad idea, given that it is the indirect return >>> register. However, given the limited scope of the option here, that >>> probably does not matter in practice. >>> >>> So I guess you need this patch for Clang? Would be better to mention >>> that in the commit log if this is the case. >> >> Yes, I am planning to fold whatever compiler support is needed for LSE >> atomics into "callee-saves-all-registers" calling convention (without >> ffixed flags). I'll then implement that calling convention in Clang. >> I'll mention that in the commit message. > > Not *all* registers. x16 and x17 are excluded in this case, and I am > not sure whether you would to encode that in an option. > > What is wrong with using -ffixed or -fcall-saved? Looking at your disassembly snippet, usage of -ffixed flags makes sense to me now. They prevent argument registers from being unnecessarily stacked/unstacked if compiler uses x16, x17 (which don't need to be saved) for allocation. This patch will indeed cause a performance regression, although not on all build environments. Sorry for the confusion.