* [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
@ 2023-10-10 16:42 Uros Bizjak
  2023-10-10 17:32 ` Linus Torvalds
  0 siblings, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2023-10-10 16:42 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Uros Bizjak, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Linus Torvalds, Peter Zijlstra,
      Thomas Gleixner, Josh Poimboeuf

Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
better optimizations, such as setting an appropriate base to compute
the address instead of using an add instruction.

E.g.: address calculation in amd_pmu_enable_virt() improves from:

    48 c7 c0 00 00 00 00    mov    $0x0,%rax
            87b7: R_X86_64_32S     cpu_hw_events
    65 48 03 05 00 00 00    add    %gs:0x0(%rip),%rax
    00
            87bf: R_X86_64_PC32    this_cpu_off-0x4
    48 c7 80 28 13 00 00    movq   $0x0,0x1328(%rax)
    00 00 00 00

to:

    65 48 8b 05 00 00 00    mov    %gs:0x0(%rip),%rax
    00
            8798: R_X86_64_PC32    this_cpu_off-0x4
    48 c7 80 00 00 00 00    movq   $0x0,0x0(%rax)
    00 00 00 00
            87a6: R_X86_64_32S     cpu_hw_events+0x1328

Co-developed-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/include/asm/percpu.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..cdc188279c5a 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,6 +49,19 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+/*
+ * Efficient implementation for cases in which the compiler supports
+ * named address spaces.  Allows the compiler to perform additional
+ * optimizations that can save more instructions.
+ */
+#define arch_raw_cpu_ptr(ptr)					\
+({								\
+	unsigned long tcp_ptr__;				\
+	tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
+})
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
@@ -61,6 +74,8 @@
 	: "m" (__my_cpu_var(this_cpu_off)), "0" (ptr));		\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
 })
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""
-- 
2.41.0

^ permalink raw reply related	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 16:42 [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() Uros Bizjak
@ 2023-10-10 17:32 ` Linus Torvalds
  2023-10-10 18:22   ` Uros Bizjak
  0 siblings, 1 reply; 106+ messages in thread
From: Linus Torvalds @ 2023-10-10 17:32 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
> better optimizations, such as setting an appropriate base to compute
> the address instead of an add instruction.

Hmm. I wonder..

> +	tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \

Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
Threadripper machine), 'rdgsbase' seems to be basically two cycles.

I wonder if we'd be better off using that, rather than doing the load.

Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
in that sense. The rdgsbase instruction *probably* will end up only
decoding in the first decoder etc. But we're talking single-cycle kind
of effects, and the rdgsbase case should be much better from a cache
perspective and might use fewer memory pipeline resources to offset
the fact that it uses an unusual front end decoder resource...

Yes, yes, we'd have to make it an alternative, and do something like

  static __always_inline unsigned long new_cpu_offset(void)
  {
	unsigned long res;
	asm(ALTERNATIVE(
		"movq %%gs:this_cpu_off,%0",
		"rdgsbase %0",
		X86_FEATURE_FSGSBASE)
	    : "=r" (res));
	return res;
  }

but it still seems fairly straightforward. Hmm?

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 17:32 ` Linus Torvalds
@ 2023-10-10 18:22   ` Uros Bizjak
  2023-10-10 18:25     ` Nadav Amit
  2023-10-10 18:37     ` Linus Torvalds
  0 siblings, 2 replies; 106+ messages in thread
From: Uros Bizjak @ 2023-10-10 18:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Tue, Oct 10, 2023 at 7:32 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
> > better optimizations, such as setting an appropriate base to compute
> > the address instead of an add instruction.
>
> Hmm. I wonder..
>
> > +	tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
>
> Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

Please note that besides propagation of the addition into the address,
the patch also exposes the memory load to the compiler, in the
anticipation that the compiler CSEs the load from this_cpu_off across
its eventual multiple uses. For this to work, we have to get rid of
the asms. It is important that the compiler knows that this is a
memory load, so it can also apply other compiler magic to it.

BTW: A follow-up patch will also use __raw_cpu_read to implement
this_cpu_read_stable. We can then read "const aliased" current_task to
CSE the load even more, something similar to [1].

[1] https://lore.kernel.org/lkml/20190823224424.15296-8-namit@vmware.com/

Uros.

> On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
> Threadripper machine), 'rdgsbase' seems to be basically two cycles.
>
> I wonder if we'd be better off using that, rather than doing the load.
>
> Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
> in that sense. The rdgsbase instruction *probably* will end up only
> decoding in the first decoder etc. But we're talking single-cycle kind
> of effects, and the rdgsbase case should be much better from a cache
> perspective and might use fewer memory pipeline resources to offset
> the fact that it uses an unusual front end decoder resource...
>
> Yes, yes, we'd have to make it an alternative, and do something like
>
>   static __always_inline unsigned long new_cpu_offset(void)
>   {
> 	unsigned long res;
> 	asm(ALTERNATIVE(
> 		"movq %%gs:this_cpu_off,%0",
> 		"rdgsbase %0",
> 		X86_FEATURE_FSGSBASE)
> 	    : "=r" (res));
> 	return res;
>   }
>
> but it still seems fairly straightforward. Hmm?
>
>               Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 18:22 ` Uros Bizjak
@ 2023-10-10 18:25   ` Nadav Amit
  2023-10-10 18:42     ` Linus Torvalds
  0 siblings, 1 reply; 106+ messages in thread
From: Nadav Amit @ 2023-10-10 18:25 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Linus Torvalds, the arch/x86 maintainers,
      Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

> On Oct 10, 2023, at 9:22 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Tue, Oct 10, 2023 at 7:32 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>
>>> Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
>>> better optimizations, such as setting an appropriate base to compute
>>> the address instead of an add instruction.
>>
>> Hmm. I wonder..
>>
>>> +	tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
>>
>> Do we really even want to use __raw_cpu_read(this_cpu_off) at all?
>
> Please note that besides propagation of the addition into the address,
> the patch also exposes the memory load to the compiler, in the
> anticipation that the compiler CSEs the load from this_cpu_off across
> its eventual multiple uses. For this to work, we have to get rid of
> the asms. It is important that the compiler knows that this is a
> memory load, so it can also apply other compiler magic to it.
>
> BTW: A follow-up patch will also use __raw_cpu_read to implement
> this_cpu_read_stable. We can then read "const aliased" current_task to
> CSE the load even more, something similar to [1].

I was just writing the same thing. :)

As a minor note, the proposed assembly version seems to be missing
__FORCE_ORDER as an input argument to prevent reordering past
preempt_enable() and preempt_disable(). But that's really not the main
point.

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 18:25 ` Nadav Amit
@ 2023-10-10 18:42   ` Linus Torvalds
  0 siblings, 0 replies; 106+ messages in thread
From: Linus Torvalds @ 2023-10-10 18:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Uros Bizjak, the arch/x86 maintainers,
      Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Tue, 10 Oct 2023 at 11:25, Nadav Amit <namit@vmware.com> wrote:
>
> As a minor note, the proposed assembly version seems to be missing
> __FORCE_ORDER as an input argument to prevent reordering past
> preempt_enable() and preempt_disable(). But that's really not the main
> point.

Hmm. No, that probably *is* the main point - see my reply to Uros: the
CSE on the inline asm itself gets rid of the duplication.

And yes, we currently rely on that asm CSE for doing 'current' and not
reloading the value all the time.

So yes, we'd like to have a barrier for not moving it across the
preemption barriers, and __FORCE_ORDER would seem to be a good way to
do that.

I really suspect that 'rdgsbase' is better than a memory load in
practice, but I have no numbers to back that up, apart from a "it's
not a slow instruction, and core CPU is generally better than memory".

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 18:22 ` Uros Bizjak
  2023-10-10 18:25   ` Nadav Amit
@ 2023-10-10 18:37   ` Linus Torvalds
  2023-10-10 18:41     ` Uros Bizjak
  2023-10-11  7:41     ` Nadav Amit
  1 sibling, 2 replies; 106+ messages in thread
From: Linus Torvalds @ 2023-10-10 18:37 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Please note that besides propagation of the addition into the address,
> the patch also exposes the memory load to the compiler, in the
> anticipation that the compiler CSEs the load from this_cpu_off across
> its eventual multiple uses. For this to work, we have to get rid of
> the asms.

I actually checked that the inline asm gets combined, the same way the
this_cpu_read_stable cases do (which we use for 'current').

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 18:37 ` Linus Torvalds
@ 2023-10-10 18:41   ` Uros Bizjak
  2023-10-10 18:52     ` Linus Torvalds
  1 sibling, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2023-10-10 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Tue, Oct 10, 2023 at 8:38 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > Please note that besides propagation of the addition into the address,
> > the patch also exposes the memory load to the compiler, in the
> > anticipation that the compiler CSEs the load from this_cpu_off across
> > its eventual multiple uses. For this to work, we have to get rid of
> > the asms.
>
> I actually checked that the inline asm gets combined, the same way the
> this_cpu_read_stable cases do (which we use for 'current').

Yes, but does it CSE the load from multiple addresses?

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 18:41 ` Uros Bizjak
@ 2023-10-10 18:52   ` Linus Torvalds
  2023-10-11  7:27     ` Uros Bizjak
  2023-10-11 18:42     ` Uros Bizjak
  0 siblings, 2 replies; 106+ messages in thread
From: Linus Torvalds @ 2023-10-10 18:52 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Yes, but does it CSE the load from multiple addresses?

Yes, it should do that just right, because the *asm* itself is
identical, just the offsets (that gcc then adds separately) would be
different.

This is not unlike how we depend on gcc CSE'ing the "current" part
when doing multiple accesses of different members off that:

  static __always_inline struct task_struct *get_current(void)
  {
	return this_cpu_read_stable(pcpu_hot.current_task);
  }

with this_cpu_read_stable() being an inline asm that lacks the memory
component (the same way the fallback hides it by just using
"%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
a memory access to gcc).

Of course, I think that with the "__seg_gs" patches, we *could* expose
the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
the alias analysis on that side and see that it can CSE the thing
anyway.

That might be a better choice than __FORCE_ORDER, in fact.

IOW, something like

  static __always_inline unsigned long new_cpu_offset(void)
  {
	unsigned long res;
	asm(ALTERNATIVE(
		"movq " __percpu_arg(1) ",%0",
		"rdgsbase %0",
		X86_FEATURE_FSGSBASE)
	    : "=r" (res)
	    : "m" (this_cpu_off));
	return res;
  }

would presumably work together with your __seg_gs stuff.

UNTESTED!!

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 18:52 ` Linus Torvalds
@ 2023-10-11  7:27   ` Uros Bizjak
  2023-10-11  7:45     ` Uros Bizjak
  1 sibling, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2023-10-11 07:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Tue, Oct 10, 2023 at 8:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > Yes, but does it CSE the load from multiple addresses?
>
> Yes, it should do that just right, because the *asm* itself is
> identical, just the offsets (that gcc then adds separately) would be
> different.

Indeed. To illustrate the question with an example, foo() and bar()
should compile to the same assembly, and there should be only one read
from m resp. n:

--cut here--
__seg_gs int m;

int foo (void)
{
  return m + m;
}

int n;

static inline int get (int *m)
{
  int res;

  asm ("mov %%gs:%1, %0" : "=r"(res) : "m"(*m));
  return res;
}

int bar (void)
{
  return get (&n) + get (&n);
}
--cut here--

And they do:

0000000000000000 <foo>:
   0:	65 8b 05 00 00 00 00 	mov    %gs:0x0(%rip),%eax        # 7 <foo+0x7>
   7:	01 c0                	add    %eax,%eax
   9:	c3                   	retq

0000000000000010 <bar>:
  10:	65 8b 05 00 00 00 00 	mov    %gs:0x0(%rip),%eax        # 17 <bar+0x7>
  17:	01 c0                	add    %eax,%eax
  19:	c3                   	retq

> This is not unlike how we depend on gcc CSE'ing the "current" part
> when doing multiple accesses of different members off that:
>
>   static __always_inline struct task_struct *get_current(void)
>   {
> 	return this_cpu_read_stable(pcpu_hot.current_task);
>   }
>
> with this_cpu_read_stable() being an inline asm that lacks the memory
> component (the same way the fallback hides it by just using
> "%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
> a memory access to gcc).
>
> Of course, I think that with the "__seg_gs" patches, we *could* expose
> the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
> the alias analysis on that side and see that it can CSE the thing
> anyway.
>
> That might be a better choice than __FORCE_ORDER, in fact.
>
> IOW, something like
>
>   static __always_inline unsigned long new_cpu_offset(void)
>   {
> 	unsigned long res;
> 	asm(ALTERNATIVE(
> 		"movq " __percpu_arg(1) ",%0",
> 		"rdgsbase %0",
> 		X86_FEATURE_FSGSBASE)
> 	    : "=r" (res)
> 	    : "m" (this_cpu_off));
> 	return res;
>   }
>
> would presumably work together with your __seg_gs stuff.

I have zero experience with the rdgsbase insn, but the above is not
dependent on __seg_gs, so (the movq part at least) would also work in
the current mainline. To work together with the __seg_gs stuff,
this_cpu_off should be enclosed in __my_cpu_var. Also, if rdgsbase is
substituted with rdfsbase, it will also work for 32-bit targets.

Uros.

> UNTESTED!!
>
> Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-11  7:27 ` Uros Bizjak
@ 2023-10-11  7:45   ` Uros Bizjak
  2023-10-11 19:40     ` Linus Torvalds
  0 siblings, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2023-10-11 07:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Wed, Oct 11, 2023 at 9:27 AM Uros Bizjak <ubizjak@gmail.com> wrote:

> > IOW, something like
> >
> >   static __always_inline unsigned long new_cpu_offset(void)
> >   {
> > 	unsigned long res;
> > 	asm(ALTERNATIVE(
> > 		"movq " __percpu_arg(1) ",%0",
> > 		"rdgsbase %0",
> > 		X86_FEATURE_FSGSBASE)
> > 	    : "=r" (res)
> > 	    : "m" (this_cpu_off));
> > 	return res;
> >   }
> >
> > would presumably work together with your __seg_gs stuff.
>
> I have zero experience with the rdgsbase insn, but the above is not
> dependent on __seg_gs, so (the movq part at least) would also work in
> the current mainline. To work together with the __seg_gs stuff,
> this_cpu_off should be enclosed in __my_cpu_var. Also, if rdgsbase is
> substituted with rdfsbase, it will also work for 32-bit targets.

In fact, rdgsbase is available only for 64-bit targets.

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-11  7:45 ` Uros Bizjak
@ 2023-10-11 19:40   ` Linus Torvalds
  0 siblings, 0 replies; 106+ messages in thread
From: Linus Torvalds @ 2023-10-11 19:40 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Wed, 11 Oct 2023 at 00:45, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> In fact, rdgsbase is available only for 64-bit targets.

Not even all 64-bit targets.

That's why I did that ALTERNATIVE() thing with X86_FEATURE_FSGSBASE,
which uses the kernel instruction re-writing.

So that suggested asm of mine defaults to loading the value from
memory through %gs, but with X86_FEATURE_FSGSBASE it gets rewritten to
use rdgsbase.

And again - I'm not sure it's any faster, but it's _potentially_
certainly better in that it doesn't use the cache-line.

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-10 18:52 ` Linus Torvalds
  2023-10-11  7:27   ` Uros Bizjak
@ 2023-10-11 18:42   ` Uros Bizjak
  2023-10-11 19:51     ` Linus Torvalds
  1 sibling, 1 reply; 106+ messages in thread
From: Uros Bizjak @ 2023-10-11 18:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

[-- Attachment #1: Type: text/plain, Size: 2431 bytes --]

On Tue, Oct 10, 2023 at 8:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, I think that with the "__seg_gs" patches, we *could* expose
> the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
> the alias analysis on that side and see that it can CSE the thing
> anyway.
>
> That might be a better choice than __FORCE_ORDER, in fact.
>
> IOW, something like
>
>   static __always_inline unsigned long new_cpu_offset(void)
>   {
> 	unsigned long res;
> 	asm(ALTERNATIVE(
> 		"movq " __percpu_arg(1) ",%0",
> 		"rdgsbase %0",
> 		X86_FEATURE_FSGSBASE)
> 	    : "=r" (res)
> 	    : "m" (this_cpu_off));
> 	return res;
>   }
>
> would presumably work together with your __seg_gs stuff.
>
> UNTESTED!!

The attached patch was tested on a target with fsgsbase CPUID and
without it. It works!

The patch improves amd_pmu_enable_virt() in the same way as reported
in the original patch submission and also reduces the number of percpu
offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
1571.

The only drawback is a larger binary size:

   text	   data	    bss	    dec	    hex	filename
25546594	4387686	 808452	30742732	1d518cc	vmlinux-new.o
25515256	4387814	 808452	30711522	1d49ee2	vmlinux-old.o

which increases by 31k (0.123%), probably due to 1578 rdgsbase
alternatives.

I'll prepare and submit a patch for the tip/percpu branch.

Uros.

> Linus

[-- Attachment #2: cpu_ptr-mainline.diff.txt --]
[-- Type: text/plain, Size: 1493 bytes --]

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 34734d730463..8450fe4a2753 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -31,18 +31,32 @@
 #define __percpu_prefix		"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
-/*
- * Compared to the generic __my_cpu_offset version, the following
- * saves one instruction and avoids clobbering a temp register.
- */
-#define arch_raw_cpu_ptr(ptr)				\
-({							\
-	unsigned long tcp_ptr__;			\
-	asm ("add " __percpu_arg(1) ", %0"		\
-	     : "=r" (tcp_ptr__)				\
-	     : "m" (this_cpu_off), "0" (ptr));		\
-	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
+#ifdef CONFIG_X86_64
+#define arch_raw_cpu_ptr(ptr)				\
+({							\
+	unsigned long tcp_ptr__;			\
+	asm (ALTERNATIVE("movq " __percpu_arg(1) ", %0", \
+			 "rdgsbase %0",			\
+			 X86_FEATURE_FSGSBASE)		\
+	     : "=r" (tcp_ptr__)				\
+	     : "m" (this_cpu_off));			\
+							\
+	tcp_ptr__ += (unsigned long)(ptr);		\
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
 })
+#else /* CONFIG_X86_64 */
+#define arch_raw_cpu_ptr(ptr)				\
+({							\
+	unsigned long tcp_ptr__;			\
+	asm ("movl " __percpu_arg(1) ", %0"		\
+	     : "=r" (tcp_ptr__)				\
+	     : "m" (this_cpu_off));			\
+							\
+	tcp_ptr__ += (unsigned long)(ptr);		\
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
+})
+#endif /* CONFIG_X86_64 */
+
 #else
 #define __percpu_prefix		""
 #endif

^ permalink raw reply related	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-11 18:42 ` Uros Bizjak
@ 2023-10-11 19:51   ` Linus Torvalds
  2023-10-11 19:52     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 106+ messages in thread
From: Linus Torvalds @ 2023-10-11 19:51 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Wed, 11 Oct 2023 at 11:42, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> The attached patch was tested on a target with fsgsbase CPUID and
> without it. It works!

.. I should clearly read all my emails before answering some of them.

Yes, that patch looks good to me, and I'm happy to hear that you
actually tested it unlike my "maybe something like this".

> The patch improves amd_pmu_enable_virt() in the same way as reported
> in the original patch submission and also reduces the number of percpu
> offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
> 1571.

Dio y ou have any actka performance numbers? The patch looks good to
me, and I *think* rdgsbase ends up being faster in practice due to
avoiding a memory access, but that's very much a gut feel.

> The only drawback is a larger binary size:
>
>    text	   data	    bss	    dec	    hex	filename
> 25546594	4387686	 808452	30742732	1d518cc	vmlinux-new.o
> 25515256	4387814	 808452	30711522	1d49ee2	vmlinux-old.o
>
> which increases by 31k (0.123%), probably due to 1578 rdgsbase
> alternatives.

I'm actually surprised that it increases the text size. The 'rdgsbase'
instruction should be smaller than a 'mov %gs', so I would have
expected the *data* size to increase due to the alternatives tables,
but not the text size.

[ Looks around ]

Oh. It's because we put the altinstructions into the text section.
That's kind of silly, but whatever.

So I think that increase in text-size is not "real" - yes, it
increases our binary size because we obviously have two instructions,
but the actual *executable* part likely stays the same, and it's just
that we grow the altinstruction metadata.

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-11 19:51 ` Linus Torvalds
@ 2023-10-11 19:52   ` Linus Torvalds
  0 siblings, 0 replies; 106+ messages in thread
From: Linus Torvalds @ 2023-10-11 19:52 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Wed, 11 Oct 2023 at 12:51, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Dio y ou have any actka performance numbers?

And before anybody worries - no, I didn't have a stroke in the middle
of writing that. I'm just not a great typist ;)

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-11 19:51 ` Linus Torvalds
  2023-10-11 19:52   ` Linus Torvalds
@ 2023-10-11 20:00   ` Uros Bizjak
  0 siblings, 0 replies; 106+ messages in thread
From: Uros Bizjak @ 2023-10-11 20:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst,
      Denys Vlasenko, H. Peter Anvin, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On Wed, Oct 11, 2023 at 9:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 11 Oct 2023 at 11:42, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > The attached patch was tested on a target with fsgsbase CPUID and
> > without it. It works!
>
> .. I should clearly read all my emails before answering some of them.
>
> Yes, that patch looks good to me, and I'm happy to hear that you
> actually tested it unlike my "maybe something like this".
>
> > The patch improves amd_pmu_enable_virt() in the same way as reported
> > in the original patch submission and also reduces the number of percpu
> > offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
> > 1571.
>
> Dio y ou have any actka performance numbers? The patch looks good to
> me, and I *think* rdgsbase ends up being faster in practice due to
> avoiding a memory access, but that's very much a gut feel.

Unfortunately, I don't have any perf numbers, only those from Agner's
instruction tables. Memory access performance has so many parameters
that gut feeling is the only thing besides real case-by-case
measurements. The rule of thumb in the compiler world is also that
memory access should be avoided.

Uros.

> > The only drawback is a larger binary size:
> >
> >    text	   data	    bss	    dec	    hex	filename
> > 25546594	4387686	 808452	30742732	1d518cc	vmlinux-new.o
> > 25515256	4387814	 808452	30711522	1d49ee2	vmlinux-old.o
> >
> > which increases by 31k (0.123%), probably due to 1578 rdgsbase
> > alternatives.
>
> I'm actually surprised that it increases the text size. The 'rdgsbase'
> instruction should be smaller than a 'mov %gs', so I would have
> expected the *data* size to increase due to the alternatives tables,
> but not the text size.
>
> [ Looks around ]
>
> Oh. It's because we put the altinstructions into the text section.
> That's kind of silly, but whatever.
>
> So I think that increase in text-size is not "real" - yes, it
> increases our binary size because we obviously have two instructions,
> but the actual *executable* part likely stays the same, and it's just
> that we grow the altinstruction metadata.
>
> Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-11 19:51 ` Linus Torvalds
  2023-10-11 19:52   ` Linus Torvalds
  2023-10-11 20:00   ` Uros Bizjak
@ 2023-10-11 22:37   ` Ingo Molnar
  2023-10-11 23:15     ` H. Peter Anvin
  2 siblings, 1 reply; 106+ messages in thread
From: Ingo Molnar @ 2023-10-11 22:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski,
      Brian Gerst, Denys Vlasenko, H. Peter Anvin, Peter Zijlstra,
      Thomas Gleixner, Josh Poimboeuf

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > The only drawback is a larger binary size:
> >
> >    text	   data	    bss	    dec	    hex	filename
> > 25546594	4387686	 808452	30742732	1d518cc	vmlinux-new.o
> > 25515256	4387814	 808452	30711522	1d49ee2	vmlinux-old.o
> >
> > which increases by 31k (0.123%), probably due to 1578 rdgsbase
> > alternatives.
>
> I'm actually surprised that it increases the text size. The 'rdgsbase'
> instruction should be smaller than a 'mov %gs', so I would have
> expected the *data* size to increase due to the alternatives tables,
> but not the text size.
>
> [ Looks around ]
>
> Oh. It's because we put the altinstructions into the text section.
> That's kind of silly, but whatever.

Yeah, we should probably move .altinstructions from init-text to
.init.data or so? It contains a bunch of other sections too that don't
get executed directly ... and in fact has some non-code data structures
too, such as ... ".apicdrivers". :-/

I suspect people put all that into .text because it was the easiest
place to modify in the x86 linker script, and linker scripts are
arguably scary.

Will check all this.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()
  2023-10-11 22:37 ` Ingo Molnar
@ 2023-10-11 23:15   ` H. Peter Anvin
  2023-10-12  1:35     ` Josh Poimboeuf
  0 siblings, 1 reply; 106+ messages in thread
From: H. Peter Anvin @ 2023-10-11 23:15 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds
  Cc: Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski,
      Brian Gerst, Denys Vlasenko, Peter Zijlstra, Thomas Gleixner,
      Josh Poimboeuf

On 10/11/23 15:37, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>>> The only drawback is a larger binary size:
>>>
>>>    text	   data	    bss	    dec	    hex	filename
>>> 25546594	4387686	 808452	30742732	1d518cc	vmlinux-new.o
>>> 25515256	4387814	 808452	30711522	1d49ee2	vmlinux-old.o
>>>
>>> which increases by 31k (0.123%), probably due to 1578 rdgsbase
>>> alternatives.
>>
>> I'm actually surprised that it increases the text size. The 'rdgsbase'
>> instruction should be smaller than a 'mov %gs', so I would have
>> expected the *data* size to increase due to the alternatives tables,
>> but not the text size.
>>
>> [ Looks around ]
>>
>> Oh. It's because we put the altinstructions into the text section.
>> That's kind of silly, but whatever.
>
> Yeah, we should probably move .altinstructions from init-text to
> .init.data or so? It contains a bunch of other sections too that don't
> get executed directly ... and in fact has some non-code data structures
> too, such as ... ".apicdrivers". :-/
>
> I suspect people put all that into .text because it was the easiest
> place to modify in the x86 linker script, and linker scripts are
> arguably scary.

Well, it's more than that; "size" considers all non-writable sections
to be "text".

	-hpa

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-11 23:15 ` H. Peter Anvin @ 2023-10-12 1:35 ` Josh Poimboeuf 2023-10-12 6:19 ` Ingo Molnar 0 siblings, 1 reply; 106+ messages in thread From: Josh Poimboeuf @ 2023-10-12 1:35 UTC (permalink / raw) To: H. Peter Anvin Cc: Ingo Molnar, Linus Torvalds, Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst, Denys Vlasenko, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf On Wed, Oct 11, 2023 at 04:15:15PM -0700, H. Peter Anvin wrote: > On 10/11/23 15:37, Ingo Molnar wrote: > > > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > The only drawback is a larger binary size: > > > > > > > > text data bss dec hex filename > > > > 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o > > > > 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o > > > > > > > > that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives. > > > > > > I'm actually surprised that it increases the text size. The 'rdgsbase' > > > instruction should be smaller than a 'mov %gs', so I would have > > > expected the *data* size to increase due to the alternatives tables, > > > but not the text size. > > > > > > [ Looks around ] > > > > > > Oh. It's because we put the altinstructions into the text section. > > > That's kind of silly, but whatever. > > > > Yeah, we should probably move .altinstructions from init-text to .init.data > > or so? Contains a bunch of other sections too that don't get executed > > directly ... and in fact has some non-code data structures too, such as ... > > ".apicdrivers". :-/ > > > > I suspect people put all that into .text because it was the easiest place > > to modify in the x86 linker script, and linker scripts are arguably scary. > > > > Well, it's more than that; "size" considers all non-writable sections to be > "text". 
Indeed, I added a printf to "size", it shows that all the following sections are "text": .text .pci_fixup .tracedata __ksymtab __ksymtab_gpl __ksymtab_strings __init_rodata __param __ex_table .notes .orc_header .orc_unwind_ip .orc_unwind .init.text .altinstr_aux .x86_cpu_dev.init .parainstructions .retpoline_sites .return_sites .call_sites .altinstructions .altinstr_replacement .exit.text .smp_locks I can't fathom why it doesn't just filter based on the EXECINSTR section flag. "size" is probably worse than useless, as many of these sections can change size rather arbitrarily, especially .orc_* and .*_sites. I can't help but wonder how many hasty optimizations have been made over the years based on the sketchy output of this tool. It should be trivial to replace the use of "size" with our own "text_size" script which does what we want, e.g., filter on EXECINSTR. Here are the current EXECINSTR sections: ~/git/binutils-gdb/binutils $ readelf -WS /tmp/vmlinux |grep X [ 1] .text PROGBITS ffffffff81000000 200000 1200000 00 AX 0 0 4096 [21] .init.text PROGBITS ffffffff833b7000 27b7000 091b50 00 AX 0 0 16 [22] .altinstr_aux PROGBITS ffffffff83448b50 2848b50 00176a 00 AX 0 0 1 [30] .altinstr_replacement PROGBITS ffffffff8372661a 2b2661a 0028b9 00 AX 0 0 1 [32] .exit.text PROGBITS ffffffff83728f10 2b28f10 0030c7 00 AX 0 0 16 As Ingo mentioned, we could make .altinstr_replacement non-executable. That confuses objtool, but I think we could remedy that pretty easily. Though, another problem is that .text has a crazy amount of padding which makes it always the same size, due to the SRSO alias mitigation alignment linker magic. We should fix that somehow. -- Josh ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 1:35 ` Josh Poimboeuf @ 2023-10-12 6:19 ` Ingo Molnar 2023-10-12 16:08 ` Josh Poimboeuf 0 siblings, 1 reply; 106+ messages in thread From: Ingo Molnar @ 2023-10-12 6:19 UTC (permalink / raw) To: Josh Poimboeuf Cc: H. Peter Anvin, Linus Torvalds, Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst, Denys Vlasenko, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > Though, another problem is that .text has a crazy amount of padding > which makes it always the same size, due to the SRSO alias mitigation > alignment linker magic. We should fix that somehow. We could emit a non-aligned end-of-text symbol (we might have it already), and have a script or small .c program in scripts/ or tools/ that looks at vmlinux and displays a user-friendly and accurate list of text and data sizes in the kernel? And since objtool is technically an 'object files tool', and it already looks at sections & symbols, it could also grow a: objtool size <objfile> command that does the sane thing ... I'd definitely start using that, instead of 'size'. /me runs :-) Thanks, Ingo ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 6:19 ` Ingo Molnar @ 2023-10-12 16:08 ` Josh Poimboeuf 2023-10-12 17:59 ` Ingo Molnar 0 siblings, 1 reply; 106+ messages in thread From: Josh Poimboeuf @ 2023-10-12 16:08 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Linus Torvalds, Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst, Denys Vlasenko, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote: > > * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > Though, another problem is that .text has a crazy amount of padding > > which makes it always the same size, due to the SRSO alias mitigation > > alignment linker magic. We should fix that somehow. > > We could emit a non-aligned end-of-text symbol (we might have it already), > and have a script or small .c program in scripts/ or tools/ that looks > at vmlinux and displays a user-friendly and accurate list of text and > data sizes in the kernel? > > And since objtool is technically an 'object files tool', and it already > looks at sections & symbols, it could also grow a: > > objtool size <objfile> > > command that does the sane thing ... I'd definitely start using that, instead of 'size'. > > /me runs :-) Yeah, that's actually not a bad idea. I had been thinking a "simple" script would be fine, but I'm realizing the scope of this thing could grow over time. In which case a script is less than ideal. And objtool already has the ability to do this pretty easily. -- Josh ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 16:08 ` Josh Poimboeuf @ 2023-10-12 17:59 ` Ingo Molnar 2023-10-12 21:30 ` Josh Poimboeuf 0 siblings, 1 reply; 106+ messages in thread From: Ingo Molnar @ 2023-10-12 17:59 UTC (permalink / raw) To: Josh Poimboeuf Cc: H. Peter Anvin, Linus Torvalds, Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst, Denys Vlasenko, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote: > > > > * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > > > Though, another problem is that .text has a crazy amount of padding > > > which makes it always the same size, due to the SRSO alias mitigation > > > alignment linker magic. We should fix that somehow. > > > > We could emit a non-aligned end-of-text symbol (we might have it already), > > and have a script or small .c program in scripts/ or tools/ that looks > > at vmlinux and displays a user-friendly and accurate list of text and > > data sizes in the kernel? > > > > And since objtool is technically an 'object files tool', and it already > > looks at sections & symbols, it could also grow a: > > > > objtool size <objfile> > > > > command that does the sane thing ... I'd definitely start using that, instead of 'size'. > > > > /me runs :-) > > Yeah, that's actually not a bad idea. > > I had been thinking a "simple" script would be fine, but I'm realizing > the scope of this thing could grow over time. In which case a script is > less than ideal. And objtool already has the ability to do this pretty > easily. 
Yeah, and speed actually matters here: I have scripts that generate object comparisons between commits, and every second of runtime counts - and a script would be slower and more fragile for something like allmodconfig builds or larger distro configs.
BTW., maybe the right objtool subcommand would be 'objtool sections', with an 'objtool sections size' sub-sub-command. Because I think this discussion shows that it would be good to have a bit of visibility into the sanity of our sections setup, with 'objtool sections check' for example doing a sanity check on whether there's anything extra in the text section that shouldn't be there? Or so ... Thanks, Ingo ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 17:59 ` Ingo Molnar @ 2023-10-12 21:30 ` Josh Poimboeuf 2023-10-13 10:52 ` Ingo Molnar 0 siblings, 1 reply; 106+ messages in thread From: Josh Poimboeuf @ 2023-10-12 21:30 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Linus Torvalds, Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst, Denys Vlasenko, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf On Thu, Oct 12, 2023 at 07:59:43PM +0200, Ingo Molnar wrote: > > * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote: > > > > > > * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > > > > > > Though, another problem is that .text has a crazy amount of padding > > > > which makes it always the same size, due to the SRSO alias mitigation > > > > alignment linker magic. We should fix that somehow. > > > > > > We could emit a non-aligned end-of-text symbol (we might have it already), > > > and have a script or small .c program in scripts/ or tools/ that looks > > > at vmlinux and displays a user-friendly and accurate list of text and > > > data sizes in the kernel? > > > > > > And since objtool is technically an 'object files tool', and it already > > > looks at sections & symbols, it could also grow a: > > > > > > objtool size <objfile> > > > > > > command that does the sane thing ... I'd definitely start using that, instead of 'size'. > > > > > > /me runs :-) > > > > Yeah, that's actually not a bad idea. > > > > I had been thinking a "simple" script would be fine, but I'm realizing > > the scope of this thing could grow over time. In which case a script is > > less than ideal. And objtool already has the ability to do this pretty > > easily. 
> > Yeah, and speed actually matters here: I have scripts that generate object > comparisons between commits, and every second of runtime counts - and a > script would be slower and more fragile for something like allmodconfig > builds or larger disto configs. Ah, good to know. > BTW., maybe the right objtool subcommand would be 'objtool sections', with > an 'objtool sections size' sub-sub-command. Because I think this discussion > shows that it would be good to have a bit of visibility into the sanity of > our sections setup, with 'objtool sections check' for example doing a > sanity check on whether there's anything extra in the text section that > shouldn't be there? Or so ... What would be an example of something "extra"? A sanity check might fit better alongside the other checks already being done by the main objtool "subcommand" which gets run by the kernel build. BTW, I actually removed subcommands a while ago when I overhauled objtool's interface to make it easier to combine options. That said, I'm not opposed to re-adding them if we can find a sane way to do so. 
Here's the current interface: Usage: objtool <actions> [<options>] file.o Actions: -h, --hacks[=<jump_label,noinstr,skylake>] patch toolchain bugs/limitations -i, --ibt validate and annotate IBT -l, --sls validate straight-line-speculation mitigations -m, --mcount annotate mcount/fentry calls for ftrace -n, --noinstr validate noinstr rules -o, --orc generate ORC metadata -r, --retpoline validate and annotate retpoline usage -s, --stackval validate frame pointer rules -t, --static-call annotate static calls -u, --uaccess validate uaccess rules for SMAP --cfi annotate kernel control flow integrity (kCFI) function preambles --dump[=<orc>] dump metadata --prefix <n> generate prefix symbols --rethunk validate and annotate rethunk usage --unret validate entry unret placement Options: -v, --verbose verbose warnings --backtrace unwind on error --backup create .orig files before modification --dry-run don't write modifications --link object is a linked object --mnop nop out mcount call sites --module object is part of a kernel module --no-unreachable skip 'unreachable instruction' warnings --sec-address print section addresses in warnings --stats print statistics Note how all the actions can be easily combined in a single execution instance. If we re-added subcommands, most of the existing functionality would be part of a single subcommand. It used to be called "check", but it's no longer a read-only operation so that's misleading. I'll call it "run" for now. Right now my preference would be to leave the existing interface as-is, and then graft optional subcommands on top. If no subcommand is specified then it would default to the "run" subcommand. It's a little funky, but it would work well for the common case, where ~99% of the functionality lives. And it doesn't break existing setups and backports. 
For example:

# current interface (no changes)
objtool --mcount --orc --retpoline --uaccess vmlinux.o

# same, with optional explicit "run" subcommand
objtool run --mcount --orc --retpoline --uaccess vmlinux.o

# new "size" subcommand
objtool size [options] vmlinux.o.before vmlinux.o.after

-- Josh ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 21:30 ` Josh Poimboeuf @ 2023-10-13 10:52 ` Ingo Molnar 0 siblings, 0 replies; 106+ messages in thread From: Ingo Molnar @ 2023-10-13 10:52 UTC (permalink / raw) To: Josh Poimboeuf Cc: H. Peter Anvin, Linus Torvalds, Uros Bizjak, x86, linux-kernel, Nadav Amit, Andy Lutomirski, Brian Gerst, Denys Vlasenko, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf * Josh Poimboeuf <jpoimboe@kernel.org> wrote: > Right now my preference would be to leave the existing interface as-is, > and then graft optional subcommands on top. If no subcommand is > specified then it would default to the "run" subcommand. It's a little > funky, but it would work well for the common case, where ~99% of the > functionality lives. And it doesn't break existing setups and > backports. > > For example: > > # current interface (no changes) > objtool --mcount --orc --retpoline --uaccess vmlinux.o > > # same, with optional explicit "run" subcommand > objtool run --mcount --orc --retpoline --uaccess vmlinux.o > > # new "size" subcommand > obtool size [options] vmlinux.o.before vmlinux.o.after Yeah, sounds good! Thanks, Ingo ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-10 18:37 ` Linus Torvalds 2023-10-10 18:41 ` Uros Bizjak @ 2023-10-11 7:41 ` Nadav Amit 2023-10-11 19:37 ` Linus Torvalds 1 sibling, 1 reply; 106+ messages in thread From: Nadav Amit @ 2023-10-11 7:41 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf > On Oct 10, 2023, at 9:37 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > !! External Email > > On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <ubizjak@gmail.com> wrote: >> >> Please note that besides propagation of the addition into address, the >> patch also exposes memory load to the compiler, with the anticipation >> that the compiler CSEs the load from this_cpu_off from eventual >> multiple addresses. For this to work, we have to get rid of the asms. > > I actually checked that the inline asm gets combined, the same way the > this_cpu_read_stable cases do (which we use for 'current’) You are correct. Having said that, for “current" we may be able to do something better, as regardless to preemption “current" remains the same, and this_cpu_read_stable() does miss some opportunities to avoid reloading the value from memory. I proposed a solution before, but I am not sure it would work properly with LTO. I guess Uros would know better. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-11 7:41 ` Nadav Amit @ 2023-10-11 19:37 ` Linus Torvalds 2023-10-11 21:32 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-11 19:37 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf On Wed, 11 Oct 2023 at 00:42, Nadav Amit <namit@vmware.com> wrote: > > You are correct. Having said that, for “current" we may be able to do something > better, as regardless to preemption “current" remains the same, and > this_cpu_read_stable() does miss some opportunities to avoid reloading the > value from memory. It would be lovely to generate even better code, but that this_cpu_read_stable() thing is the best we've come up with. It intentionally has *no* memory inputs or anything else that might make gcc think "I need to re-do this". For example, instead of using "m" as a memory input, it very intentionally uses "p", to make it clear that it just uses the _pointer_, not the memory location itself. That's obviously a lie - it actually does access memory - but it's a lie exactly because of the reason you mention: even when the memory location changes due to preemption (or explicit scheduling), it always changes back to the value we care about. So gcc _should_ be able to CSE it in all situations, but it's entirely possible that gcc then decides to re-generate the value for whatever reason. It's a cheap op, so it's ok to regen, of course, but the intent is basically to let the compiler re-use the value as much as possible. But it *is* probably better to regenerate the value than it would be to spill and re-load it, and from the cases I've seen, this all tends to work fairly well. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-11 19:37 ` Linus Torvalds @ 2023-10-11 21:32 ` Uros Bizjak 2023-10-11 21:54 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-11 21:32 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf [-- Attachment #1: Type: text/plain, Size: 2149 bytes --] On Wed, Oct 11, 2023 at 9:37 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 11 Oct 2023 at 00:42, Nadav Amit <namit@vmware.com> wrote: > > > > You are correct. Having said that, for “current" we may be able to do something > > better, as regardless to preemption “current" remains the same, and > > this_cpu_read_stable() does miss some opportunities to avoid reloading the > > value from memory. > > It would be lovely to generate even better code, but that > this_cpu_read_stable() thing is the best we've come up with. It > intentionally has *no* memory inputs or anything else that might make > gcc think "I need to re-do this". The attached patch makes this_cpu_read_stable a bit better by using rip-relative addressing. Immediate reduction of text section by 4kB and also makes the kernel some more PIE friendly. > For example, instead of using "m" as a memory input, it very > intentionally uses "p", to make it clear that that it just uses the > _pointer_, not the memory location itself. > > That's obviously a lie - it actually does access memory - but it's a > lie exactly because of the reason you mention: even when the memory > location changes due to preemption (or explicit scheduling), it always > changes back to the the value we care about. > > So gcc _should_ be able to CSE it in all situations, but it's entirely > possible that gcc then decides to re-generate the value for whatever > reason. 
It's a cheap op, so it's ok to regen, of course, but the > intent is basically to let the compiler re-use the value as much as > possible. > > But it *is* probably better to regenerate the value than it would be > to spill and re-load it, and from the cases I've seen, this all tends > to work fairly well. Reading the above, it looks to me that we don't want to play games with "const aliased" versions of current_task [1], as proposed by Nadav in his patch series. The current version of this_cpu_read_stable() (plus the attached trivial patch) is as good as it can get. [1] https://lore.kernel.org/lkml/20190823224424.15296-8-namit@vmware.com/ Uros. [-- Attachment #2: p.diff.txt --] [-- Type: text/plain, Size: 765 bytes --] diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index e047a0bc5554..b74169434b85 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -4,8 +4,10 @@ #ifdef CONFIG_X86_64 #define __percpu_seg gs +#define __percpu_rip "(%%rip)" #else #define __percpu_seg fs +#define __percpu_rip "" #endif #ifdef __ASSEMBLY__ @@ -85,7 +87,7 @@ #define __my_cpu_ptr(ptr) (__my_cpu_type(*ptr) *)(uintptr_t)(ptr) #define __my_cpu_var(var) (*__my_cpu_ptr(&var)) #define __percpu_arg(x) __percpu_prefix "%" #x -#define __force_percpu_arg(x) __force_percpu_prefix "%" #x +#define __force_percpu_arg(x) __force_percpu_prefix "%" #x __percpu_rip /* * Initialized pointers to per-cpu variables needed for the boot ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-11 21:32 ` Uros Bizjak @ 2023-10-11 21:54 ` Linus Torvalds 2023-10-12 15:19 ` Nadav Amit 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-11 21:54 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <ubizjak@gmail.com> wrote: > > Reading the above, it looks to me that we don't want to play games > with "const aliased" versions of current_task [1], as proposed by > Nadav in his patch series. Well, maybe I'd like it if I saw what the effect of it was, but that patch mentions "sync_mm_rss()" which doesn't actually exist (SPLIT_RSS_COUNTING is never defined, the split version is gone and hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats into percpu_counter") I'm not sure why gcc used to get code generation wrong there, and I don't see what could be improved in the current implementation of 'current_task', but I guess there could be things that could be done to make gcc more likely to CSE the cases.. IOW, I don't understand why Nadav's patch would improve gcc code gen. It presumably depends on some gcc internal issue (ie "gcc just happens to be better at optimization X"). Which is obviously a valid thing in general. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-11 21:54 ` Linus Torvalds @ 2023-10-12 15:19 ` Nadav Amit 2023-10-12 16:33 ` Uros Bizjak ` (2 more replies) 0 siblings, 3 replies; 106+ messages in thread From: Nadav Amit @ 2023-10-12 15:19 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > !! External Email > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <ubizjak@gmail.com> wrote: >> >> Reading the above, it looks to me that we don't want to play games >> with "const aliased" versions of current_task [1], as proposed by >> Nadav in his patch series. > > Well, maybe I'd like it if I saw what the effect of it was, but that > patch mentions "sync_mm_rss()" which doesn't actually exist > (SPLIT_RSS_COUNTING is never defined, the split version is gone and > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats > into percpu_counter") So I added a new version of the current aliasing (well, actually pcpu_hot in the new version) on top of Uros’s patches, and the effect can be seen in many functions. 
I don’t want to bother with many examples so here is a common and simple one. Currently syscall_exit_work() starts with:

0xffffffff8111e120 <+0>: push %rbp
0xffffffff8111e121 <+1>: mov %rdi,%rbp
0xffffffff8111e124 <+4>: push %rbx
0xffffffff8111e125 <+5>: mov %rsi,%rbx
0xffffffff8111e128 <+8>: and $0x20,%esi
0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)

Using the const-alias changes the beginning of syscall_exit_work() to:

0xffffffff8111cb80 <+0>: push %r12
0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
0xffffffff8111cb8a <+10>: push %rbp
0xffffffff8111cb8b <+11>: mov %rdi,%rbp
0xffffffff8111cb8e <+14>: push %rbx
0xffffffff8111cb8f <+15>: mov %rsi,%rbx
0xffffffff8111cb92 <+18>: and $0x20,%esi
0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)

So we both see RIP-relative addressing being used (hence the instruction is one byte shorter) and the reload going away. Now, I am not a compiler expert as for the rationale, but googling around I can see Nick explaining the rationale [1] - if you use “p” you read memory. BTW: It is related to a discussion you had [2], in which you encountered an issue I also encountered before [3]. My bad for pushing it in. Anyhow, I created similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 ) to show this behavior - how compiler barriers cause a reload. It seems that this behavior happens on GCC and CLANG across various versions.
The idea behind the patch is that it communicates - in the compilation unit granularity - that current is fixed. There is an issue of whether it works with LTO, which I have never checked. [1] https://reviews.llvm.org/D145416 [2] https://lore.kernel.org/lkml/20230306120106.GE1267364@hirez.programming.kicks-ass.net/ [3] https://lore.kernel.org/all/20190823224424.15296-5-namit@vmware.com/ -- Here’s the updated patch - but I didn’t really boot a machine with it so new issues might have come since my last patch-set: -- >8 -- Date: Thu, 12 Oct 2023 06:02:03 -0700 Subject: [PATCH] Const current --- arch/x86/include/asm/current.h | 17 ++++++++++++++++- arch/x86/kernel/cpu/common.c | 4 ++++ include/linux/compiler.h | 2 +- 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index a1168e7b69e5..d05fbb6a8bd7 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -36,9 +36,24 @@ static_assert(sizeof(struct pcpu_hot) == 64); DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot); +/* + * Hold a constant alias for current_task, which would allow to avoid caching of + * current task. + * + * We must mark const_current_task with the segment qualifiers, as otherwise gcc + * would do redundant reads of const_current_task. + */ +DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot); + static __always_inline struct task_struct *get_current(void) { - return this_cpu_read_stable(pcpu_hot.current_task); + + /* + * GCC is missing functionality of removing segment qualifiers, which + * messes with per-cpu infrastructure that holds local copies. Use + * __raw_cpu_read to avoid holding any copy. 
+ */ + return __raw_cpu_read(, const_pcpu_hot.current_task); } #define current get_current() diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 382d4e6b848d..94590af11388 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2052,6 +2052,10 @@ DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = { }; EXPORT_PER_CPU_SYMBOL(pcpu_hot); +DECLARE_PER_CPU_ALIGNED(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot) + __attribute__((alias("pcpu_hot"))); +EXPORT_PER_CPU_SYMBOL(const_pcpu_hot); + #ifdef CONFIG_X86_64 DEFINE_PER_CPU_FIRST(struct fixed_percpu_data, fixed_percpu_data) __aligned(PAGE_SIZE) __visible; diff --git a/include/linux/compiler.h b/include/linux/compiler.h index d7779a18b24f..e7059292085e 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -212,7 +212,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, */ #define ___ADDRESSABLE(sym, __attrs) \ static void * __used __attrs \ - __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)&sym; + __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)(uintptr_t)&sym; #define __ADDRESSABLE(sym) \ ___ADDRESSABLE(sym, __section(".discard.addressable")) -- 2.25.1 ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 15:19 ` Nadav Amit @ 2023-10-12 16:33 ` Uros Bizjak 2023-10-12 16:55 ` Uros Bizjak 2023-10-12 16:56 ` [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() Linus Torvalds 2023-10-12 17:16 ` Linus Torvalds 2023-10-16 18:52 ` Uros Bizjak 2 siblings, 2 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-12 16:33 UTC (permalink / raw) To: Nadav Amit Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <namit@vmware.com> wrote: > > > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > !! External Email > > > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <ubizjak@gmail.com> wrote: > >> > >> Reading the above, it looks to me that we don't want to play games > >> with "const aliased" versions of current_task [1], as proposed by > >> Nadav in his patch series. > > > > Well, maybe I'd like it if I saw what the effect of it was, but that > > patch mentions "sync_mm_rss()" which doesn't actually exist > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats > > into percpu_counter") > > So I added a new version of the current aliasing (well, actually pcpu_hot > in the new version) on top of Uros’s patches, and the effect can be seen > in many functions. 
I don’t want to bother with many examples so here is > a common and simple one: > > Currently syscall_exit_work() that starts with: > > 0xffffffff8111e120 <+0>: push %rbp > 0xffffffff8111e121 <+1>: mov %rdi,%rbp > 0xffffffff8111e124 <+4>: push %rbx > 0xffffffff8111e125 <+5>: mov %rsi,%rbx > 0xffffffff8111e128 <+8>: and $0x20,%esi > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35> > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax) > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266> > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax) > > Using the const-alias changes the beginning of syscall_exit_work to: > > 0xffffffff8111cb80 <+0>: push %r12 > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot> > 0xffffffff8111cb8a <+10>: push %rbp > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp > 0xffffffff8111cb8e <+14>: push %rbx > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx > 0xffffffff8111cb92 <+18>: and $0x20,%esi > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38> > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12) > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250> > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12) > > So we both see RIP-relative addressing is being used (hence the instruction is > one byte shorter) and the reload going away. Just a quick remark here: For some reason existing percpu_stable_op asm uses %P operand modifier. This will drop all syntax-specific prefixes and issue the bare constant. It will also remove the (%rip) suffix. What we want here is a generic %a modifier (See 6.47.2.8 Generic Operand Modifiers [1]) that will substitute a memory reference, with the actual operand treated as the address. In combination with "p" constraint will DTRT and will emit symbol with the (%rip) suffix when available, also when -fpie is in effect. 
[1] https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc.pdf Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 16:33 ` Uros Bizjak @ 2023-10-12 16:55 ` Uros Bizjak 2023-10-12 17:10 ` Linus Torvalds 2023-10-12 16:56 ` [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() Linus Torvalds 1 sibling, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-12 16:55 UTC (permalink / raw) To: Nadav Amit Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 12, 2023 at 6:33 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <namit@vmware.com> wrote: > > > > > > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > !! External Email > > > > > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <ubizjak@gmail.com> wrote: > > >> > > >> Reading the above, it looks to me that we don't want to play games > > >> with "const aliased" versions of current_task [1], as proposed by > > >> Nadav in his patch series. > > > > > > Well, maybe I'd like it if I saw what the effect of it was, but that > > > patch mentions "sync_mm_rss()" which doesn't actually exist > > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and > > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats > > > into percpu_counter") > > > > So I added a new version of the current aliasing (well, actually pcpu_hot > > in the new version) on top of Uros’s patches, and the effect can be seen > > in many functions. 
I don’t want to bother with many examples so here is > > a common and simple one: > > > > Currently syscall_exit_work() that starts with: > > > > 0xffffffff8111e120 <+0>: push %rbp > > 0xffffffff8111e121 <+1>: mov %rdi,%rbp > > 0xffffffff8111e124 <+4>: push %rbx > > 0xffffffff8111e125 <+5>: mov %rsi,%rbx > > 0xffffffff8111e128 <+8>: and $0x20,%esi > > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35> > > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax > > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax) > > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266> > > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax > > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax) > > > > Using the const-alias changes the beginning of syscall_exit_work to: > > > > 0xffffffff8111cb80 <+0>: push %r12 > > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot> > > 0xffffffff8111cb8a <+10>: push %rbp > > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp > > 0xffffffff8111cb8e <+14>: push %rbx > > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx > > 0xffffffff8111cb92 <+18>: and $0x20,%esi > > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38> > > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12) > > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250> > > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12) > > > > So we both see RIP-relative addressing is being used (hence the instruction is > > one byte shorter) and the reload going away. > > Just a quick remark here: > > For some reason existing percpu_stable_op asm uses %P operand > modifier. This will drop all syntax-specific prefixes and issue the > bare constant. It will also remove the (%rip) suffix. What we want > here is a generic %a modifier (See 6.47.2.8 Generic Operand Modifiers > [1]) that will substitute a memory reference, with the actual operand > treated as the address. 
> In combination with the "p" constraint it will DTRT and emit the symbol
> with the (%rip) suffix when available, also when -fpie is in effect.

An example:

--cut here--
int m;

int foo (void)
{
  asm ("# %0 %P0 %a0" :: "p" (&m));
}
--cut here--

gcc -O2 -S:		# $m m m(%rip)
gcc -O2 -fpie -S:	# $m m m(%rip)
gcc -O2 -m32 -S:	# $m m m

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 16:55 ` Uros Bizjak @ 2023-10-12 17:10 ` Linus Torvalds 2023-10-12 17:47 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-12 17:10 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> An example:

Oh, I'm convinced.

The fix seems to be a simple one-liner, ie just

-	asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]")	\
+	asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]")	\

and it turns out that we have other places where I think we could use
that '%a'.

For example, we have things like this:

        asm ("lea sme_cmdline_arg(%%rip), %0"
             : "=r" (cmdline_arg)
             : "p" (sme_cmdline_arg));

and I think the only reason we do that ridiculous asm is that the code
in question really does want that (%rip) encoding. It sounds like this
could just do

        asm ("lea %a1, %0"
             : "=r" (cmdline_arg)
             : "p" (sme_cmdline_arg));

instead. Once again, I claim ignorance of the operand modifiers as the
reason for these kinds of things.

But coming back to the stable op thing, I do wonder if there is some
way we could avoid the unnecessary reload.

I don't hate Nadav's patch, so that part is fine, but I'd like to
understand what it is that makes gcc think it needs to reload. We have
other cases (like the ALTERNATIVE() uses) where we *have* to use
inline asm, so it would be good to know...

Is it just that "p" (in the constraint, not "P" in the modifier) ends
up always being seen as a memory access, even when we only use the
address?

That part has never really been something we've been entirely clear
on. We *are* passing in just the address, so the hope in *that* place
is that it's only an address dependency, not a memory one.

(Of course, we use 'p' constraints in other places, and may
expect/assume that those have memory dependency. Again - this has
never been entirely clear)

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 17:10 ` Linus Torvalds @ 2023-10-12 17:47 ` Linus Torvalds 2023-10-12 18:01 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-12 17:47 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The fix seems to be a simple one-liner, ie just
>
> -	asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]")	\
> +	asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]")	\

Nope. That doesn't work at all.

It turns out that we're not the only ones that didn't know about the
'a' modifier.

clang has also never heard of it in this context, and the above
one-liner results in an endless sea of errors, with

    error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'

Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
supposed to handle these things, and while it does have some handling
for 'a', the comment around it says

    case 'a': // This is an address. Currently only 'i' and 'r' are expected.

and I think our use ends up just confusing the heck out of clang. Of
course, clang also does this:

    case 'P': // This is the operand of a call, treat specially.
      PrintPCRelImm(MI, OpNo, O);
      return false;

so clang *already* generates those 'current' accesses as PC-relative,
and I see

    movq %gs:pcpu_hot(%rip), %r13

in the generated code.

End result: clang actually generates what we want just using 'P', and
the whole "P vs a" is only a gcc thing.

Why *does* gcc do that silly thing of dropping '(%rip)' from the
address, btw?

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 17:47 ` Linus Torvalds @ 2023-10-12 18:01 ` Uros Bizjak 2023-10-13 9:38 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-12 18:01 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 12 Oct 2023 at 10:10, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The fix seems to be a simple one-liner, ie just > > > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \ > > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \ > > Nope. That doesn't work at all. > > It turns out that we're not the only ones that didn't know about the > 'a' modifier. > > clang has also never heard of it in this context, and the above > one-liner results in an endless sea of errors, with > > error: invalid operand in inline asm: 'movq %gs:${1:a}, $0' > > Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is > supposed to handle these things, and while it does have some handling > for 'a', the comment around it says > > case 'a': // This is an address. Currently only 'i' and 'r' are expected. > > and I think our use ends up just confusing the heck out of clang. Of > course, clang also does this: > > case 'P': // This is the operand of a call, treat specially. > PrintPCRelImm(MI, OpNo, O); > return false; > > so clang *already* generates those 'current' accesses as PCrelative, and I see > > movq %gs:pcpu_hot(%rip), %r13 > > in the generated code. > > End result: clang actually generates what we want just using 'P', and > the whole "P vs a" is only a gcc thing. 
Ugh, this isn't exactly following Clang's claim that "In general,
Clang is highly compatible with the GCC inline assembly extensions,
allowing the same set of constraints, modifiers and operands as GCC
inline assembly." [1]

[1] https://clang.llvm.org/compatibility.html#inline-asm

> Why *does* gcc do that silly thing of dropping '(%rip)' from the address, btw?

The documentation says:

  [p] Print raw symbol name (without syntax-specific prefixes).

  [P] If used for a function, print the PLT suffix and generate PIC
      code. For example, emit foo@PLT instead of 'foo' for the function
      foo(). If used for a constant, drop all syntax-specific prefixes
      and issue the bare constant. See p above.

I'd say that the "bare constant" is something without (%rip).

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 18:01 ` Uros Bizjak @ 2023-10-13 9:38 ` Uros Bizjak 2023-10-13 11:53 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-13 9:38 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 2767 bytes --] On Thu, Oct 12, 2023 at 8:01 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Thu, 12 Oct 2023 at 10:10, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > The fix seems to be a simple one-liner, ie just > > > > > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \ > > > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \ > > > > Nope. That doesn't work at all. > > > > It turns out that we're not the only ones that didn't know about the > > 'a' modifier. > > > > clang has also never heard of it in this context, and the above > > one-liner results in an endless sea of errors, with > > > > error: invalid operand in inline asm: 'movq %gs:${1:a}, $0' > > > > Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is > > supposed to handle these things, and while it does have some handling > > for 'a', the comment around it says > > > > case 'a': // This is an address. Currently only 'i' and 'r' are expected. > > > > and I think our use ends up just confusing the heck out of clang. Of > > course, clang also does this: > > > > case 'P': // This is the operand of a call, treat specially. > > PrintPCRelImm(MI, OpNo, O); > > return false; > > > > so clang *already* generates those 'current' accesses as PCrelative, and I see > > > > movq %gs:pcpu_hot(%rip), %r13 > > > > in the generated code. 
> > End result: clang actually generates what we want just using 'P', and
> > the whole "P vs a" is only a gcc thing.
>
> Ugh, this isn't exactly following Clang's claim that "In general,
> Clang is highly compatible with the GCC inline assembly extensions,
> allowing the same set of constraints, modifiers and operands as GCC
> inline assembly."

For added fun I obtained some old clang:

$ clang --version
clang version 11.0.0 (Fedora 11.0.0-3.fc33)

and tried to compile this:

int m;
__seg_gs int n;

void foo (void)
{
  asm ("# %a0 %a1" :: "p" (&m), "p" (&n));
  asm ("# %P0 %P1" :: "p" (&m), "p" (&n));
}

clang-11:
	# m n
	# m n
clang-11 -fpie:
	# m(%rip) n(%rip)
	# m n
clang-11 -m32:
	# m n
	# m n

gcc:
	# m(%rip) n(%rip)
	# m n
gcc -fpie:
	# m(%rip) n(%rip)
	# m n
gcc -m32:
	# m n
	# m n

Please find attached a patch that should bring some order to this
issue. The patch includes two demonstration sites; the generated code
for mem_encrypt_identity.c does not change, while the change in
percpu.h brings the expected 4kB code size reduction.

Uros.

[-- Attachment #2: memref.diff.txt --]
[-- Type: text/plain, Size: 2614 bytes --]

diff --git a/arch/x86/include/asm/compiler.h b/arch/x86/include/asm/compiler.h
new file mode 100644
index 000000000000..37c9dea50be6
--- /dev/null
+++ b/arch/x86/include/asm/compiler.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_X86_COMPILER_H
+#define __ASM_X86_COMPILER_H
+
+/*
+ * Substitute a memory reference, with the actual
+ * operand treated as the address.
+ */
+
+#ifdef __clang__
+#define __memref(x) "%P" #x
+#else
+#define __memref(x) "%a" #x
+#endif
+
+#endif /* __ASM_X86_COMPILER_H */
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..92b8d60a3bf5 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -26,6 +26,7 @@
 
 #include <linux/kernel.h>
 #include <linux/stringify.h>
+#include <asm/compiler.h>
 
 #ifdef CONFIG_SMP
@@ -71,7 +72,7 @@
 #define __my_cpu_ptr(ptr) (__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
 #define __my_cpu_var(var) (*__my_cpu_ptr(&var))
 #define __percpu_arg(x) __percpu_prefix "%" #x
-#define __force_percpu_arg(x) __force_percpu_prefix "%" #x
+#define __force_percpu_memref(x) __force_percpu_prefix __memref(x)
 
 /*
  * Initialized pointers to per-cpu variables needed for the boot
@@ -175,7 +176,7 @@ do { \
 #define percpu_stable_op(size, op, _var)				\
 ({									\
	__pcpu_type_##size pfo_val__;					\
-	asm(__pcpu_op2_##size(op, __force_percpu_arg(P[var]), "%[val]")	\
+	asm(__pcpu_op2_##size(op, __force_percpu_memref([var]), "%[val]") \
	    : [val] __pcpu_reg_##size("=", pfo_val__)			\
	    : [var] "p" (&(_var)));					\
	(typeof(_var))(unsigned long) pfo_val__;			\
 })
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index d73aeb16417f..6768f586ab51 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -43,6 +43,7 @@
 
 #include <asm/setup.h>
 #include <asm/sections.h>
+#include <asm/compiler.h>
 #include <asm/cmdline.h>
 #include <asm/coco.h>
 #include <asm/sev.h>
@@ -582,13 +583,13 @@ void __init sme_enable(struct boot_params *bp)
	 * identity mapped, so we must obtain the address to the SME command
	 * line argument data using rip-relative addressing.
	 */
-	asm ("lea sme_cmdline_arg(%%rip), %0"
+	asm ("lea " __memref(1) ", %0"
	     : "=r" (cmdline_arg)
	     : "p" (sme_cmdline_arg));
-	asm ("lea sme_cmdline_on(%%rip), %0"
+	asm ("lea " __memref(1) ", %0"
	     : "=r" (cmdline_on)
	     : "p" (sme_cmdline_on));
-	asm ("lea sme_cmdline_off(%%rip), %0"
+	asm ("lea " __memref(1) ", %0"
	     : "=r" (cmdline_off)
	     : "p" (sme_cmdline_off));

^ permalink raw reply related	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-13 9:38 ` Uros Bizjak @ 2023-10-13 11:53 ` Uros Bizjak 2023-10-13 16:38 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-13 11:53 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 2385 bytes --] On Fri, Oct 13, 2023 at 11:38 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Thu, Oct 12, 2023 at 8:01 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Thu, 12 Oct 2023 at 10:10, Linus Torvalds > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > The fix seems to be a simple one-liner, ie just > > > > > > > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \ > > > > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \ > > > > > > Nope. That doesn't work at all. > > > > > > It turns out that we're not the only ones that didn't know about the > > > 'a' modifier. > > > > > > clang has also never heard of it in this context, and the above > > > one-liner results in an endless sea of errors, with > > > > > > error: invalid operand in inline asm: 'movq %gs:${1:a}, $0' > > > > > > Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is > > > supposed to handle these things, and while it does have some handling > > > for 'a', the comment around it says > > > > > > case 'a': // This is an address. Currently only 'i' and 'r' are expected. > > > > > > and I think our use ends up just confusing the heck out of clang. Of > > > course, clang also does this: > > > > > > case 'P': // This is the operand of a call, treat specially. 
> > >       PrintPCRelImm(MI, OpNo, O);
> > >       return false;
> > >
> > > so clang *already* generates those 'current' accesses as PCrelative, and I see
> > >
> > >     movq %gs:pcpu_hot(%rip), %r13
> > >
> > > in the generated code.
> > >
> > > End result: clang actually generates what we want just using 'P', and
> > > the whole "P vs a" is only a gcc thing.

Maybe we should go with what Clang expects. %a with the "i" constraint
is also what GCC handles, because:

  'i': An immediate integer operand (one with constant value) is
  allowed. This includes symbolic constants whose values will be known
  only at assembly time or later.

The attached patch fixes both cases: the generated code for
mem_encrypt_identity.c does not change, while the change in percpu.h
brings the expected 4kB code size reduction. I think this is the
correct solution that will work for both compilers.

Uros.

[-- Attachment #2: memref-2.diff.txt --]
[-- Type: text/plain, Size: 2021 bytes --]

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..12f37b951fa7 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -175,9 +175,9 @@ do { \
 #define percpu_stable_op(size, op, _var)				\
 ({									\
	__pcpu_type_##size pfo_val__;					\
-	asm(__pcpu_op2_##size(op, __force_percpu_arg(P[var]), "%[val]")	\
+	asm(__pcpu_op2_##size(op, __force_percpu_arg(a[var]), "%[val]")	\
	    : [val] __pcpu_reg_##size("=", pfo_val__)			\
-	    : [var] "p" (&(_var)));					\
+	    : [var] "i" (&(_var)));					\
	(typeof(_var))(unsigned long) pfo_val__;			\
 })
 
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index d73aeb16417f..2b1e0e781f9d 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -346,9 +346,9 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
	 * We're running identity mapped, so we must obtain the address to the
	 * SME encryption workarea using rip-relative addressing.
	 */
-	asm ("lea sme_workarea(%%rip), %0"
+	asm ("lea %a1, %0"
	     : "=r" (workarea_start)
-	     : "p" (sme_workarea));
+	     : "i" (sme_workarea));
 
	/*
	 * Calculate required number of workarea bytes needed:
@@ -582,15 +582,15 @@ void __init sme_enable(struct boot_params *bp)
	 * identity mapped, so we must obtain the address to the SME command
	 * line argument data using rip-relative addressing.
	 */
-	asm ("lea sme_cmdline_arg(%%rip), %0"
+	asm ("lea %a1, %0"
	     : "=r" (cmdline_arg)
-	     : "p" (sme_cmdline_arg));
-	asm ("lea sme_cmdline_on(%%rip), %0"
+	     : "i" (sme_cmdline_arg));
+	asm ("lea %a1, %0"
	     : "=r" (cmdline_on)
-	     : "p" (sme_cmdline_on));
-	asm ("lea sme_cmdline_off(%%rip), %0"
+	     : "i" (sme_cmdline_on));
+	asm ("lea %a1, %0"
	     : "=r" (cmdline_off)
-	     : "p" (sme_cmdline_off));
+	     : "i" (sme_cmdline_off));
 
	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT))
		active_by_default = true;

^ permalink raw reply related	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-13 11:53 ` Uros Bizjak @ 2023-10-13 16:38 ` Linus Torvalds 0 siblings, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-13 16:38 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Fri, 13 Oct 2023 at 04:53, Uros Bizjak <ubizjak@gmail.com> wrote: > > Maybe we should go with what Clang expects. %a with "i" constraint is > also what GCC handles, because > > ‘i’: An immediate integer operand (one with constant value) is > allowed. This includes symbolic constants whose values will be known > only at assembly time or later. This looks fine to me, and would seem to be the simplest way to have both gcc and clang be happy with things. All these uses seem to be immediate addresses, as any actual dynamic ones will use a proper "m" constraint (and no operand modifiers) Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 17:10 ` Linus Torvalds 2023-10-12 17:47 ` Linus Torvalds @ 2023-10-12 17:52 ` Uros Bizjak 2023-11-20 9:39 ` Use %a asm operand modifier to obtain %rip-relative addressing Uros Bizjak 2 siblings, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-12 17:52 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 12, 2023 at 7:10 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > An example: > > Oh, I'm convinced. > > The fix seems to be a simple one-liner, ie just > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \ > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \ The effect of the change: 25542442 4387686 808452 30738580 1d50894 vmlinux-new.o 25546484 4387686 808452 30742622 1d5185e vmlinux-old.o > and it turns out that we have other places where I think we could use that '%a', > > For example, we have things like this: > > asm ("lea sme_cmdline_arg(%%rip), %0" > : "=r" (cmdline_arg) > : "p" (sme_cmdline_arg)); > > and I think the only reason we do that ridiculous asm is that the code > in question really does want that (%rip) encoding. It sounds like this > could just do > > asm ("lea %a1, %0" > : "=r" (cmdline_arg) > : "p" (sme_cmdline_arg)); > > instead. Once again, I claim ignorance of the operand modifiers as the > reason for these kinds of things. > > But coming back to the stable op thing, I do wonder if there is some > way we could avoid the unnecessary reload. > > I don't hate Nadav's patch, so that part is fine, but I'd like to > understand what it is that makes gcc think it needs to reload. 
> We have other cases (like the ALTERNATIVE() uses) where we *have* to
> use inline asm, so it would be good to know...
>
> Is it just that "p" (in the constraint, not "P" in the modifier) ends
> up always being seen as a memory access, even when we only use the
> address?
>
> That part has never really been something we've been entirely clear
> on. We *are* passing in just the address, so the hope in *that* place
> is that it's only an address dependency, not a memory one.

Let's see the difference between:

--cut here--
int m;

void foo (void)
{
  asm ("# %a0" :: "p" (&m));
}

void bar (void)
{
  asm ("# %0" :: "m" (m));
}
--cut here--

The internal dump shows:

(insn:TI 5 2 15 2 (parallel [
            (asm_operands/v ("# %a0") ("") 0 [
                    (symbol_ref:DI ("m") [flags 0x2] <var_decl 0x7f3175011bd0 m>)
                ]
                 [
                    (asm_input:DI ("p") rip.c:5)
                ]
                 [] rip.c:5)
            (clobber (reg:CC 17 flags))
        ]) "rip.c":5:3 -1
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

vs:

(insn:TI 5 2 13 2 (parallel [
            (asm_operands/v ("# %0") ("") 0 [
                    (mem/c:SI (symbol_ref:DI ("m") [flags 0x2] <var_decl 0x7f3175011bd0 m>) [1 m+0 S4 A32])
                ]
                 [
                    (asm_input:SI ("m") rip.c:10)
                ]
                 [] rip.c:10)
            (clobber (reg:CC 17 flags))
        ]) "rip.c":10:3 -1
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

The first argument is internally regarded as "constant":

 -- Macro: CONSTANT_P (X)
     'CONSTANT_P', which is defined by target-independent code, accepts
     integer-values expressions whose values are not explicitly known,
     such as 'symbol_ref', 'label_ref', and 'high' expressions and
     'const' arithmetic expressions, in addition to 'const_int' and
     'const_double' expressions.

So it should not have any dependency. Perhaps a testcase should be
created and posted to gcc-bugs for further analysis.

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Use %a asm operand modifier to obtain %rip-relative addressing 2023-10-12 17:10 ` Linus Torvalds 2023-10-12 17:47 ` Linus Torvalds 2023-10-12 17:52 ` Uros Bizjak @ 2023-11-20 9:39 ` Uros Bizjak 2 siblings, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-11-20 9:39 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Thu, Oct 12, 2023 at 7:10 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > An example:
>
> Oh, I'm convinced.
>
> The fix seems to be a simple one-liner, ie just
>
> -	asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]")	\
> +	asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]")	\
>
> and it turns out that we have other places where I think we could use that '%a',
>
> For example, we have things like this:
>
>         asm ("lea sme_cmdline_arg(%%rip), %0"
>              : "=r" (cmdline_arg)
>              : "p" (sme_cmdline_arg));
>
> and I think the only reason we do that ridiculous asm is that the code
> in question really does want that (%rip) encoding. It sounds like this
> could just do
>
>         asm ("lea %a1, %0"
>              : "=r" (cmdline_arg)
>              : "p" (sme_cmdline_arg));
>
> instead. Once again, I claim ignorance of the operand modifiers as the
> reason for these kinds of things.

I have looked a bit into the above code. From the compiler's PoV, the
above can be written as:

    asm ("lea %1, %0"
         : "=r" (cmdline_arg)
         : "m" (sme_cmdline_arg));

and it will always result in a %rip-relative asm address:

    #APP
    # 585 "arch/x86/mm/mem_encrypt_identity.c" 1
            lea sme_cmdline_arg(%rip), %rsi
    # 0 "" 2
    #NO_APP

Uros.

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 16:33 ` Uros Bizjak 2023-10-12 16:55 ` Uros Bizjak @ 2023-10-12 16:56 ` Linus Torvalds 1 sibling, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-12 16:56 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 12 Oct 2023 at 09:33, Uros Bizjak <ubizjak@gmail.com> wrote: > > For some reason existing percpu_stable_op asm uses %P operand > modifier. This will drop all syntax-specific prefixes and issue the > bare constant. It will also remove the (%rip) suffix. What we want > here is a generic %a modifier (See 6.47.2.8 Generic Operand Modifiers > [1]) that will substitute a memory reference, with the actual operand > treated as the address. In combination with "p" constraint will DTRT > and will emit symbol with the (%rip) suffix when available, also when > -fpie is in effect. Well, I have to admit that I think the main reason we used "p" as a constraint was simply that we knew about it, and I don't think I've ever even realized "a" existed. In fact, we historically didn't use a lot of operand modifiers, and the only common ones (at least for x86) tend to be the "register size" modifiers (b/w/q) I just did git grep 'asm.*[^%]%[a-z][0-9]' arch/x86 and while that will only catch one-liner inline asm cases, all it finds is indeed just the size ones. A slightly smarter grep finds a couple of uses of '%c' for bare constants. So I think the "we used P for percpu_stable_op" is really mostly a "we are not actually very familiar with the operand modifiers". We use the constraints fairly wildly, but the operand modifiers are a relative rarity. I suspect - but it's much too long ago - that some gcc person just told us that "solve this using 'P'" and we never went any further. 
So changing it to use 'a' sounds like the right thing to do, and we can only plead ignorance, not wilful stupidity. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 15:19 ` Nadav Amit 2023-10-12 16:33 ` Uros Bizjak @ 2023-10-12 17:16 ` Linus Torvalds 2023-10-12 19:32 ` Nadav Amit 2023-10-16 18:52 ` Uros Bizjak 2 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-12 17:16 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Thu, 12 Oct 2023 at 08:19, Nadav Amit <namit@vmware.com> wrote:
>
> +/*
> + * Hold a constant alias for current_task, which would allow to avoid caching of
> + * current task.
> + *
> + * We must mark const_current_task with the segment qualifiers, as otherwise gcc
> + * would do redundant reads of const_current_task.
> + */
> +DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot);

Hmm. The only things I'm not super-happy about with your patch are:

 (a) it looks like this depends on the alias analysis knowing that the
__seg_gs isn't affected by normal memory ops. That implies that this
will not work well with compiler versions that don't do that?

 (b) this declaration doesn't match the other one. So now there are
two *different* declarations for const_pcpu_hot, which I really don't
like.

That second one would seem to be trivial to just fix (or maybe not,
and you do it that way for some horrible reason).

The first one sounds bad to me - basically making the *reason* for
this patch go away - but maybe the compilers that don't support
address spaces are so rare that we can ignore it.

              Linus

^ permalink raw reply	[flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 17:16 ` Linus Torvalds @ 2023-10-12 19:32 ` Nadav Amit 2023-10-12 19:40 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Nadav Amit @ 2023-10-12 19:32 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 2805 bytes --] > On Oct 12, 2023, at 8:16 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > !! External Email > > On Thu, 12 Oct 2023 at 08:19, Nadav Amit <namit@vmware.com> wrote: >> >> +/* >> + * Hold a constant alias for current_task, which would allow to avoid caching of >> + * current task. >> + * >> + * We must mark const_current_task with the segment qualifiers, as otherwise gcc >> + * would do redundant reads of const_current_task. >> + */ >> +DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot); > > Hmm. The only things I'm not super-happy about with your patch is > > (a) it looks like this depends on the alias analysis knowing that the > __seg_gs isn't affected by normal memory ops. That implies that this > will not work well with compiler versions that don't do that? > > (b) This declaration doesn't match the other one. So now there are > two *different* declarations for const_pcpu_hot, which I really don't > like. > > That second one would seem to be trivial to just fix (or maybe not, > and you do it that way for some horrible reason). If you refer to the difference between DECLARE_PER_CPU_ALIGNED() and DECLARE_PER_CPU() - that’s just a silly mistake that I made porting my old patch (I also put “const” in the wrong place of the declaration, sorry). 
> > The first one sounds bad to me - basically making the *reason* for > this patch go away - but maybe the compilers that don't support > address spaces are so rare that we can ignore it. As far as I understand it has nothing to do with the address spaces, and IIRC the compiler does not regard gs/fs address spaces as independent from the main one. That’s the reason a compiler barrier affects regular loads with __seg_gs. The “trick” that the patch does is to expose a new const_pcpu_hot symbol that has a “const” qualifier. For compilation units from which the symbol is effectively constant, we use const_pcpu_hot. The compiler then knows that the value would not change. Later, when we actually define the const_pcpu_hot, we tell the compiler using __attribute__((alias("pcpu_hot”)) that this symbol is actually an alias to pcpu_hot. Although it is a bit of a trick that I have never seen elsewhere, I don’t see it violating GCC specifications (“except for top-level qualifiers the alias target must have the same type as the alias” [1]), and there is nothing that is specific to the gs address-space. I still have the concern of its interaction with LTO though, and perhaps using “-fno-lto” when compiling compilation units that modify current (e.g., arch/x86/kernel/process_64.o) is necessary. I hope it makes sense. [1] https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 19:32 ` Nadav Amit @ 2023-10-12 19:40 ` Linus Torvalds 0 siblings, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-12 19:40 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 12 Oct 2023 at 12:32, Nadav Amit <namit@vmware.com> wrote: > > If you refer to the difference between DECLARE_PER_CPU_ALIGNED() and > DECLARE_PER_CPU() - that’s just a silly mistake that I made porting my > old patch (I also put “const” in the wrong place of the declaration, sorry). Yes, the only difference is the ALIGNED and the 'const', but I think also the alias attribute. However, I'd be happier if we had just one place that declares them, not two. Even if the two were identical, it seems wrong to have two declarations for the same thing. > The “trick” that the patch does is to expose a new const_pcpu_hot symbol that has > a “const” qualifier. For compilation units from which the symbol is effectively > constant, we use const_pcpu_hot. The compiler then knows that the value would not > change. Oh, I don't disagree with that part. I just don't see why the 'asm' version would have any difference. For that too, the compiler should see that the result of the asm doesn't change. So my confusion / worry is not about the const alias. I like that part. My worry is literally "in other situations we _have_ to use asm(), and it's not clear why gcc wouldn't do as well for it". Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-12 15:19 ` Nadav Amit 2023-10-12 16:33 ` Uros Bizjak 2023-10-12 17:16 ` Linus Torvalds @ 2023-10-16 18:52 ` Uros Bizjak 2023-10-16 19:24 ` Linus Torvalds 2023-10-16 21:09 ` Uros Bizjak 2 siblings, 2 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-16 18:52 UTC (permalink / raw) To: Nadav Amit Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 7759 bytes --] On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <namit@vmware.com> wrote: > > > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > !! External Email > > > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <ubizjak@gmail.com> wrote: > >> > >> Reading the above, it looks to me that we don't want to play games > >> with "const aliased" versions of current_task [1], as proposed by > >> Nadav in his patch series. > > > > Well, maybe I'd like it if I saw what the effect of it was, but that > > patch mentions "sync_mm_rss()" which doesn't actually exist > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats > > into percpu_counter") > > So I added a new version of the current aliasing (well, actually pcpu_hot > in the new version) on top of Uros’s patches, and the effect can be seen > in many functions. 
I don’t want to bother with many examples so here is > a common and simple one: > > Currently syscall_exit_work() that starts with: > > 0xffffffff8111e120 <+0>: push %rbp > 0xffffffff8111e121 <+1>: mov %rdi,%rbp > 0xffffffff8111e124 <+4>: push %rbx > 0xffffffff8111e125 <+5>: mov %rsi,%rbx > 0xffffffff8111e128 <+8>: and $0x20,%esi > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35> > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax) > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266> > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax) > > Using the const-alias changes the beginning of syscall_exit_work to: > > 0xffffffff8111cb80 <+0>: push %r12 > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot> > 0xffffffff8111cb8a <+10>: push %rbp > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp > 0xffffffff8111cb8e <+14>: push %rbx > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx > 0xffffffff8111cb92 <+18>: and $0x20,%esi > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38> > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12) > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250> > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12) > > So we both see RIP-relative addressing is being used (hence the instruction is > one byte shorter) and the reload going away. > > Now, I am not a compiler expert as for the rationale, but it googling around > I can see Nick explaining the rationale [1] - if you use “p” your read memory. > BTW: It is related to discussion you had [2], in which you encountered an issue > I also encountered before [3]. My bad for pushing it in. > > Anyhow, I created a similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 ) > to show this behavior - how compiler barriers cause reload. It seems that this > behavior happens on GCC and CLANG on various versions. 
> > The idea behind the patch is that it communicates - in the compilation unit > granularity - that current is fixed. There is an issue of whether it works with > LTO, which I have never checked. > > > [1] https://reviews.llvm.org/D145416 > [2] https://lore.kernel.org/lkml/20230306120106.GE1267364@hirez.programming.kicks-ass.net/ > [3] https://lore.kernel.org/all/20190823224424.15296-5-namit@vmware.com/ > > -- > > Here’s the updated patch - but I didn’t really boot a machine with it so new > issues might have come since my last patch-set: Unfortunately, it does not work and dies early in the boot with: [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 4.940090] #PF: supervisor write access in kernel mode [ 4.940090] #PF: error_code(0x0002) - not-present page [ 4.940090] PGD 0 P4D 0 [ 4.940090] Oops: 0002 [#1] PREEMPT SMP PTI [ 4.940090] CPU: 1 PID: 52 Comm: kworker/u4:1 Not tainted 6.6.0-rc6-00365-g0c09c1d70838-dirty #7 [ 4.940090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014 [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30 [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89 ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb [ 4.940090] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246 [ 4.940090] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000 [ 4.940090] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718 [ 4.940090] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff [ 4.940090] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000 [ 4.940090] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000 [ 4.940090] FS: 0000000000000000(0000) GS:ffff994763f00000(0000) knlGS:0000000000000000 [ 4.940090] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4.940090] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0 [ 
4.940090] Call Trace: [ 4.940090] <TASK> [ 4.940090] ? __die+0x1e/0x60 [ 4.940090] ? page_fault_oops+0x17b/0x470 [ 4.940090] ? search_module_extables+0x14/0x50 [ 4.940090] ? exc_page_fault+0x66/0x140 [ 4.940090] ? asm_exc_page_fault+0x26/0x30 [ 4.940090] ? begin_new_exec+0x8f2/0xa30 [ 4.940090] ? begin_new_exec+0x3ce/0xa30 [ 4.940090] ? load_elf_phdrs+0x67/0xb0 [ 4.940090] load_elf_binary+0x2bb/0x1770 [ 4.940090] ? __kernel_read+0x136/0x2d0 [ 4.940090] bprm_execve+0x277/0x630 [ 4.940090] kernel_execve+0x145/0x1a0 [ 4.940090] call_usermodehelper_exec_async+0xcb/0x180 [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 [ 4.940090] ret_from_fork+0x2f/0x50 [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 [ 4.940090] ret_from_fork_asm+0x1b/0x30 [ 4.940090] </TASK> [ 4.940090] Modules linked in: [ 4.940090] CR2: 0000000000000000 [ 5.017606] ---[ end trace 0000000000000000 ]--- [ 5.018957] RIP: 0010:begin_new_exec+0x8f2/0xa30 [ 5.020299] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89 ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb [ 5.024765] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246 [ 5.026150] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000 [ 5.027916] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718 [ 5.029714] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff [ 5.031461] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000 [ 5.033186] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000 [ 5.034908] FS: 0000000000000000(0000) GS:ffff994763f00000(0000) knlGS:0000000000000000 [ 5.036907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5.038341] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0 [ 5.040044] Kernel panic - not syncing: Fatal exception [ 5.040647] Kernel Offset: 0x22e00000 from 0xffffffff81000000 (relocation range: 
0xffffffff80000000-0xffffffffbfffffff) It looks that aliasing a structure from another namespace is a no-go, since the patch (attached, slightly changed your patch) without __percpu_seg_override decorations bootstraps OK. The working patch (without __percpu_seg_override) is not effective (no effect in syscall_exit_work) and increases the number of current_task reads from 3841 to 4711. Uros. [-- Attachment #2: p.diff.txt --] [-- Type: text/plain, Size: 2366 bytes --] diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index a1168e7b69e5..5b46505ff99e 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64); DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot); +/* + * + */ +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, + const_pcpu_hot); + +#ifdef CONFIG_USE_X86_SEG_SUPPORT +static __always_inline struct task_struct *get_current(void) +{ + return this_cpu_read_const(const_pcpu_hot.current_task); +} +#else static __always_inline struct task_struct *get_current(void) { return this_cpu_read_stable(pcpu_hot.current_task); } +#endif #define current get_current() diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 54746903b8c3..268d1ceaa4b2 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -453,6 +453,8 @@ do { \ #define this_cpu_write_8(pcp, val) __raw_cpu_write(volatile, pcp, val) #endif +#define this_cpu_read_const(pcp) __raw_cpu_read(, pcp) + #else /* CONFIG_USE_X86_SEG_SUPPORT */ #define raw_cpu_read_1(pcp) percpu_from_op(1, , "mov", pcp) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index b14fc8c1c953..f284c08aaeca 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2050,6 +2050,10 @@ DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = { }; EXPORT_PER_CPU_SYMBOL(pcpu_hot); +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot 
__percpu_seg_override, + const_pcpu_hot) __attribute__((alias("pcpu_hot"))); +EXPORT_PER_CPU_SYMBOL(const_pcpu_hot); + #ifdef CONFIG_X86_64 DEFINE_PER_CPU_FIRST(struct fixed_percpu_data, fixed_percpu_data) __aligned(PAGE_SIZE) __visible; diff --git a/include/linux/compiler.h b/include/linux/compiler.h index d7779a18b24f..bf9815eaf4aa 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -212,7 +212,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, */ #define ___ADDRESSABLE(sym, __attrs) \ static void * __used __attrs \ - __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)&sym; + __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)(uintptr_t)&sym; #define __ADDRESSABLE(sym) \ ___ADDRESSABLE(sym, __section(".discard.addressable")) ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-16 18:52 ` Uros Bizjak @ 2023-10-16 19:24 ` Linus Torvalds 2023-10-16 20:35 ` Nadav Amit 2023-10-16 21:09 ` Uros Bizjak 1 sibling, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-16 19:24 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Mon, 16 Oct 2023 at 11:53, Uros Bizjak <ubizjak@gmail.com> wrote: > > Unfortunately, it does not work and dies early in the boot with: Side note: build the kernel with debug info (the limited form is sufficient), and then run oopses through ./scripts/decode_stacktrace.sh to get much nicer oops information that has line numbers and inlining information in the backtrace. > [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30 > [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89 > ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea > fa ff ff <f0> 41 ff 0c 24 0f > 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb That decodes to 0: 31 f6 xor %esi,%esi 2: e8 c1 49 f9 ff call 0xfffffffffff949c8 7: e9 3c fa ff ff jmp 0xfffffffffffffa48 c: 31 f6 xor %esi,%esi e: 4c 89 ef mov %r13,%rdi 11: e8 b2 4a f9 ff call 0xfffffffffff94ac8 16: e9 19 fa ff ff jmp 0xfffffffffffffa34 1b: 31 f6 xor %esi,%esi 1d: 4c 89 ef mov %r13,%rdi 20: e8 23 4a f9 ff call 0xfffffffffff94a48 25: e9 ea fa ff ff jmp 0xfffffffffffffb14 2a:* f0 41 ff 0c 24 lock decl (%r12) <-- trapping instruction 2f: 0f 85 55 fb ff ff jne 0xfffffffffffffb8a 35: 4c 89 e7 mov %r12,%rdi 38: e8 4b 02 df ff call 0xffffffffffdf0288 but without a nicer backtrace it's nasty to guess where this is. The "lock decl ; jne" is a good hint, though - that sequence is most definitely "atomic_dec_and_test()". 
And that in turn means that it's almost certainly mmdrop(), which is if (unlikely(atomic_dec_and_test(&mm->mm_count))) __mmdrop(mm); where that 35: 4c 89 e7 mov %r12,%rdi 38: e8 4b 02 df ff call 0xffffffffffdf0288 is exactly the unlikely "__mmdrop(mm)" part (and gcc decided to make the likely branch a branch-out for some reason - presumably with the inlining the code around it meant that was the better layout - maybe this was all inside another "unlikely()" branch). And if I read that right, this has all been inlined from begin_new_exec() -> exec_mmap() -> mmdrop_lazy_tlb(). Now, how and why 'mm' would be NULL in that path, and why any 'current' reloading optimization would matter in all this, I very much can't see. The call site in begin_new_exec() is /* * Release all of the old mmap stuff */ acct_arg_size(bprm, 0); retval = exec_mmap(bprm->mm); if (retval) goto out; bprm->mm = NULL; and "bprm->mm" is most definitely non-NULL there because we earlier did retval = set_mm_exe_file(bprm->mm, bprm->file); using it, and that would have oopsed had bprm->mm been NULL then. So I suspect the problem happened much earlier, caused some nasty internal corruption, and the odd 'mm is NULL' is just a symptom. So there's some serious corruption there, but from the oops itself I can't tell the source. I guess if we get 'current' wrong anywhere, all bets are off. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-16 19:24 ` Linus Torvalds @ 2023-10-16 20:35 ` Nadav Amit 2023-10-16 20:59 ` Linus Torvalds 2023-10-16 23:02 ` Linus Torvalds 0 siblings, 2 replies; 106+ messages in thread From: Nadav Amit @ 2023-10-16 20:35 UTC (permalink / raw) To: Linus Torvalds, Uros Bizjak Cc: the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 16, 2023, at 10:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So there's some serious corruption there, but from the oops itself I > can't tell the source. I guess if we get 'current' wrong anywhere, all > bets are off. I don’t think it means that the aliasing does not work; I think it all means that it actually works *too well*. I have encountered several such issues before [1], and while some have been fixed, some have not (I looked at switch_fpu_finish()), and might, under the right/wrong circumstances, use the wrongly-“cached” current. Moreover, perhaps new problems have been added since my old patch. Perhaps the whack-a-mole approach that I took in [1] is wrong. Instead, perhaps it is better to use an uncached version of current when compiling arch/x86/kernel/process_64.c and arch/x86/kernel/process_32.c, so they would not use the const alias, but instead use the original (non-const) one. Some macro magic and an additional "-D" build flag can do that. But first, there is a need to confirm that’s actually the problem. I’ll try to do it tomorrow. Regards, Nadav [1] https://lore.kernel.org/all/20190823224424.15296-5-namit@vmware.com/ ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-16 20:35 ` Nadav Amit @ 2023-10-16 20:59 ` Linus Torvalds 0 siblings, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-16 20:59 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Mon, 16 Oct 2023 at 13:35, Nadav Amit <namit@vmware.com> wrote: > > I don’t think it means that the aliasing does not work; I think it all means that > it actually works *too well*. Hmm. The *only* case I can think of is __switch_to() itself when doing that raw_cpu_write(pcpu_hot.current_task, next_p); and arguably that part should never have been done in C at all, but here we are. I do think that from a sanity standpoint, it would be good to split "__switch_to()" into two: a "before stack switch" and "after stack switch", and change 'current' from within the asm section (ie do that part in __switch_to_asm). Or maybe we should just keep the single "__switch_to()" function, but make it clear that it happens *after* current has been changed (and maybe rename it to 'finish_switch_to()' or something like that to make it clearer). Because the situation with __switch_to() right now is outright confusing, where it has 'prevp' and 'nextp', but then implicitly uses 'current' in two different ways. Ugh. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-16 20:35 ` Nadav Amit 2023-10-16 20:59 ` Linus Torvalds @ 2023-10-16 23:02 ` Linus Torvalds 2023-10-16 23:14 ` Linus Torvalds 2023-10-17 7:23 ` Nadav Amit 1 sibling, 2 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-16 23:02 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 2448 bytes --] On Mon, 16 Oct 2023 at 13:35, Nadav Amit <namit@vmware.com> wrote: > > I have encountered several such issues before [1], and while some have been fixed, > some have not (I looked at switch_fpu_finish()), and might under the right/wrong > circumstances use the wrongly-“cached” current. Moreover, perhaps new problems > have been added since my old patch. Yeah, that fpu switching is disgusting and borderline buggy. And yes, it would trigger problems when caching the value of 'current'. I don't particularly love the patch you pointed at, because it seems to have only fixed the switch_fpu_finish() case, which is the one that presumably triggered issues, but that's not a very pretty fix. switch_fpu_prepare() has the exact same problem, and in fact is likely the *source* of the issue, because that's the original "load current" that then ends up being cached incorrectly later in __switch_to(). The whole struct fpu *prev_fpu = &prev->fpu; thing in __switch_to() is pretty ugly. There's no reason why we should look at that 'prev_fpu' pointer there, or pass it down. And it only generates worse code, in how it loads 'current' when __switch_to() has the right task pointers. So the attached patch is, I think, the right thing to do.
It may not be the *complete* fix, but at least for the config I tested, this makes all loads of 'current' go away in the resulting generated assembly, and the only access to '%gs:pcpu_hot(%rip)' is the write to update it: movq %rbx, %gs:pcpu_hot(%rip) from that raw_cpu_write(pcpu_hot.current_task, next_p); code. Thomas, I think you've touched this code last, but even that isn't very recent. The attached patch not only cleans this up, it actually generates better code too: (a) it removes one push/pop pair at entry/exit because there's one less register used (no 'current') (b) it removes that pointless load of 'current' because it just uses the right argument: - #APP - movq %gs:pcpu_hot(%rip), %r12 - #NO_APP - testq $16384, (%r12) + testq $16384, (%rdi) so I think this is the right thing to do. I checked that the 32-bit code builds and looks sane too. I do think the 'old/new' naming in the FPU code should probably be 'prev/next' to match the switch_to() naming, but I didn't do that. Comments? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 3686 bytes --] arch/x86/include/asm/fpu/sched.h | 10 ++++++---- arch/x86/kernel/process_32.c | 7 +++---- arch/x86/kernel/process_64.c | 7 +++---- 3 files changed, 12 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h index ca6e5e5f16b2..c485f1944c5f 100644 --- a/arch/x86/include/asm/fpu/sched.h +++ b/arch/x86/include/asm/fpu/sched.h @@ -37,10 +37,12 @@ extern void fpu_flush_thread(void); * The FPU context is only stored/restored for a user task and * PF_KTHREAD is used to distinguish between kernel and user threads.
*/ -static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu) +static inline void switch_fpu_prepare(struct task_struct *old, int cpu) { if (cpu_feature_enabled(X86_FEATURE_FPU) && - !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) { + !(old->flags & (PF_KTHREAD | PF_USER_WORKER))) { + struct fpu *old_fpu = &old->thread.fpu; + save_fpregs_to_fpstate(old_fpu); /* * The save operation preserved register state, so the @@ -60,10 +62,10 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu) * Delay loading of the complete FPU state until the return to userland. * PKRU is handled separately. */ -static inline void switch_fpu_finish(void) +static inline void switch_fpu_finish(struct task_struct *new) { if (cpu_feature_enabled(X86_FEATURE_FPU)) - set_thread_flag(TIF_NEED_FPU_LOAD); + set_tsk_thread_flag(new, TIF_NEED_FPU_LOAD); } #endif /* _ASM_X86_FPU_SCHED_H */ diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 708c87b88cc1..0917c7f25720 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -156,13 +156,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread, *next = &next_p->thread; - struct fpu *prev_fpu = &prev->fpu; int cpu = smp_processor_id(); /* never put a printk in __switch_to... printk() calls wake_up*() indirectly */ - if (!test_thread_flag(TIF_NEED_FPU_LOAD)) - switch_fpu_prepare(prev_fpu, cpu); + if (!test_tsk_thread_flag(prev_p, TIF_NEED_FPU_LOAD)) + switch_fpu_prepare(prev_p, cpu); /* * Save away %gs. No need to save %fs, as it was saved on the @@ -209,7 +208,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) raw_cpu_write(pcpu_hot.current_task, next_p); - switch_fpu_finish(); + switch_fpu_finish(next_p); /* Load the Intel cache allocation PQR MSR. 
*/ resctrl_sched_in(next_p); diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 33b268747bb7..1553e19904e0 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -562,14 +562,13 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread; struct thread_struct *next = &next_p->thread; - struct fpu *prev_fpu = &prev->fpu; int cpu = smp_processor_id(); WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) && this_cpu_read(pcpu_hot.hardirq_stack_inuse)); - if (!test_thread_flag(TIF_NEED_FPU_LOAD)) - switch_fpu_prepare(prev_fpu, cpu); + if (!test_tsk_thread_flag(prev_p, TIF_NEED_FPU_LOAD)) + switch_fpu_prepare(prev_p, cpu); /* We must save %fs and %gs before load_TLS() because * %fs and %gs may be cleared by load_TLS(). @@ -623,7 +622,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) raw_cpu_write(pcpu_hot.current_task, next_p); raw_cpu_write(pcpu_hot.top_of_stack, task_top_of_stack(next_p)); - switch_fpu_finish(); + switch_fpu_finish(next_p); /* Reload sp0. */ update_task_stack(next_p); ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-16 23:02 ` Linus Torvalds @ 2023-10-16 23:14 ` Linus Torvalds 2023-10-17 7:23 ` Nadav Amit 1 sibling, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-16 23:14 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Mon, 16 Oct 2023 at 16:02, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > so I think this is the right thing to do. I checked that the 32-bit > code builds and looks sane too. Just to clarify: the 64-bit side I actually booted and am running. The 32-bit side is pretty much identical, but I only checked that that 'process_32.c' file still builds. I didn't do any other testing. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-16 23:02 ` Linus Torvalds 2023-10-16 23:14 ` Linus Torvalds @ 2023-10-17 7:23 ` Nadav Amit 2023-10-17 19:00 ` Linus Torvalds 1 sibling, 1 reply; 106+ messages in thread From: Nadav Amit @ 2023-10-17 7:23 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 17, 2023, at 2:02 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > - #APP > - movq %gs:pcpu_hot(%rip), %r12 > - #NO_APP > - testq $16384, (%r12) > + testq $16384, (%rdi) > > so I think this is the right thing to do. I checked that the 32-bit > code builds and looks sane too. > > I do think the 'old/new' naming in the FPU code should probably be > 'prev/next' to match the switch_to() naming, but I didn't do that. > > Comments? Yes, the FPU issue is the one that caused me to crash before. I indeed missed the switch_fpu_prepare(). The other issue that I encountered before, with __resctrl_sched_in(), has already been taken care of. It would have been nice to somehow prevent such a thing from reoccurring. Presumably objtool could have ensured it is so. But anyhow, I do not know of any other currently open issues. This whole thing (in addition to Uros’s analysis and objdump numbers) shows that the const-alias allows much more aggressive optimizations than the current this_cpu_read_stable(). ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-17 7:23 ` Nadav Amit @ 2023-10-17 19:00 ` Linus Torvalds 2023-10-17 19:11 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-17 19:00 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Tue, 17 Oct 2023 at 00:23, Nadav Amit <namit@vmware.com> wrote: > > Yes, the FPU issue is the one that caused me to crash before. Uros, can you verify whether that patch of mine resolves the issue you saw? That patch is _technically_ an actual bug-fix, although right now our existing 'current' caching that depends on just CSE'ing the inline asm (and is apparently limited to only doing so within single basic blocks) doesn't actually trigger the bug in our __switch_to() logic in practice. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-17 19:00 ` Linus Torvalds @ 2023-10-17 19:11 ` Uros Bizjak 2023-10-17 21:05 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-17 19:11 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 4663 bytes --] On Tue, Oct 17, 2023 at 9:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, 17 Oct 2023 at 00:23, Nadav Amit <namit@vmware.com> wrote: > > > > Yes, the FPU issue is the one that caused me to crash before. > > Uros, can you verify whether that patch of mine resolves the issue you saw? > > That patch is _technically_ an actual bug-fix, although right now our > existing 'current' caching that depends on just CSE'ing the inline asm > (and is apparently limited to only doing so within single basic > blocks) doesn't actually trigger the bug in our __switch_to() logic in > practice. 
Unfortunately, it doesn't fix the oops :( I'm testing your patch, together with the attached patch with the current tip tree (that already has all necessary percpu stuff), and get exactly the same oops in: [ 4.969657] cfg80211: Loading compiled-in X.509 certificates for regulatory database [ 4.980712] modprobe (53) used greatest stack depth: 13480 bytes left [ 4.981048] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 4.981830] #PF: supervisor write access in kernel mode [ 4.981830] #PF: error_code(0x0002) - not-present page [ 4.981830] PGD 0 P4D 0 [ 4.981830] Oops: 0002 [#1] PREEMPT SMP PTI [ 4.981830] CPU: 1 PID: 54 Comm: kworker/u4:1 Not tainted 6.6.0-rc6-00406-g84ab57184ff4-dirty #2 [ 4.981830] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014 [ 4.981830] RIP: 0010:begin_new_exec+0x8f2/0xa30 [ 4.981830] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89 ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb [ 4.981830] RSP: 0000:ffffa505401f3d68 EFLAGS: 00010246 [ 4.981830] RAX: 0000000000000000 RBX: ffff89ed809e9f00 RCX: 0000000000000000 [ 4.981830] RDX: 0000000000000000 RSI: ffff89ed80e6c000 RDI: ffff89ed809ea718 [ 4.981830] RBP: ffff89ed8039ee00 R08: 00000000fffffffe R09: 00000000ffffffff [ 4.981830] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000 [ 4.981830] R13: 0000000000000000 R14: ffff89ed809ea718 R15: ffff89ed80e6c000 [ 4.981830] FS: 0000000000000000(0000) GS:ffff89ee24900000(0000) knlGS:0000000000000000 [ 4.981830] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4.981830] CR2: 0000000000000000 CR3: 00000001003a0000 CR4: 00000000000406f0 [ 4.981830] Call Trace: [ 4.981830] <TASK> [ 4.981830] ? __die+0x1e/0x60 [ 4.981830] ? page_fault_oops+0x17b/0x470 [ 4.981830] ? search_module_extables+0x14/0x50 [ 4.981830] ? exc_page_fault+0x66/0x140 [ 4.981830] ? 
asm_exc_page_fault+0x26/0x30 [ 4.981830] ? begin_new_exec+0x8f2/0xa30 [ 4.981830] ? begin_new_exec+0x3ce/0xa30 [ 4.981830] ? load_elf_phdrs+0x67/0xb0 [ 4.981830] load_elf_binary+0x2bb/0x1770 [ 4.981830] ? __kernel_read+0x136/0x2d0 [ 4.981830] bprm_execve+0x277/0x630 [ 4.981830] kernel_execve+0x145/0x1a0 [ 4.981830] call_usermodehelper_exec_async+0xcb/0x180 [ 4.981830] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 [ 4.981830] ret_from_fork+0x2f/0x50 [ 4.981830] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 [ 4.981830] ret_from_fork_asm+0x1b/0x30 [ 4.981830] </TASK> [ 4.981830] Modules linked in: [ 4.981830] CR2: 0000000000000000 [ 5.052612] ---[ end trace 0000000000000000 ]--- [ 5.053833] RIP: 0010:begin_new_exec+0x8f2/0xa30 [ 5.055065] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89 ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb [ 5.059476] RSP: 0000:ffffa505401f3d68 EFLAGS: 00010246 [ 5.060780] RAX: 0000000000000000 RBX: ffff89ed809e9f00 RCX: 0000000000000000 [ 5.062483] RDX: 0000000000000000 RSI: ffff89ed80e6c000 RDI: ffff89ed809ea718 [ 5.064190] RBP: ffff89ed8039ee00 R08: 00000000fffffffe R09: 00000000ffffffff [ 5.065908] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000 [ 5.067625] R13: 0000000000000000 R14: ffff89ed809ea718 R15: ffff89ed80e6c000 [ 5.069343] FS: 0000000000000000(0000) GS:ffff89ee24900000(0000) knlGS:0000000000000000 [ 5.071313] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5.072732] CR2: 0000000000000000 CR3: 00000001003a0000 CR4: 00000000000406f0 [ 5.074439] Kernel panic - not syncing: Fatal exception [ 5.075028] Kernel Offset: 0xcc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Uros. 
[-- Attachment #2: current.diff.txt --] [-- Type: text/plain, Size: 1879 bytes --] diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index a1168e7b69e5..21e8bd4ea44e 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64); DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot); +/* + * + */ +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, + const_pcpu_hot); + +#ifdef CONFIG_USE_X86_SEG_SUPPORT +static __always_inline struct task_struct *get_current(void) +{ + return const_pcpu_hot.current_task; +} +#else static __always_inline struct task_struct *get_current(void) { return this_cpu_read_stable(pcpu_hot.current_task); } +#endif #define current get_current() diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index b14fc8c1c953..f284c08aaeca 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2050,6 +2050,10 @@ DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = { }; EXPORT_PER_CPU_SYMBOL(pcpu_hot); +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, + const_pcpu_hot) __attribute__((alias("pcpu_hot"))); +EXPORT_PER_CPU_SYMBOL(const_pcpu_hot); + #ifdef CONFIG_X86_64 DEFINE_PER_CPU_FIRST(struct fixed_percpu_data, fixed_percpu_data) __aligned(PAGE_SIZE) __visible; diff --git a/include/linux/compiler.h b/include/linux/compiler.h index d7779a18b24f..bf9815eaf4aa 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -212,7 +212,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, */ #define ___ADDRESSABLE(sym, __attrs) \ static void * __used __attrs \ - __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)&sym; + __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)(uintptr_t)&sym; #define __ADDRESSABLE(sym) \ ___ADDRESSABLE(sym, __section(".discard.addressable")) ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-17 19:11 ` Uros Bizjak @ 2023-10-17 21:05 ` Uros Bizjak 2023-10-17 21:53 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-17 21:05 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 1025 bytes --] On Tue, Oct 17, 2023 at 9:11 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Tue, Oct 17, 2023 at 9:00 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Tue, 17 Oct 2023 at 00:23, Nadav Amit <namit@vmware.com> wrote: > > > > > > Yes, the FPU issue is the one that caused me to crash before. > > > > Uros, can you verify whether that patch of mine resolves the issue you saw? > > > > That patch is _technically_ an actual bug-fix, although right now our > > existing 'current' caching that depends on just CSE'ing the inline asm > > (and is apparently limited to only doing so within single basic > > blocks) doesn't actually trigger the bug in our __switch_to() logic in > > practice. > > Unfortunately, it doesn't fix the oops :( > > I'm testing your patch, together with the attached patch with the > current tip tree (that already has all necessary percpu stuff), and > get exactly the same oops in: But adding the attached patch on top of both patches boots OK. Uros. [-- Attachment #2: p.diff.txt --] [-- Type: text/plain, Size: 302 bytes --] diff --git a/fs/exec.c b/fs/exec.c index 6518e33ea813..3e810badf9ab 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1037,7 +1037,8 @@ static int exec_mmap(struct mm_struct *mm) mmput(old_mm); return 0; } - mmdrop_lazy_tlb(active_mm); + if (active_mm) + mmdrop_lazy_tlb(active_mm); return 0; } ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-17 21:05 ` Uros Bizjak @ 2023-10-17 21:53 ` Linus Torvalds 2023-10-17 22:06 ` Nadav Amit 2023-10-18 7:46 ` Uros Bizjak 0 siblings, 2 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-17 21:53 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <ubizjak@gmail.com> wrote: > > But adding the attached patch on top of both patches boots OK. Funky. Mind adding a WARN_ON_ONCE(!active_mm); to there to give a nice backtrace for the odd NULL case. That code *is* related to 'current', in how we do tsk = current; ... local_irq_disable(); active_mm = tsk->active_mm; tsk->active_mm = mm; tsk->mm = mm; ... activate_mm(active_mm, mm); ... mmdrop_lazy_tlb(active_mm); but I don't see how 'active_mm' could *possibly* be validly NULL here, and why caching 'current' would matter and change it. Strange. Hmm. We do set tsk->active_mm = NULL; in copy_mm(), and then we have that odd kernel thread case: /* * Are we cloning a kernel thread? * * We need to steal a active VM for that.. */ oldmm = current->mm; if (!oldmm) return 0; but none of this should even matter, because by the time we actually *schedule* that thread, we'll set active_mm to the right thing. Can anybody see what's up? Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-17 21:53 ` Linus Torvalds @ 2023-10-17 22:06 ` Nadav Amit 2023-10-17 22:29 ` Nadav Amit 2023-10-18 7:46 ` Uros Bizjak 1 sibling, 1 reply; 106+ messages in thread From: Nadav Amit @ 2023-10-17 22:06 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 18, 2023, at 12:53 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > but none of this should even matter, because by the time we actually > *schedule* that thread, we'll set active_mm to the right thing. > > Can anybody see what's up? Could it be related to exec_mmap() -> exec_mm_release() -> mm_release() -> deactivate_mm() ? #define deactivate_mm(tsk, mm) \ do { \ if (!tsk->vfork_done) \ shstk_free(tsk); \ load_gs_index(0); \ loadsegment(fs, 0); \ } while (0) We change gs_index(), so perhaps it affects later GS reads. There is also this X86_BUG_NULL_SEG. Need to dive deeper; just initial thoughts though (i.e., I might be completely off). ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-17 22:06 ` Nadav Amit @ 2023-10-17 22:29 ` Nadav Amit 0 siblings, 0 replies; 106+ messages in thread From: Nadav Amit @ 2023-10-17 22:29 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 18, 2023, at 1:06 AM, Nadav Amit <namit@vmware.com> wrote: > > > >> On Oct 18, 2023, at 12:53 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> >> but none of this should even matter, because by the time we actually >> *schedule* that thread, we'll set active_mm to the right thing. >> >> Can anybody see what's up? > > Could it be related to exec_mmap() -> exec_mm_release() -> mm_release() -> deactivate_mm() ? I am probably completely wrong. Sorry for the noise. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-17 21:53 ` Linus Torvalds 2023-10-17 22:06 ` Nadav Amit @ 2023-10-18 7:46 ` Uros Bizjak 2023-10-18 9:04 ` Uros Bizjak 1 sibling, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 7:46 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Tue, Oct 17, 2023 at 11:53 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > But adding the attached patch on top of both patches boots OK. > > Funky. > > Mind adding a > > WARN_ON_ONCE(!active_mm); > > to there to give a nice backtrace for the odd NULL case. [ 4.907840] Call Trace: [ 4.908909] <TASK> [ 4.909858] ? __warn+0x7b/0x120 [ 4.911108] ? begin_new_exec+0x90f/0xa30 [ 4.912602] ? report_bug+0x164/0x190 [ 4.913929] ? handle_bug+0x3c/0x70 [ 4.915179] ? exc_invalid_op+0x17/0x70 [ 4.916569] ? asm_exc_invalid_op+0x1a/0x20 [ 4.917969] ? begin_new_exec+0x90f/0xa30 [ 4.919303] ? begin_new_exec+0x3ce/0xa30 [ 4.920667] ? load_elf_phdrs+0x67/0xb0 [ 4.921935] load_elf_binary+0x2bb/0x1770 [ 4.923262] ? __kernel_read+0x136/0x2d0 [ 4.924563] bprm_execve+0x277/0x630 [ 4.925703] kernel_execve+0x145/0x1a0 [ 4.926890] call_usermodehelper_exec_async+0xcb/0x180 [ 4.928408] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 [ 4.930515] ret_from_fork+0x2f/0x50 [ 4.931894] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 [ 4.933941] ret_from_fork_asm+0x1b/0x30 [ 4.935371] </TASK> [ 4.936212] ---[ end trace 0000000000000000 ]--- > > That code *is* related to 'current', in how we do > > tsk = current; > ... > local_irq_disable(); > active_mm = tsk->active_mm; > tsk->active_mm = mm; > tsk->mm = mm; > ... > activate_mm(active_mm, mm); > ... 
> mmdrop_lazy_tlb(active_mm); > > but I don't see how 'active_mm' could *possibly* be validly NULL > here, and why caching 'current' would matter and change it. I have also added "__attribute__((optimize(0)))" to exec_mmap() to weed out compiler bugs. The result was the same oops in mmdrop_lazy_tlb. Also, when using WARN_ON instead of WARN_ON_ONCE, it triggers only once during the whole boot, with the above trace. Another observation: adding WARN_ON to the top of exec_mmap: WARN_ON(!current->active_mm); /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; also triggers WARN, suggesting that current does not have active_mm set on the entry to the function. Uros. > Strange. > > Hmm. We do set > > tsk->active_mm = NULL; > > in copy_mm(), and then we have that odd kernel thread case: > > /* > * Are we cloning a kernel thread? > * > * We need to steal a active VM for that.. > */ > oldmm = current->mm; > if (!oldmm) > return 0; > > but none of this should even matter, because by the time we actually > *schedule* that thread, we'll set active_mm to the right thing. > > Can anybody see what's up? > > Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 7:46 ` Uros Bizjak @ 2023-10-18 9:04 ` Uros Bizjak 2023-10-18 10:54 ` Nadav Amit 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 9:04 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 9:46 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Tue, Oct 17, 2023 at 11:53 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > But adding the attached patch on top of both patches boots OK. > > > > Funky. > > > > Mind adding a > > > > WARN_ON_ONCE(!active_mm); > > > > to there to give a nice backtrace for the odd NULL case. > > [ 4.907840] Call Trace: > [ 4.908909] <TASK> > [ 4.909858] ? __warn+0x7b/0x120 > [ 4.911108] ? begin_new_exec+0x90f/0xa30 > [ 4.912602] ? report_bug+0x164/0x190 > [ 4.913929] ? handle_bug+0x3c/0x70 > [ 4.915179] ? exc_invalid_op+0x17/0x70 > [ 4.916569] ? asm_exc_invalid_op+0x1a/0x20 > [ 4.917969] ? begin_new_exec+0x90f/0xa30 > [ 4.919303] ? begin_new_exec+0x3ce/0xa30 > [ 4.920667] ? load_elf_phdrs+0x67/0xb0 > [ 4.921935] load_elf_binary+0x2bb/0x1770 > [ 4.923262] ? __kernel_read+0x136/0x2d0 > [ 4.924563] bprm_execve+0x277/0x630 > [ 4.925703] kernel_execve+0x145/0x1a0 > [ 4.926890] call_usermodehelper_exec_async+0xcb/0x180 > [ 4.928408] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 > [ 4.930515] ret_from_fork+0x2f/0x50 > [ 4.931894] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 > [ 4.933941] ret_from_fork_asm+0x1b/0x30 > [ 4.935371] </TASK> > [ 4.936212] ---[ end trace 0000000000000000 ]--- > > > > > That code *is* related to 'current', in how we do > > > > tsk = current; > > ... 
> > local_irq_disable(); > > active_mm = tsk->active_mm; > > tsk->active_mm = mm; > > tsk->mm = mm; > > ... > > activate_mm(active_mm, mm); > > ... > > mmdrop_lazy_tlb(active_mm); > > > > but I don't see how 'active_mm' could *possibly* be validly NULL > > here, and why caching 'current' would matter and change it. > > I have also added "__attribute__((optimize(0)))" to exec_mmap() to > weed out compiler bugs. The result was the same oops in > mmdrop_lazy_tlb. > > Also, when using WARN_ON instead of WARN_ON_ONCE, it triggers only > once during the whole boot, with the above trace. > > Another observation: adding WARN_ON to the top of exec_mmap: > > WARN_ON(!current->active_mm); > /* Notify parent that we're no longer interested in the old VM */ > tsk = current; > old_mm = current->mm; > > also triggers WARN, suggesting that current does not have active_mm > set on the entry to the function. Solved. All that is needed is to patch cpu_init() from arch/x86/kernel/cpu/common.c with: --cut here-- diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index b14fc8c1c953..61b6fcdf6937 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void) */ void cpu_init(void) { - struct task_struct *cur = current; + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task); int cpu = raw_smp_processor_id(); #ifdef CONFIG_NUMA --cut here-- This is effectively the old get_current(). Since we declare and export +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, + const_pcpu_hot) __attribute__((alias("pcpu_hot"))); +EXPORT_PER_CPU_SYMBOL(const_pcpu_hot); in the same file, and the "new" current represents just return const_pcpu_hot.current_task; GCC assumes and over-optimizes something and seemingly doesn't fully initialize the cur->active_mm = &init_mm; below. Have to run now, but this will be easy to fix. Uros.
^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 9:04 ` Uros Bizjak @ 2023-10-18 10:54 ` Nadav Amit 2023-10-18 12:14 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Nadav Amit @ 2023-10-18 10:54 UTC (permalink / raw) To: Uros Bizjak Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 18, 2023, at 12:04 PM, Uros Bizjak <ubizjak@gmail.com> wrote: > > Solved. > > All that is needed is to patch cpu_init() from > arch/x86/kernel/cpu/common.c with: > > --cut here-- > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c > index b14fc8c1c953..61b6fcdf6937 100644 > --- a/arch/x86/kernel/cpu/common.c > +++ b/arch/x86/kernel/cpu/common.c > @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void) > */ > void cpu_init(void) > { > - struct task_struct *cur = current; > + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task); > int cpu = raw_smp_processor_id(); Thanks for solving that, and sorry that I missed it. The reason I didn’t encounter it before is that in my original patch I created a new compilation unit which only defined the alias. Since there might be additional problems (any “current” use in common.c is dangerous, even in included files), I think that while there may be additional solutions, defining the alias in a separate compilation unit - as I did before - is the safest. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 10:54 ` Nadav Amit @ 2023-10-18 12:14 ` Uros Bizjak 2023-10-18 13:15 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 12:14 UTC (permalink / raw) To: Nadav Amit Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 12:54 PM Nadav Amit <namit@vmware.com> wrote: > > > > > On Oct 18, 2023, at 12:04 PM, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > Solved. > > > > All that is needed is to patch cpu_init() from > > arch/x86/kernel/cpu/common.c with: > > > > --cut here-- > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c > > index b14fc8c1c953..61b6fcdf6937 100644 > > --- a/arch/x86/kernel/cpu/common.c > > +++ b/arch/x86/kernel/cpu/common.c > > @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void) > > */ > > void cpu_init(void) > > { > > - struct task_struct *cur = current; > > + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task); > > int cpu = raw_smp_processor_id(); > > Thanks for solving that, and sorry that I missed it. > > The reason I didn’t encounter it before is that in my original patch I created > a new compilation unit which only defined the alias. > > Since there might be additional problems (any “current” use in common.c is > dangerous, even in included files), I think that while there may be additional > solutions, defining the alias in a separate compilation unit - as I did before - > is the safest. 
What happens here can be illustrated with the following testcase: --cut here-- int init_mm; struct task_struct { int *active_mm; }; struct task_struct init_task; struct pcpu_hot { struct task_struct *current_task; }; struct pcpu_hot pcpu_hot = { .current_task = &init_task }; extern const struct pcpu_hot __seg_gs const_pcpu_hot __attribute__((alias("pcpu_hot"))); void foo (void) { struct task_struct *cur = const_pcpu_hot.current_task; cur->active_mm = &init_mm; } --cut here-- gcc -O2 -S: foo: movq $init_mm, init_task(%rip) ret Here, gcc optimizes the access to generic address space, which is allowed to, since *we set the alias to pcpu_hot*, which is in the generic address space. The compiler doesn't care that we actually want: foo: movq %gs:const_pcpu_hot(%rip), %rax movq $init_mm, (%rax) So yes, to prevent the optimization, we have to hide the alias in another TU. BTW: Clang creates: foo: movq %gs:pcpu_hot(%rip), %rax movq $init_mm, (%rax) retq It is a bit more conservative and retains the address space of the aliasing symbol. Looks like another case of underspecified functionality where both compilers differ. Luckily, both DTRT when aliases are hidden in another TU. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
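The alias behaviour Uros illustrates can also be reproduced with a plain alias, without any named address space involved. A userspace sketch (names are illustrative stand-ins; on GCC the alias target must be defined in the same translation unit, and because the target is visible, accesses through the alias are folded into direct accesses to it — the folding that drops the %gs prefix once an address space is attached to one side only):

```c
#include <assert.h>

/* alias target; GCC rejects the alias outright if this definition is
 * not visible in the same translation unit */
int pcpu_hot_val = 42;

/* const-qualified alias of the same object -- the compiler can see the
 * target here, so it is free to resolve reads through the alias into
 * direct reads of pcpu_hot_val */
extern const int const_pcpu_hot_val __attribute__((alias("pcpu_hot_val")));

static int read_via_alias(void)
{
	return const_pcpu_hot_val;
}
```

Both names denote the same storage, which is exactly why hiding the alias definition in another TU (as the patch that follows does) is what keeps the compiler from seeing through it.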
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 12:14 ` Uros Bizjak @ 2023-10-18 13:15 ` Uros Bizjak 2023-10-18 14:46 ` Nadav Amit 2023-10-18 16:12 ` Linus Torvalds 0 siblings, 2 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 13:15 UTC (permalink / raw) To: Nadav Amit Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 2942 bytes --] On Wed, Oct 18, 2023 at 2:14 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Wed, Oct 18, 2023 at 12:54 PM Nadav Amit <namit@vmware.com> wrote: > > > > > > > > > On Oct 18, 2023, at 12:04 PM, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > Solved. > > > > > > All that is needed is to patch cpu_init() from > > > arch/x86/kernel/cpu/common.c with: > > > > > > --cut here-- > > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c > > > index b14fc8c1c953..61b6fcdf6937 100644 > > > --- a/arch/x86/kernel/cpu/common.c > > > +++ b/arch/x86/kernel/cpu/common.c > > > @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void) > > > */ > > > void cpu_init(void) > > > { > > > - struct task_struct *cur = current; > > > + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task); > > > int cpu = raw_smp_processor_id(); > > > > Thanks for solving that, and sorry that I missed it. > > > > The reason I didn’t encounter it before is that in my original patch I created > > a new compilation unit which only defined the alias. > > > > Since there might be additional problems (any “current” use in common.c is > > dangerous, even in included files), I think that while there may be additional > > solutions, defining the alias in a separate compilation unit - as I did before - > > is the safest. 
> > What happens here can be illustrated with the following testcase: > > --cut here-- > int init_mm; > > struct task_struct > { > int *active_mm; > }; > > struct task_struct init_task; > > struct pcpu_hot > { > struct task_struct *current_task; > }; > > struct pcpu_hot pcpu_hot = { .current_task = &init_task }; > > extern const struct pcpu_hot __seg_gs const_pcpu_hot > __attribute__((alias("pcpu_hot"))); > > void foo (void) > { > struct task_struct *cur = const_pcpu_hot.current_task; > > cur->active_mm = &init_mm; > } > --cut here-- > > gcc -O2 -S: > > foo: > movq $init_mm, init_task(%rip) > ret > > Here, gcc optimizes the access to generic address space, which is > allowed to, since *we set the alias to pcpu_hot*, which is in the > generic address space. The compiler doesn't care that we actually > want: > > foo: > movq %gs:const_pcpu_hot(%rip), %rax > movq $init_mm, (%rax) > > So yes, to prevent the optimization, we have to hide the alias in another TU. > > BTW: Clang creates: > > foo: > movq %gs:pcpu_hot(%rip), %rax > movq $init_mm, (%rax) > retq > > It is a bit more conservative and retains the address space of the > aliasing symbol. > > Looks like another case of underspecified functionality where both > compilers differ. Luckily, both DTRT when aliases are hidden in > another TU. Attached is the prototype patch that works for me (together with Linus' FPU switching patch). Uros. 
[-- Attachment #2: current.diff.txt --] [-- Type: text/plain, Size: 2968 bytes --] diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index a1168e7b69e5..21e8bd4ea44e 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64); DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot); +/* + * + */ +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, + const_pcpu_hot); + +#ifdef CONFIG_USE_X86_SEG_SUPPORT +static __always_inline struct task_struct *get_current(void) +{ + return const_pcpu_hot.current_task; +} +#else static __always_inline struct task_struct *get_current(void) { return this_cpu_read_stable(pcpu_hot.current_task); } +#endif #define current get_current() diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index 93eabf544031..a1d17578f5c1 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -19,6 +19,7 @@ KCSAN_SANITIZE_common.o := n obj-y := cacheinfo.o scattered.o topology.o obj-y += common.o +obj-y += percpu-hot.o obj-y += rdrand.o obj-y += match.o obj-y += bugs.o diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index b14fc8c1c953..fb9e106467ba 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -2043,13 +2043,6 @@ static __init int setup_clearcpuid(char *arg) } __setup("clearcpuid=", setup_clearcpuid); -DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = { - .current_task = &init_task, - .preempt_count = INIT_PREEMPT_COUNT, - .top_of_stack = TOP_OF_INIT_STACK, -}; -EXPORT_PER_CPU_SYMBOL(pcpu_hot); - #ifdef CONFIG_X86_64 DEFINE_PER_CPU_FIRST(struct fixed_percpu_data, fixed_percpu_data) __aligned(PAGE_SIZE) __visible; diff --git a/arch/x86/kernel/cpu/percpu-hot.c b/arch/x86/kernel/cpu/percpu-hot.c new file mode 100644 index 000000000000..eca558a3845a --- /dev/null +++ b/arch/x86/kernel/cpu/percpu-hot.c @@ -0,0 +1,15 @@ +// 
SPDX-License-Identifier: GPL-2.0+ + +#include <linux/sched/task.h> +#include <asm/current.h> + +DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = { + .current_task = &init_task, + .preempt_count = INIT_PREEMPT_COUNT, + .top_of_stack = TOP_OF_INIT_STACK, +}; +EXPORT_PER_CPU_SYMBOL(pcpu_hot); + +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, + const_pcpu_hot) __attribute__((alias("pcpu_hot"))); +EXPORT_PER_CPU_SYMBOL(const_pcpu_hot); diff --git a/include/linux/compiler.h b/include/linux/compiler.h index d7779a18b24f..bf9815eaf4aa 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -212,7 +212,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, */ #define ___ADDRESSABLE(sym, __attrs) \ static void * __used __attrs \ - __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)&sym; + __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)(uintptr_t)&sym; #define __ADDRESSABLE(sym) \ ___ADDRESSABLE(sym, __section(".discard.addressable")) ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 13:15 ` Uros Bizjak @ 2023-10-18 14:46 ` Nadav Amit 2023-10-18 15:17 ` Uros Bizjak 2023-10-18 16:12 ` Linus Torvalds 1 sibling, 1 reply; 106+ messages in thread From: Nadav Amit @ 2023-10-18 14:46 UTC (permalink / raw) To: Uros Bizjak Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 18, 2023, at 4:15 PM, Uros Bizjak <ubizjak@gmail.com> wrote: > >> >> Looks like another case of underspecified functionality where both >> compilers differ. Luckily, both DTRT when aliases are hidden in >> another TU. > > Attached is the prototype patch that works for me (together with > Linus' FPU switching patch). In general looks good. See some minor issues below. > --- a/arch/x86/include/asm/current.h > +++ b/arch/x86/include/asm/current.h > @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64); > > DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot); > > +/* > + * > + */ Obviously some further comments to clarify why struct pcpu_hot is defined in percpu-hot.c (the GCC manual says: "It is an error if the alias target is not defined in the same translation unit as the alias” which can be used as part of the explanation.) > +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, > + const_pcpu_hot); > + > +#ifdef CONFIG_USE_X86_SEG_SUPPORT > +static __always_inline struct task_struct *get_current(void) > +{ > + return const_pcpu_hot.current_task; > +} > +#else > static __always_inline struct task_struct *get_current(void) > { > return this_cpu_read_stable(pcpu_hot.current_task); > } > +#endif Please consider using IS_ENABLED() to avoid the ifdef’ry. 
So this would turn out to be:

static __always_inline struct task_struct *get_current(void)
{
	if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
		return const_pcpu_hot.current_task;

	return this_cpu_read_stable(pcpu_hot.current_task);
}

^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 14:46 ` Nadav Amit @ 2023-10-18 15:17 ` Uros Bizjak 2023-10-18 16:03 ` Nadav Amit 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 15:17 UTC (permalink / raw) To: Nadav Amit Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 1954 bytes --] On Wed, Oct 18, 2023 at 4:46 PM Nadav Amit <namit@vmware.com> wrote: > >> Looks like another case of underspecified functionality where both > >> compilers differ. Luckily, both DTRT when aliases are hidden in > >> another TU. > > > > Attached is the prototype patch that works for me (together with > > Linus' FPU switching patch). > > In general looks good. See some minor issues below. > > > --- a/arch/x86/include/asm/current.h > > +++ b/arch/x86/include/asm/current.h > > @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64); > > > > DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot); > > > > +/* > > + * > > + */ > > Obviously some further comments to clarify why struct pcpu_hot is > defined in percpu-hot.c (the GCC manual says: "It is an error if > the alias target is not defined in the same translation unit as the > alias” which can be used as part of the explanation.) Sure. > > > +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, > > + const_pcpu_hot); > > + > > +#ifdef CONFIG_USE_X86_SEG_SUPPORT > > +static __always_inline struct task_struct *get_current(void) > > +{ > > + return const_pcpu_hot.current_task; > > +} > > +#else > > static __always_inline struct task_struct *get_current(void) > > { > > return this_cpu_read_stable(pcpu_hot.current_task); > > } > > +#endif > > > Please consider using IS_ENABLED() to avoid the ifdef’ry. 
>
> So this would turn out to be:
>
> static __always_inline struct task_struct *get_current(void)
> {
> 	if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
> 		return const_pcpu_hot.current_task;
>
> 	return this_cpu_read_stable(pcpu_hot.current_task);
> }

I am thinking more of moving the ifdeffery to percpu.h, something like the attached part of the patch. This would handle all current and future stable percpu variables.

Thanks,
Uros.

[-- Attachment #2: p.diff.txt --]
[-- Type: text/plain, Size: 1157 bytes --]

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 54746903b8c3..607f6b8e16ca 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -417,7 +417,6 @@ do { \
 #define this_cpu_read_stable_2(pcp)	percpu_stable_op(2, "mov", pcp)
 #define this_cpu_read_stable_4(pcp)	percpu_stable_op(4, "mov", pcp)
 #define this_cpu_read_stable_8(pcp)	percpu_stable_op(8, "mov", pcp)
-#define this_cpu_read_stable(pcp)	__pcpu_size_call_return(this_cpu_read_stable_, pcp)
 
 #ifdef CONFIG_USE_X86_SEG_SUPPORT
 
@@ -453,6 +452,8 @@ do { \
 #define this_cpu_write_8(pcp, val)	__raw_cpu_write(volatile, pcp, val)
 #endif
 
+#define this_cpu_read_stable(pcp)	const_##pcp
+
 #else /* CONFIG_USE_X86_SEG_SUPPORT */
 
 #define raw_cpu_read_1(pcp)		percpu_from_op(1, , "mov", pcp)
@@ -477,6 +478,9 @@ do { \
 #define this_cpu_write_8(pcp, val)	percpu_to_op(8, volatile, "mov", (pcp), val)
 #endif
 
+#define this_cpu_read_stable(pcp)	\
+	__pcpu_size_call_return(this_cpu_read_stable_, pcp)
+
 #endif /* CONFIG_USE_X86_SEG_SUPPORT */
 
 #define raw_cpu_add_1(pcp, val)	percpu_add_op(1, , (pcp), val)

^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 15:17 ` Uros Bizjak @ 2023-10-18 16:03 ` Nadav Amit 2023-10-18 16:26 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Nadav Amit @ 2023-10-18 16:03 UTC (permalink / raw) To: Uros Bizjak Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

> On Oct 18, 2023, at 6:17 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> I am thinking more of moving the ifdeffery to percpu.h, something like
> the attached part of the patch. This would handle all current and
> future stable percpu variables.

I think that for consistency this_cpu_read_stable() should always be an rvalue, so instead of:

> #define this_cpu_read_stable(pcp)	const_##pcp

You would use a statement expression:

#define this_cpu_read_stable(pcp)	({ const_##pcp; })

This would match the other (existing/fallback) definition of this_cpu_read_stable.

Having said that, I am not sure what other usages you have in mind. “current” is a pretty obvious, straightforward case with considerable impact on code generation. There may be additional variables, but it is likely that there would be more functions/TU in which they would not be constant and would require more refined techniques to avoid mistakes such as the use of stale cached values.

^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 16:03 ` Nadav Amit @ 2023-10-18 16:26 ` Linus Torvalds 2023-10-18 17:23 ` Uros Bizjak 2023-10-18 18:08 ` Uros Bizjak 0 siblings, 2 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 16:26 UTC (permalink / raw) To: Nadav Amit Cc: Uros Bizjak, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 09:03, Nadav Amit <namit@vmware.com> wrote: > > Having said that, I am not sure what other usages you have in mind. > “current” is a pretty obvious straight forward case with considerable > impact on code generation. There may be additional variables, but it is > likely that there would be more functions/TU in which they would not be > constant and would require more refined techniques to avoid mistakes > such as the use of stale cached values. Yeah, I don't think there really are other cases. We do have things that could be considered stable (like "smp_processor_id()" which is stable as long as preemption or migration is disabled (or it's in an irq-off section). And it might be lovely to optimize those too, *BUT* that would require that there be a barrier against that optimization that works. And if there is anything that this thread has made clear, it's that the whole 'load from a constant section' doesn't seem to have any sane barriers. So while the CSE for inline asm statements is a bit too weak with that whole "only CSE within a basic block" thing, the CSE of "load a constant value from memory" is too *strong*, in that we don't seem to have _any_ sane way to say "now you need to reload". The traditional way we've done that is with our "barrier()" macro, which does the whole inline asm with a memory clobber, but even that doesn't act as a barrier for gcc optimizing the constant load. 
Which means that while we'd probably love for the compiler to optimize smp_processor_id() a bit more, we can't use the 'stable memory location' trick for it.

Because I can't think of anything but 'current' that would be _that_ stable as far as C code is concerned.

Linus

^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 16:26 ` Linus Torvalds @ 2023-10-18 17:23 ` Uros Bizjak 2023-10-18 18:11 ` Linus Torvalds 2023-10-18 18:08 ` Uros Bizjak 1 sibling, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 17:23 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 6:26 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Oct 2023 at 09:03, Nadav Amit <namit@vmware.com> wrote: > > > > Having said that, I am not sure what other usages you have in mind. > > “current” is a pretty obvious straight forward case with considerable > > impact on code generation. There may be additional variables, but it is > > likely that there would be more functions/TU in which they would not be > > constant and would require more refined techniques to avoid mistakes > > such as the use of stale cached values. > > Yeah, I don't think there really are other cases. In processor.h, we have: static __always_inline unsigned long current_top_of_stack(void) { /* * We can't read directly from tss.sp0: sp0 on x86_32 is special in * and around vm86 mode and sp0 on x86_64 is special because of the * entry trampoline. */ return this_cpu_read_stable(pcpu_hot.top_of_stack); } But I don't know how much it is used. Uros. > We do have things that could be considered stable (like > "smp_processor_id()" which is stable as long as preemption or > migration is disabled (or it's in an irq-off section). > > And it might be lovely to optimize those too, *BUT* that would require > that there be a barrier against that optimization that works. > > And if there is anything that this thread has made clear, it's that > the whole 'load from a constant section' doesn't seem to have any sane > barriers. 
>
> So while the CSE for inline asm statements is a bit too weak with that
> whole "only CSE within a basic block" thing, the CSE of "load a
> constant value from memory" is too *strong*, in that we don't seem to
> have _any_ sane way to say "now you need to reload".
>
> The traditional way we've done that is with our "barrier()" macro,
> which does the whole inline asm with a memory clobber, but even that
> doesn't act as a barrier for gcc optimizing the constant load.
>
> Which means that while we'd probably love for the compiler to optimize
> smp_processor_id() a bit more, we can't use the 'stable memory
> location' trick for it.
>
> Because I can't think of anything but 'current' that would be _that_
> stable as far as C code is concerned.
>
> Linus

^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 17:23 ` Uros Bizjak @ 2023-10-18 18:11 ` Linus Torvalds 0 siblings, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 18:11 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 10:23, Uros Bizjak <ubizjak@gmail.com> wrote: > > In processor.h, we have: > > static __always_inline unsigned long current_top_of_stack(void) Yeah, but that is never used multiple times afaik. I think it's purely for things like WARN_ON_ONCE(!on_thread_stack()); in the entry code, for example. So I guess it can use the same infrastructure, but I doubt it matters in any practical way. Grepping around for it, it looks like the 32-bit code has some stale commentary: * Reload esp0 and pcpu_hot.top_of_stack. This changes * current_thread_info(). but that seems entirely bogus. We historically picked up current_thread_info() from %esp, but that hasn't been true in ages, afaik. Now it's all based on 'current'. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 16:26 ` Linus Torvalds 2023-10-18 17:23 ` Uros Bizjak @ 2023-10-18 18:08 ` Uros Bizjak 2023-10-18 18:15 ` Linus Torvalds 2023-10-18 18:29 ` Nadav Amit 1 sibling, 2 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 18:08 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Wed, Oct 18, 2023 at 6:26 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, 18 Oct 2023 at 09:03, Nadav Amit <namit@vmware.com> wrote:
> >
> > Having said that, I am not sure what other usages you have in mind.
> > “current” is a pretty obvious, straightforward case with considerable
> > impact on code generation. There may be additional variables, but it is
> > likely that there would be more functions/TU in which they would not be
> > constant and would require more refined techniques to avoid mistakes
> > such as the use of stale cached values.
>
> Yeah, I don't think there really are other cases.
>
> We do have things that could be considered stable (like
> "smp_processor_id()" which is stable as long as preemption or
> migration is disabled (or it's in an irq-off section).
>
> And it might be lovely to optimize those too, *BUT* that would require
> that there be a barrier against that optimization that works.

But loads from non-const memory work like the above. Please consider:

--cut here--
extern __seg_gs int m;

int foo (void)
{
  int r;

  r = m;
  r += m;

  asm volatile ("" ::: "memory");

  r += m;

  return r;
}

int bar (void)
{
  int r;

  r = m;
  r += m;
  r += m;

  return r;
}
--cut here--

gcc -O2:

foo:
	movl	%gs:m(%rip), %eax
	addl	%eax, %eax
	addl	%gs:m(%rip), %eax
	ret

bar:
	movl	%gs:m(%rip), %eax
	leal	(%rax,%rax,2), %eax
	ret

Please note the __barrier(), implemented with asm volatile.
> And if there is anything that this thread has made clear, it's that
> the whole 'load from a constant section' doesn't seem to have any sane
> barriers.
>
> So while the CSE for inline asm statements is a bit too weak with that
> whole "only CSE within a basic block" thing, the CSE of "load a
> constant value from memory" is too *strong*, in that we don't seem to
> have _any_ sane way to say "now you need to reload".

We can use an alias to __seg_gs non-const memory, so the value can be accessed without asm. __barrier() will then force a reload. Please note that any memory clobber hidden inside asm will also force a reload.

> The traditional way we've done that is with our "barrier()" macro,
> which does the whole inline asm with a memory clobber, but even that
> doesn't act as a barrier for gcc optimizing the constant load.
>
> Which means that while we'd probably love for the compiler to optimize
> smp_processor_id() a bit more, we can't use the 'stable memory
> location' trick for it.

We should get rid of asm statements, and as shown above, __barrier() will do the trick.

> Because I can't think of anything but 'current' that would be _that_
> stable as far as C code is concerned.

Uros.

^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 18:08 ` Uros Bizjak @ 2023-10-18 18:15 ` Linus Torvalds 2023-10-18 18:26 ` Uros Bizjak 2023-10-18 18:29 ` Nadav Amit 1 sibling, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 18:15 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <ubizjak@gmail.com> wrote: > > But loads from non-const memory work like the above. Yes, I'm certainly ok with the move to use plain loads from __seg_gs for the percpu accesses. If they didn't honor the memory clobber, we could never use it at all. I was just saying that the 'const' alias trick isn't useful for anything else than 'current', because everything else needs to at least honor our existing barriers. (And yes, there's the other user of this_cpu_read_stable() - 'top_of_stack', but as mentioned that doesn't really matter). Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 18:15 ` Linus Torvalds @ 2023-10-18 18:26 ` Uros Bizjak 2023-10-18 19:33 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 18:26 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Wed, Oct 18, 2023 at 8:16 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > But loads from non-const memory work like the above.
>
> Yes, I'm certainly ok with the move to use plain loads from __seg_gs
> for the percpu accesses. If they didn't honor the memory clobber, we
> could never use it at all.
>
> I was just saying that the 'const' alias trick isn't useful for
> anything else than 'current', because everything else needs to at
> least honor our existing barriers.

FYI, smp_processor_id() is implemented as:

#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)

where __this_* forces a volatile access, which disables CSE.

*If* the variable is really stable, then it should use __raw_cpu_read. Both __raw_* and __this_* were recently (tip/percpu branch) implemented for SEG_SUPPORT as:

#define __raw_cpu_read(qual, pcp)					\
({									\
	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));		\
})

where "qual" can be volatile. To enable the smp_processor_id() optimization, it just needs to be moved from the __this to the __raw accessor.

> (And yes, there's the other user of this_cpu_read_stable() -
> 'top_of_stack', but as mentioned that doesn't really matter).

Uros.

^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 18:26 ` Uros Bizjak @ 2023-10-18 19:33 ` Uros Bizjak 2023-10-18 20:17 ` Nadav Amit 2023-10-18 20:22 ` Linus Torvalds 0 siblings, 2 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 19:33 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Wed, Oct 18, 2023 at 8:26 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Oct 18, 2023 at 8:16 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <ubizjak@gmail.com> wrote:
> > >
> > > But loads from non-const memory work like the above.
> >
> > Yes, I'm certainly ok with the move to use plain loads from __seg_gs
> > for the percpu accesses. If they didn't honor the memory clobber, we
> > could never use it at all.
> >
> > I was just saying that the 'const' alias trick isn't useful for
> > anything else than 'current', because everything else needs to at
> > least honor our existing barriers.
>
> FYI, smp_processor_id() is implemented as:
>
> #define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
>
> where __this_* forces volatile access which disables CSE.
>
> *If* the variable is really stable, then it should use __raw_cpu_read.
> Both, __raw_* and __this_* were recently (tip/percpu branch)
> implemented for SEG_SUPPORT as:

This patch works for me:

--cut here--
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4fab2ed454f3..6eda4748bf64 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -141,8 +141,7 @@ __visible void smp_call_function_single_interrupt(struct pt_regs *r);
 * This function is needed by all SMP systems. It must _always_ be valid
 * from the initial startup.
 */
-#define raw_smp_processor_id()  this_cpu_read(pcpu_hot.cpu_number)
-#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
+#define raw_smp_processor_id()  raw_cpu_read(pcpu_hot.cpu_number)
 
 #ifdef CONFIG_X86_32
 extern int safe_smp_processor_id(void);
--cut here--

But it removes merely 10 reads out of 3219.

BTW: I also don't understand the comment from include/linux/smp.h:

/*
 * Allow the architecture to differentiate between a stable and unstable read.
 * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
 * regular asm read for the stable.
 */
#ifndef __smp_processor_id
#define __smp_processor_id(x) raw_smp_processor_id(x)
#endif

All reads up to word size on x86 are atomic, so IRQ safe. asm-volatile is not some IRQ property, but it prevents the compiler from CSEing the asm and from scheduling (moving) the asm around too much.

Uros.

^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 19:33 ` Uros Bizjak @ 2023-10-18 20:17 ` Nadav Amit 2023-10-18 20:22 ` Linus Torvalds 1 sibling, 0 replies; 106+ messages in thread From: Nadav Amit @ 2023-10-18 20:17 UTC (permalink / raw) To: Uros Bizjak Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

> On Oct 18, 2023, at 10:33 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> This patch works for me:
>
> --cut here--
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 4fab2ed454f3..6eda4748bf64 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -141,8 +141,7 @@ __visible void
> smp_call_function_single_interrupt(struct pt_regs *r);
> * This function is needed by all SMP systems. It must _always_ be valid
> * from the initial startup.
> */
> -#define raw_smp_processor_id()  this_cpu_read(pcpu_hot.cpu_number)
> -#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
> +#define raw_smp_processor_id()  raw_cpu_read(pcpu_hot.cpu_number)

I don’t think that’s correct. IIUC, although __smp_processor_id() can be changed to read pcpu_hot.cpu_number through raw_cpu_read(), raw_smp_processor_id() should not be changed in this manner.

raw_smp_processor_id() does not assume that preemption is disabled or that the task cannot migrate to another core. So the “volatile” keyword and inline assembly are used to ensure the ordering and to ensure that the read is performed without some compiler optimization that reads the value multiple times or tears the reads.

In contrast, raw_cpu_read() does not use the volatile keyword, so it does not provide the same guarantees.

^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 19:33 ` Uros Bizjak 2023-10-18 20:17 ` Nadav Amit @ 2023-10-18 20:22 ` Linus Torvalds 2023-10-18 20:34 ` Linus Torvalds ` (2 more replies) 1 sibling, 3 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 20:22 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <ubizjak@gmail.com> wrote: > > This pach works for me: Looks fine. But you actually bring up another issue: > BTW: I also don't understand the comment from include/linux/smp.h: > > /* > * Allow the architecture to differentiate between a stable and unstable read. > * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a > * regular asm read for the stable. I think the comment is badly worded, but I think the issue may actually be real. One word: rematerialization. The thing is, turning inline asm accesses to regular compiler loads has a *very* bad semantic problem: the compiler may now feel like it can not only combine the loads (ok), but also possibly rematerialize values by re-doing the loads (NOT OK!). IOW, the kernel often has very strict requirements of "at most once" behavior, because doing two loads might give different results. The cpu number is a good example of this. And yes, sometimes we use actual volatile accesses for them (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general, and are much too strict. Not only does gcc generally lose its mind when it sees volatile (ie it stops doing various sane combinations that would actually be perfectly valid), but it obviously also stops doing CSE on the loads (as it has to). 
So the "non-volatile asm" has been a great way to get the "at most one" behavior: it's safe wrt interrupts changing the value, because you will see *one* value, not two.

As far as we know, gcc never rematerializes the output of an inline asm. So when you use an inline asm, you may have the result CSE'd, but you'll never see it generate more than *one* copy of the inline asm.

(Of course, as with so much about inline asm, that "knowledge" is not necessarily explicitly spelled out anywhere, and it's just "that's how it has always worked").

IOW, look at code like the one in swiotlb_pool_find_slots(), which does this:

	int start = raw_smp_processor_id() & (pool->nareas - 1);

and the use of 'start' really is meant to be just a good heuristic, in that different concurrent CPU's will start looking in different pools. So that code is basically "cpu-local by default", but it's purely about locality, it's not some kind of correctness issue, and it's not necessarily run when the code is *tied* to a particular CPU.

But what *is* important is that 'start' have *one* value, and one value only. So look at that loop, which basically does

	do {
		.. use the 'i' based on 'start' ..
		if (++i >= pool->nareas)
			i = 0;
	} while (i != start);

and it is very important indeed that the compiler does *not* think "Oh, I can rematerialize the 'start' value".

See what I'm saying? Using 'volatile' for loading the current CPU value would be bad for performance for no good reason. But loading it multiple times would be a *bug*.

Using inline asm is basically perfect here: the compiler can *combine* two inline asms into one, but once we have a value for 'start', it won't change, because the compiler is not going to decide "I can drop this value, and just re-do the inline asm to rematerialize it".

This all makes me worried about the __seg_gs thing.

For 'current', this is all perfect. Rematerializing current is actually better than spilling and reloading the value.
But for something like raw_smp_processor_id(), rematerializing would be a correctness problem, and a really horrible one (because in practice, the code would work 99.9999% of the time, and then once in a blue moon, it would rematerialize a different value). See the problem? I guess we could use the stdatomics to try to explain these issues to the compiler, but I don't even know what the C interfaces look like or whether they are stable and usable across the range of compilers we use. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 20:22 ` Linus Torvalds @ 2023-10-18 20:34 ` Linus Torvalds 2023-10-18 20:51 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 20:34 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers

On Wed, 18 Oct 2023 at 13:22, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> And yes, sometimes we use actual volatile accesses for them
> (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> and are much too strict. Not only does gcc generally lose its mind
> when it sees volatile (ie it stops doing various sane combinations
> that would actually be perfectly valid), but it obviously also stops
> doing CSE on the loads (as it has to).

Note, in case you wonder what I mean by "lose its mind", try this (extremely stupid) test program:

	void a(volatile int *i) { ++*i; }
	void b(int *i) { ++*i; }

and note that the non-volatile version does

	addl $1, (%rdi)

but the volatile version then refuses to combine the read+write into an rmw instruction, and generates

	movl (%rdi), %eax
	addl $1, %eax
	movl %eax, (%rdi)

instead.

Sure, it's correct, but it's an example of how 'volatile' ends up disabling a lot of other optimizations than just the "don't remove the access".

Doing the volatile as one rmw instruction would still have been very obviously valid - it's still doing a read and a write. You don't need two instructions for that.

I'm not complaining, and I understand *why* it happens - compiler writers very understandably go "oh, I'm not touching that".

I'm just trying to point out that volatile really screws up code generation even aside from the "access _exactly_ once" issue.
So using inline asm and relying on gcc doing (minimal) CSE will then generate better code than volatile ever could, even when we just use a simple 'mov" instruction. At least you get that basic combining effect, even if it's not great. And for memory ops, *not* using volatile is dangerous when they aren't stable. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 20:34 ` Linus Torvalds @ 2023-10-18 20:51 ` Uros Bizjak 2023-10-18 21:09 ` Uros Bizjak 2023-10-18 21:10 ` Linus Torvalds 0 siblings, 2 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 20:51 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 10:34 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Oct 2023 at 13:22, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > And yes, sometimes we use actual volatile accesses for them > > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general, > > and are much too strict. Not only does gcc generally lose its mind > > when it sees volatile (ie it stops doing various sane combinations > > that would actually be perfectly valid), but it obviously also stops > > doing CSE on the loads (as it has to). > > Note, in case you wonder what I mean by "lose its mind", try this > (extremely stupid) test program: > > void a(volatile int *i) { ++*i; } > void b(int *i) { ++*i; } > > and note that the non-volatile version does > > addl $1, (%rdi) > > but the volatile version then refuses to combine the read+write into a > rmw instruction, and generates > > movl (%rdi), %eax > addl $1, %eax > movl %eax, (%rdi) > > instead. > > Sure, it's correct, but it's an example of how 'volatile' ends up > disabling a lot of other optimizations than just the "don't remove the > access". > > Doing the volatile as one rmw instruction would still have been very > obviously valid - it's still doing a read and a write. You don't need > two instructions for that. FYI: This is the reason RMW instructions in percpu.h are not (blindly) converted to C ops. 
They will remain in their (volatile or not) asm form because of the above reason, and due to the fact that they don't combine with anything. > I'm not complaining, and I understand *why* it happens - compiler > writers very understandably go "oh, I'm not touching that". > > I'm just trying to point out that volatile really screws up code > generation even aside from the "access _exactly_ once" issue. > > So using inline asm and relying on gcc doing (minimal) CSE will then > generate better code than volatile ever could, even when we just use a > simple 'mov" instruction. At least you get that basic combining > effect, even if it's not great. Actually, RMW insns are better written in asm, while simple "mov" should be converted to (volatile or not) memory access. On x86 "mov"s from memory (reads) will combine nicely with almost all other instructions. > And for memory ops, *not* using volatile is dangerous when they aren't stable. True. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 20:51 ` Uros Bizjak @ 2023-10-18 21:09 ` Uros Bizjak 2023-10-18 21:10 ` Linus Torvalds 1 sibling, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 21:09 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 10:51 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Wed, Oct 18, 2023 at 10:34 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Wed, 18 Oct 2023 at 13:22, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > And yes, sometimes we use actual volatile accesses for them > > > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general, > > > and are much too strict. Not only does gcc generally lose its mind > > > when it sees volatile (ie it stops doing various sane combinations > > > that would actually be perfectly valid), but it obviously also stops > > > doing CSE on the loads (as it has to). > > > > Note, in case you wonder what I mean by "lose its mind", try this > > (extremely stupid) test program: > > > > void a(volatile int *i) { ++*i; } > > void b(int *i) { ++*i; } > > > > and note that the non-volatile version does > > > > addl $1, (%rdi) > > > > but the volatile version then refuses to combine the read+write into a > > rmw instruction, and generates > > > > movl (%rdi), %eax > > addl $1, %eax > > movl %eax, (%rdi) > > > > instead. > > > > Sure, it's correct, but it's an example of how 'volatile' ends up > > disabling a lot of other optimizations than just the "don't remove the > > access". > > > > Doing the volatile as one rmw instruction would still have been very > > obviously valid - it's still doing a read and a write. You don't need > > two instructions for that. 
> > FYI: This is the reason RMW instructions in percpu.h are not (blindly) > converted to C ops. They will remain in their (volatile or not) asm > form because of the above reason, and due to the fact that they don't > combine with anything. > > > I'm not complaining, and I understand *why* it happens - compiler > > writers very understandably go "oh, I'm not touching that". > > > > I'm just trying to point out that volatile really screws up code > > generation even aside from the "access _exactly_ once" issue. > > > > So using inline asm and relying on gcc doing (minimal) CSE will then > > generate better code than volatile ever could, even when we just use a > > simple 'mov" instruction. At least you get that basic combining > > effect, even if it's not great. > > Actually, RMW insns are better written in asm, while simple "mov" > should be converted to (volatile or not) memory access. On x86 "mov"s > from memory (reads) will combine nicely with almost all other > instructions. BTW: There was a discussion that GCC should construct RMW instructions also when the memory location is marked volatile, but there was no resolution reached. So, the "I'm not touching that" approach remains. However, GCC *will* combine a volatile read with a follow-up instruction. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 20:51 ` Uros Bizjak 2023-10-18 21:09 ` Uros Bizjak @ 2023-10-18 21:10 ` Linus Torvalds 2023-10-18 21:40 ` Uros Bizjak 1 sibling, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 21:10 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 13:52, Uros Bizjak <ubizjak@gmail.com> wrote: > > FYI: This is the reason RMW instructions in percpu.h are not (blindly) > converted to C ops. They will remain in their (volatile or not) asm > form because of the above reason, and due to the fact that they don't > combine with anything. Well, also because converting them to C would be HORRIBLY BUGGY. They absolutely have to be a single instruction. We have architectures that can't do rmw instructions, and then they have to do lots of extra horrid crud (disable interrupts or whatever) to make a percpu 'add' be a valid thing. > Actually, RMW insns are better written in asm, while simple "mov" > should be converted to (volatile or not) memory access. No. This remat issue has convinced me that the *only* thing that should be converted to a memory access is the "stable" case (which in practice is mainly just 'current'). Because if you make them 'volatile' memory instructions, then the simple "mov" inline asm is simply better. It still allows CSE on the asm (in the "raw" form). And if you make them memory instructions _without_ the 'volatile', the memory access is simply buggy until we have some 'nomaterialize' model. So the *only* situation where a memory access is better is that 'stable' case. In all other cases they are the same or strictly worse than 'asm'. Linus
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 21:10 ` Linus Torvalds @ 2023-10-18 21:40 ` Uros Bizjak 2023-10-18 22:40 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 21:40 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 11:11 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Oct 2023 at 13:52, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > FYI: This is the reason RMW instructions in percpu.h are not (blindly) > > converted to C ops. They will remain in their (volatile or not) asm > > form because of the above reason, and due to the fact that they don't > > combine with anything. > > Well, also because converting them to C would be HORRIBLY BUGGY. > > They absolutely have to be a single instruction. We have architectures > that can't do rmw instructions, and then they have to do lots of extra > horrid crud (disable interrupts or whatever) to make a percpu 'add' be > a valid thing. > > > > Actually, RMW insns are better written in asm, while simple "mov" > > should be converted to (volatile or not) memory access. > > No. > > This remat issue has convinced me that the *only* thing that should be > converted to a memory access is the "stable" case (which in practice > is mainly just 'current'). > > Because if you make them 'volatile' memory instructions, then the > simple "mov" inline asm is simply better. It still allows CSE on the > asm (in the "raw" form). The ones in "raw" form are not IRQ safe and these are implemented without a volatile qualifier. The safe variants are the ones in "this" form. These were implemented as volatile-asm and are now implemented as volatile reads. They do not rematerialize; the number of memory accesses stays the same.
They do not CSE (volatile-asm also doesn't), but they can propagate into follow-up instructions. > And if you make them memory instructions _without_ the 'volatile', the > memory access is simply buggy until we have some 'nomaterialize' > model. This is the reason that almost all percpu access is implemented using this_* accessors. raw_* is a relaxed version without IRQ guarantees that should be (and is) used in a controlled manner in a special places: https://elixir.bootlin.com/linux/latest/A/ident/this_cpu_read https://elixir.bootlin.com/linux/latest/A/ident/raw_cpu_read > > So the *only* situation where a memory access is better is that > 'stable' case. In all other cases they are the same or strictly worse > than 'asm'. No, argument propagation is non-existent with "asm" version. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 21:40 ` Uros Bizjak @ 2023-10-18 22:40 ` Linus Torvalds 2023-10-18 23:06 ` Linus Torvalds ` (4 more replies) 0 siblings, 5 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 22:40 UTC (permalink / raw) To: Uros Bizjak, peterz Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <ubizjak@gmail.com> wrote: > > The ones in "raw" form are not IRQ safe and these are implemented > without a volatile qualifier. You are misreading it. Both *are* irq safe - on x86. The difference between "this_cpu_xyz()" and "raw_cpu_xyz()" is that on *other* architectures, "raw_cpu_xyz()" can be a lot more efficient, because other architectures may need to do extra work to make the "this" version be atomic on a particular CPU. See for example __count_vm_event() vs count_vm_event(). In fact, that particular use isn't even in an interrupt-safe context, that's an example of literally "I'd rather be fast than correct for certain statistics that aren't all that important". The two versions generate the same code on x86, but on other architectures, __count_vm_event() can be *much* simpler and faster because it doesn't disable interrupts or do other special things. But on x86, the whole "interrupt safety" is a complete red herring. Both of them generate the exact same instruction. On x86, the "volatile" is actually for a completely different reason: to avoid too much CSE by the compiler. See commit b59167ac7baf ("x86/percpu: Fix this_cpu_read()"). In fact, that commit went overboard, and just added "volatile" to *every* percpu read. So then people complained about *that*, and PeterZ did commit 0b9ccc0a9b14 ("x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()"), which basically made that "qual or not" be a macro choice.
And in the process, it now got added to all the RMW ops, that didn't actually need it or want it in the first place, since they won't be CSE'd, since they depend on the input. So that commit basically generalized the whole thing entirely pointlessly, and caused your current confusion. End result: we should remove 'volatile' from the RMW ops. It doesn't do anything on x86. All it does is make us have two subtly different versions that we don't care about the difference. End result two: we should make it clear that "this_cpu_read()" vs "raw_cpu_read()" are *NOT* about interrupts. Even on architectures where the RMW ops need to have irq protection (so that they are atomic wrt interrupts also modifying the value), the *READ* operation obviously has no such issue. For the raw_cpu_read() vs this_cpu_read() case, the only issue is whether you can CSE the result. And in 99% of all cases, you can - and want to - CSE it. But as that commit b59167ac7baf shows, sometimes you cannot. Side note: the code that caused that problem is this: __always_inline void __cyc2ns_read(struct cyc2ns_data *data) { int seq, idx; do { seq = this_cpu_read(cyc2ns.seq.seqcount.sequence); ... } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence))); } where the issue is that the this_cpu_read() of that sequence number needs to be ordered. Honestly, that code is just buggy and bad. We should never have "fixed" it by changing the semantics of this_cpu_read() in the first place. The problem is that it re-implements its own locking model, and as so often happens when people do that, they do it completely wrongly. Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice how in raw_read_seqcount_begin() we have unsigned _seq = __read_seqcount_begin(s); smp_rmb(); because it actually does the proper barriers. Notice how the garbage code in __cyc2ns_read() doesn't have them - and how it was buggy as a result. 
(Also notice how this all predates our "we should use load_acquire() instead of smp_rmb()", but whatever). IOW, all the "volatiles" in the x86 <asm/percpu.h> file are LITERAL GARBAGE and should not exist, and are due to a historical mistake. Linus
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 22:40 ` Linus Torvalds @ 2023-10-18 23:06 ` Linus Torvalds 2023-10-19 7:04 ` Uros Bizjak ` (3 subsequent siblings) 4 siblings, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 23:06 UTC (permalink / raw) To: Uros Bizjak, peterz Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 15:40, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > See for example __count_vm_event() vs count_vm_event(). > > In fact, that particular use isn't even in an interrupt-safe context, > that's an example of literally "I'd rather be fast than correct for > certain statistics that aren't all that important". .. just to clarify - I don't think the VM statistics code is even updated from interrupts, but it is still incorrect to do "raw_cpu_add()" even in just process context, because on architectures where it results in separate load-op-store instructions, you can get preempted in the middle, and now your loaded value is some old stale one. So when you get back, somebody else might have updated the count, but you'll still end up doing the store using the stale value. For VM statistics like the BALLOON_MIGRATE, nobody cares. The stats may be incorrect, but they aren't a correctness issue, and they'll be in the right ballpark because the race is not generally hit. So "interrupt safe" here is not necessarily about actual interrupts themselves directly. You *can* have that too, but it can also be about just an interrupt causing preemption. Anyway, again, none of this is relevant on x86, since the single-instruction rmw percpu sequence is obviously non-interruptible. The one oddity on x86 is that 'xchg' always has an implied lock, so there we *do* have a multi-instruction sequence.
And then - and *ONLY* then - the raw-vs-this matters even on x86: "raw" just does a "load-store" pair, while "this" does a cmpxchg loop (the latter of which is safe for both irq use and preemption because the cmpxchg obviously re-checks the original value). But even in that xchg case, the "volatile" part of the asm is a complete red herring and shouldn't exist. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 22:40 ` Linus Torvalds 2023-10-18 23:06 ` Linus Torvalds @ 2023-10-19 7:04 ` Uros Bizjak 2023-10-19 16:59 ` Linus Torvalds 2023-10-19 8:44 ` Peter Zijlstra ` (2 subsequent siblings) 4 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-19 7:04 UTC (permalink / raw) To: Linus Torvalds Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 19, 2023 at 12:40 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > The ones in "raw" form are not IRQ safe and these are implemented > > without volatile qualifier. > > You are misreading it. > > Both *are* irq safe - on x86. > > The difference between "this_cpu_xyz()" and "raw_cpu_xyz()" is that on > *other* architectures, "raw_cpu_xyz():" can be a lot more efficient, > because other architectures may need to do extra work to make the > "this" version be atomic on a particular CPU. > > See for example __count_vm_event() vs count_vm_event(). > > In fact, that particular use isn't even in an interrupt-safe context, > that's an example of literally "I'd rather be fast that correct for > certain statistics that aren't all that important". > > They two versions generate the same code on x86, but on other > architectures, __count_vm_event() can *much* simpler and faster > because it doesn't disable interrupts or do other special things. > > But on x86, the whole "interrupt safety" is a complete red herring. > Both of them generate the exact same instruction. > > On x86, the "volatile" is actually for a completely different reason: > to avoid too much CSE by the compiler. Let me explain how the compiler handles volatile. 
Basically, it disables most of the optimizations when volatile is encountered, be it on volatile asm or volatile memory access. It is important that argument propagation is NOT disabled with volatile memory, so the compiler can still propagate arguments into the following instruction, as long as no memory access is crossed. This is all that [1] implements, while keeping volatile around. [1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/percpu&id=ca4256348660cb2162668ec3d13d1f921d05374a Remark: You raised a question on why "testl $0xf0000,%gs:0x0(%rip)" was not optimized to testb - because of the volatile access on memory, the compiler won't shrink the memory read. The compiler won't CSE volatile-asm or volatile reads, so the number of memory accesses stays the same. This is an important property, promised by the this_* API, and this_* should be used almost everywhere because of this property. raw_* is a relaxed version that allows CSE of asm and memory accesses. The number of memory accesses can change, and additionally, it *can* rematerialize arguments from memory [2]. [2] If there is an instruction that uses all registers in between the load from memory and the insn that uses the loaded value, then the compiler will try to accommodate the instruction by pushing all active registers to the stack (the stack is owned by the compiler). Instead of pushing and popping the register with a loaded value, it can read it from the original non-volatile location as well. If the location is qualified as volatile, this optimization is off. This is the rematerialization that you are worrying about. As well as CSE, it can be allowed under special conditions, in a section where it is *guaranteed* that the value in memory won't change in the whole section. This can be guaranteed in a (kind of) critical section (e.g. when interrupts are disabled), otherwise the rematerialized value can differ, because some interrupt handler changed the value in memory.
Also, the opposite is true: we will miss the changed value when the read is CSE'd. That means that even when the instruction is the same (and IRQ safe), "volatile" inhibits unwanted optimisations, and should be there because of the compiler, to guarantee the access to memory. Since RMW asms are accessing memory, they should be volatile, too, for non-relaxed versions. Following up on the raw_* issue: These are relaxed versions that require stable operands, in a section marked with some barriers (so, a critical section). Those barriers (e.g. a volatile asm memory clobber) will prevent scheduling of memory accesses outside this critical section, and memory operand stability will be achieved by e.g. disabling interrupts. Since the user *guarantees* memory stability, the compiler can relax its CSE and rematerialization requirement *for this section*. Accessors that are implemented without "volatile" communicate this relaxation to the compiler. The above also means that when raw_* versions are used on non-stable operands, it is the user's fault for using them in the first place, not the compiler's. raw_ versions should be used with great care. To illustrate the above, let's look at the "xchg" RMW implementation. The this_* one is interrupt safe in the sense that it is implemented with a cmpxchg operation that is atomic w.r.t. interrupts (that can change the contents of the memory) [note, this is not "locked", and does not have the full atomic property], but raw_* is implemented simply by raw_ reads and writes. In the raw_ case, we don't care what happens with memory accesses, because they are required to be stable in the section where raw_xchg is used. Then we have this_cpu_read_stable that has even stricter requirements on operand stability. It can be considered constant memory, and now we have the means to communicate this to the compiler. [Side note: this_cpu_read_stable is a misnomer.
It should be named raw_cpu_read_stable to reflect the operand stability required by raw_ versions. From the compiler standpoint, "volatiles" are not optional; they communicate the operand stability state to the compiler. I have reviewed percpu.h many times, and the current state of operations is correct, even if they result in the same instruction. BTW: About that raw_smp_processor_id(): The raw_ version should stay defined as this_cpu_read(pcpu_hot.cpu_number), while __smp_processor_id() can be redefined to raw_cpu_read. Please note the usage in include/linux/smp.h:

#define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
#define put_cpu() preempt_enable()

where preempt_disable and preempt_enable mark the boundaries of the critical section. Uros. > See commit b59167ac7baf ("x86/percpu: Fix this_cpu_read()"). > > In fact, that commit went overboard, and just added "volatile" to > *every* percpu read. > > So then people complained about *that*, and PeterZ did commit > 0b9ccc0a9b14 ("x86/percpu: Differentiate this_cpu_{}() and > __this_cpu_{}()"), which basically made that "qual or not" be a macro > choice. > > And in the process, it now got added to all the RMW ops, that didn't > actually need it or want it in the first place, since they won't be > CSE'd, since they depend on the input. > > So that commit basically generalized the whole thing entirely > pointlessly, and caused your current confusion. > > End result: we should remove 'volatile' from the RMW ops. It doesn't > do anything on x86. All it does is make us have two subtly different > versions that we don't care about the difference. > > End result two: we should make it clear that "this_cpu_read()" vs > "raw_cpu_read()" are *NOT* about interrupts. Even on architectures > where the RMW ops need to have irq protection (so that they are atomic > wrt interrupts also modifying the value), the *READ* operation > obviously has no such issue.
> > For the raw_cpu_read() vs this_cpu_read() case, the only issue is > whether you can CSE the result. > > And in 99% of all cases, you can - and want to - CSE it. But as that > commit b59167ac7baf shows, sometimes you cannot. > > Side note: the code that caused that problem is this: > > __always_inline void __cyc2ns_read(struct cyc2ns_data *data) > { > int seq, idx; > > do { > seq = this_cpu_read(cyc2ns.seq.seqcount.sequence); > ... > } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence))); > } > > where the issue is that the this_cpu_read() of that sequence number > needs to be ordered. > > Honestly, that code is just buggy and bad. We should never have > "fixed" it by changing the semantics of this_cpu_read() in the first > place. > > The problem is that it re-implements its own locking model, and as so > often happens when people do that, they do it completely wrongly. > > Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice > how in raw_read_seqcount_begin() we have > > unsigned _seq = __read_seqcount_begin(s); > smp_rmb(); > > because it actually does the proper barriers. Notice how the garbage > code in __cyc2ns_read() doesn't have them - and how it was buggy as a > result. > > (Also notice how this all predates our "we should use load_acquire() > instead of smb_rmb()", but whatever). > > IOW, all the "volatiles" in the x86 <asm/percpu.h> file are LITERAL > GARBAGE and should not exist, and are due to a historical mistake. > > Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 7:04 ` Uros Bizjak @ 2023-10-19 16:59 ` Linus Torvalds 2023-10-19 17:21 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 16:59 UTC (permalink / raw) To: Uros Bizjak Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 00:04, Uros Bizjak <ubizjak@gmail.com> wrote: > > Let me explain how the compiler handles volatile. We're talking past each other. You are talking about the volatile *memory* ops, and the difference that "raw" vs "this" would cause with and without the "volatile". While *I* am now convinced that the memory ops aren't even an option, because they will generate worse code, because pretty much all users use the "this" version (which would have to use volatile). Because if we just stick with inline asms, the need for "volatile" simply goes away. The existing volatile on those percpu inline asms is *wrong*. It's a historical mistake. And with just a plain non-volatile inline asm, the inline asm wins. It doesn't have the (bad) read-once behavior of a volatile memory op. And it also doesn't have the (horrible correctness issue) rematerialization behavior of a non-volatile memory op. A compiler that were to rematerialize an inline asm (instead of spilling) would be a bad joke. That's not an optimization, that's just a crazy bad compiler with a code generation bug. Linus
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 16:59 ` Linus Torvalds @ 2023-10-19 17:21 ` Uros Bizjak 2023-10-19 18:06 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-19 17:21 UTC (permalink / raw) To: Linus Torvalds Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 19, 2023 at 7:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 19 Oct 2023 at 00:04, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > Let me explain how the compiler handles volatile. > > We're talking past each other. > > You are talking about the volatile *memory* ops, and the the > difference that "raw" vs "this" would cause with and without the > "volatile". > > While *I* am now convinced that the memory ops aren't even an option, > because they will generate worse code, because pretty much all users > use the "this" version (which would have to use volatile), Please see [1]. Even with volatile access, with memory ops the compiler can propagate operands, resulting in ~8k code size reduction, and many hundreds (if not thousands) MOVs propagated into subsequent instructions. Please note many code examples in [1]. This is not possible with the asm variant. [1] https://lore.kernel.org/lkml/20231004192404.31733-1-ubizjak@gmail.com/ > Because if we just stick with inline asms, the need for "volatile" > simply goes away. No, the compiler is then free to remove or duplicate the asm (plus other unwanted optimizations), please see the end of chapter 6.47.2.1 in [2]. [2] https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/Extended-Asm.html#Volatile-1 > The existing volatile on those percpu inline asms is *wrong*. It's a > historical mistake. Please see above. > And with just a plain non-volatile inline asm, the inline asm wins. Please see [1] for the code propagation argument. 
> It doesn't have the (bad) read-once behavior of a volatile memory op. > > And it also doesn't have the (horrible correctness issue) > rematerialization behavior of a non-volatile memory op. Unfortunately, it does. Without volatile, asm can be rematerialized in the same way as it can be CSEd. OTOH, the memory op in the memory-ops approach is cast to volatile in the this_* case, so it for sure won't get rematerialized. > A compiler that were to rematerialize an inline asm (instead of > spilling) would be a bad joke. That's not an optimization, that's just > a crazy bad compiler with a code generation bug. But that is what the compiler does without volatile. Thanks, Uros.
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 17:21 ` Uros Bizjak @ 2023-10-19 18:06 ` Linus Torvalds 2023-10-19 18:16 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 18:06 UTC (permalink / raw) To: Uros Bizjak Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 10:21, Uros Bizjak <ubizjak@gmail.com> wrote: > > > A compiler that were to rematerialize an inline asm (instead of > > spilling) would be a bad joke. That's not an optimization, that's just > > a crazy bad compiler with a code generation bug. > > But that is what the compiler does without volatile. Do you actually have a real case of that, or are you basing it purely off insane documentation? Because remat of inline asm really _is_ insane. Linus
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 18:06 ` Linus Torvalds @ 2023-10-19 18:16 ` Uros Bizjak 2023-10-19 18:49 ` Linus Torvalds 2023-10-19 21:04 ` Linus Torvalds 0 siblings, 2 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-19 18:16 UTC (permalink / raw) To: Linus Torvalds Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 19, 2023 at 8:06 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 19 Oct 2023 at 10:21, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > A compiler that were to rematerialize an inline asm (instead of > > > spilling) would be a bad joke. That's not an optimization, that's just > > > a crazy bad compiler with a code generation bug. > > > > But that is what the compiler does without volatile. > > Do you actually have a real case of that, or are you basing it purely off > insane documentation? > > Because remat of inline asm really _is_ insane. The following testcase pushes the compiler to the limit:

--cut here--
extern void ex (int);

static int read (void)
{
  int ret;

  asm ("# -> %0" : "=r"(ret));
  return ret;
}

int foo (void)
{
  int ret = read ();

  ex (ret);
  asm volatile ("clobber" : : : "ax", "cx", "dx", "bx", "bp", "si", "di");
  return ret;
}

extern int m;

int bar (void)
{
  int ret = m;

  ex (ret);
  asm volatile ("clobber" : : : "ax", "cx", "dx", "bx", "bp", "si", "di");
  return ret;
}
--cut here--

Please compile the above with -S -O2 -m32 (so we don't have to list all 16 x86_64 registers). And NO (whee...), there is no rematerialization of asm (foo()). OTOH, there is also no rematerialization from non-volatile memory (bar()), although it would be more optimal than the spill-to/fill-from-frame pair of moves. I wonder which "certain circumstances" the documentation is referring to. Uros.
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 18:16 ` Uros Bizjak @ 2023-10-19 18:49 ` Linus Torvalds 2023-10-19 19:07 ` Linus Torvalds 2023-10-19 21:04 ` Linus Torvalds 1 sibling, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 18:49 UTC (permalink / raw) To: Uros Bizjak Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 11:16, Uros Bizjak <ubizjak@gmail.com> wrote: > > And NO (whee...), there is no rematerialization of asm (foo() ). OTOH, > there is also no rematerialization from non-volatile memory (bar() ), > although it would be more optimal than spill to/fill from a frame pair > of moves. I wonder what are "certain circumstances" that the > documentation is referring to. Honestly, I've actually never seen gcc rematerialize anything at all. I really only started worrying about remat issues in a theoretical sense, and because I feel it would be relatively *easy* to do for something where the source is a load. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 18:49 ` Linus Torvalds @ 2023-10-19 19:07 ` Linus Torvalds 2023-10-20 7:57 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 19:07 UTC (permalink / raw) To: Uros Bizjak Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 11:49, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Honestly, I've actually never seen gcc rematerialize anything at all. > > I really only started worrying about remat issues in a theoretical > sense, and because I feel it would be relatively *easy* to do for > something where the source is a load. .. I started looking around, since I actually have gcc sources around. At least lra-remat.cc explicitly says o no any memory (as access to memory is non-profitable) so if we could just *rely* on that, it would actually allow us to use memory ops without the volatile. That would be the best of all worlds, of course. I do have clang sources too, but I've looked at gcc enough that I at least can do the "grep and look for patterns" and tend to have an idea of what the passes are. Clang, not so much. From my "monkey see patterns" check, it does look like clang mainly just rematerializes immediates (and some address generation), but also memory accesses without a base register (allowing a "base register" of (%rip)). See X86InstrInfo::isReallyTriviallyReMaterializable() in case anybody cares. So it does look like clang might actually rematerialize exactly percpu loads with a constant address, and my "those are easy to rematerialize" worry may have been correct. HOWEVER, I'd like to once again say that I know so little about llvm that my "monkey with 'grep' and some pattern matching ability" thing really means that I'm just guessing. 
Now, PeterZ is obviously worried about even just CSE and re-ordering, so remat isn't the *only* issue. I do agree that we've had 'volatile' on many of the asms possibly hiding any issues for the last five years. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 19:07 ` Linus Torvalds @ 2023-10-20 7:57 ` Uros Bizjak 0 siblings, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-20 7:57 UTC (permalink / raw) To: Linus Torvalds Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 19, 2023 at 9:07 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 19 Oct 2023 at 11:49, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Honestly, I've actually never seen gcc rematerialize anything at all. > > > > I really only started worrying about remat issues in a theoretical > > sense, and because I feel it would be relatively *easy* to do for > > something where the source is a load. > > .. I started looking around, since I actually have gcc sources around. > > At least lra-remat.cc explicitly says > > o no any memory (as access to memory is non-profitable) > > so if we could just *rely* on that, it would actually allow us to use > memory ops without the volatile. > > That would be the best of all worlds, of course. I have made an experiment and changed:

 #define __raw_cpu_read(qual, pcp)					\
 ({									\
-	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));		\
+	*(__my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));			\
 })

 #define __raw_cpu_write(qual, pcp, val)				\
 do {									\
-	*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val);	\
+	*(__my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val);		\
 } while (0)

Basically, I removed "volatile" from read/write accessors. With all new percpu patches in place the difference in all percpu accesses is:

Reference: 15990 accesses
Patched:   15976 accesses

So, the difference is 14 fewer accesses. Waaay too low of a gain for a potential pain.
The code size savings are:

    text     data     bss      dec      hex  filename
25476129  4389468  808452  30674049  1d40c81  vmlinux-new.o
25476021  4389444  808452  30673917  1d40bfd  vmlinux-ref.o

So, 108 bytes for the default build. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 18:16 ` Uros Bizjak 2023-10-19 18:49 ` Linus Torvalds @ 2023-10-19 21:04 ` Linus Torvalds 2023-10-19 22:39 ` Linus Torvalds 1 sibling, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 21:04 UTC (permalink / raw) To: Uros Bizjak Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 11:16, Uros Bizjak <ubizjak@gmail.com> wrote: > > I wonder what are "certain circumstances" that the > documentation is referring to. Looking more at that "under certain circumstances" statement, I actually think it refers even to the situation *with* "asm volatile". In particular, when doing loop unrolling, gcc will obviously duplicate the asm (both with and without volatile). That would obviously lead to exactly the kinds of problems that snippet of documentation then talks about:

   "This can lead to unexpected duplicate symbol errors during
    compilation if your asm code defines symbols or labels"

so that makes complete sense. It also matches up with the fact that this is all actually documented very much under the "volatile" label - ie this is a generic thing that happens even *with* volatile in place, and we should not expect that "one asm statement" will generate exactly one copy of the resulting assembler. It also matches up with the whole earlier preceding note about "Note that the compiler can move even volatile asm instructions relative to other code, including across jump instructions". So I think what happened is exactly that somebody was declaring a variable or local label inside the asm, and then the docs were clarified to state that the asm can be duplicated in the output. Of course, this is all just by me reading the docs and looking at gcc output for way too many years. It's not based on any knowledge of the original issue.
Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 21:04 ` Linus Torvalds @ 2023-10-19 22:39 ` Linus Torvalds 2023-10-20 8:08 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 22:39 UTC (permalink / raw) To: Uros Bizjak Cc: peterz, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers Unrelated question to the gcc people (well, related in the way that this discussion made me *test* this). Lookie here:

int test(void)
{
	unsigned int sum = 0;
	for (int i = 0; i < 4; i++) {
		unsigned int val;
#if ONE
		asm("magic1 %0":"=r" (val): :"memory");
#else
		asm volatile("magic2 %0":"=r" (val));
#endif
		sum += val;
	}
	return sum;
}

and now build this with

	gcc -O2 -S -DONE -funroll-all-loops t.c

and I get a *completely* nonsensical end result. What gcc generates is literally insane. What I *expected* to happen was that the two cases (with "-DONE" and without) would generate the same code, since one has an "asm volatile", and the other has a memory clobber. IOW, neither really should be something that can be combined. But no. The "-DONE" version is completely crazy with my gcc-13.2.1 setup. First off, it does actually CSE all the asm's despite the memory clobber. Which I find quite debatable, but whatever. But not only does it CSE them, it then does *not* just multiply the result by four. No. It generates this insanity:

	magic1 %eax
	movl %eax, %edx
	addl %eax, %eax
	addl %edx, %eax
	addl %edx, %eax
	ret

so it has apparently done the CSE _after_ the other optimizations. Very strange. Honestly, the CSE part looks like an obvious bug to me. The gcc documentation states:

	The "memory" clobber tells the compiler that the assembly code
	performs memory reads or writes to items other than those listed
	in the input and output operands (for example, accessing the
	memory pointed to by one of the input parameters).
so CSE'ing any inline asm with a memory clobber sounds *very* dubious. The asm literally told the compiler that it has side effects in unrelated memory locations! I don't think we actually care in the kernel (and yes, I think it would always be safer to use "asm volatile" if there's some unrelated memory locations that change), but since I was testing this and was surprised, and since the obvious reading of the documented behavior of a memory clobber really does scream "you can't combine those asms", I thought I'd mention this. Also, *without* the memory clobber, gcc obviously still does CSE the asm, but also, gcc ends up doing just

	magic1 %eax
	sall $2, %eax
	ret

so the memory clobber clearly does actually make a difference. Just not a _sane_ one. In testing, clang does *not* have this apparently buggy behavior (but clang annoyingly actually checks the instruction mnemonics, so I had to change "magic" into "strl" instead to make clang happy). Hmm? Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 22:39 ` Linus Torvalds @ 2023-10-20 8:08 ` Uros Bizjak 0 siblings, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-20 8:08 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Fri, Oct 20, 2023 at 12:39 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Unrelated question to the gcc people (well, related in the way that > this discussion made me *test* this). Perhaps you should report this in the gcc bugzilla and move the discussion there. This thread already has more than 100 messages... Thanks, Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 22:40 ` Linus Torvalds 2023-10-18 23:06 ` Linus Torvalds 2023-10-19 7:04 ` Uros Bizjak @ 2023-10-19 8:44 ` Peter Zijlstra 2023-10-19 8:54 ` Peter Zijlstra 2023-10-19 9:07 ` Peter Zijlstra 4 siblings, 0 replies; 106+ messages in thread From: Peter Zijlstra @ 2023-10-19 8:44 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote: > Side note: the code that caused that problem is this:
>
> __always_inline void __cyc2ns_read(struct cyc2ns_data *data)
> {
>	int seq, idx;
>
>	do {
>		seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
>		...
>	} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }
>
> where the issue is that the this_cpu_read() of that sequence number
> needs to be ordered.

I have very vague memories of other code also relying on this_cpu_read() implying READ_ONCE(). And that code really only is buggy if you do not have that. Since it is cpu local, the smp_rmb() would be confusing, as would smp_load_acquire() be -- there is no cross-cpu data ordering. The other option is of course adding explicit barrier(), but that's entirely superfluous when all the loads are READ_ONCE(). If you want to make this_cpu_read() not imply READ_ONCE(), then we should go audit all users :/ Can be done ofc. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 22:40 ` Linus Torvalds ` (2 preceding siblings ...) 2023-10-19 8:44 ` Peter Zijlstra @ 2023-10-19 8:54 ` Peter Zijlstra 2023-10-19 17:04 ` Linus Torvalds 2023-10-19 9:07 ` Peter Zijlstra 4 siblings, 1 reply; 106+ messages in thread From: Peter Zijlstra @ 2023-10-19 8:54 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote: > Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice > how in raw_read_seqcount_begin() we have
>
>	unsigned _seq = __read_seqcount_begin(s);
>	smp_rmb();
>
> because it actually does the proper barriers. Notice how the garbage
> code in __cyc2ns_read() doesn't have them - and how it was buggy as a
> result.
>
> (Also notice how this all predates our "we should use load_acquire()
> instead of smp_rmb()", but whatever).

seqlock actually wants rmb even today, the pattern is:

	do {
		seq = load-seq
		rmb
		load-data
		rmb
	} while (seq != re-load-seq)

we specifically only care about loads, and the data loads must be between the sequence number loads. As such, load-acquire is not a natural match. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 8:54 ` Peter Zijlstra @ 2023-10-19 17:04 ` Linus Torvalds 2023-10-19 18:13 ` Peter Zijlstra 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 17:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Uros Bizjak, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 01:54, Peter Zijlstra <peterz@infradead.org> wrote: > > seqlock actually wants rmb even today, the pattern is: That's the pattern today, and yes, it makes superficial sense, but no, it's not a "really wants" issue.

>	do {
>		seq = load-seq
>		rmb
>		load-data
>		rmb
>	} while (seq != re-load-seq)
>
> we specifically only care about loads, and the data loads must be
> between the sequence number loads.
>
> As such, load-acquire is not a natural match.

You are correct that "rmb" _sounds_ logical. We do, after all, want to order reads wrt each other. So rmb is the obvious choice. But the thing is, "read_acquire" actually does that too. So if you do

	seq = load_acquire(orig_seq);
	load-data

then that acquire actually makes that first 'rmb' pointless. Acquire already guarantees that all subsequent memory operations are ordered wrt that read. And 'acquire' is likely faster than 'rmb' on sane modern architectures. On x86 it doesn't matter (rmb is a no-op, and all loads are acquires). But on arm64, for example, you can do a 'ld.acq' in one instruction and you're done - while a rmb then ends up being a barrier (ok, the asm mnemonics are horrible: it's not "ld.acq", it's "ldar", but whatever - I like arm64 as an architecture, but I think they made the standard assembly syntax pointlessly and actively hostile to humans).
Of course then microarchitectures may end up doing basically the same thing, but at least technically the 'load acquire' is likely more targeted and more optimized. The second rmb is then harder to change, and that is going to stay an rmb (you could say "do an acquire on the last data load", but that doesn't fit the sane locking semantics of a sequence lock). But I do think our sequence counters would be better off using "smp_load_acquire()" for that initial read. Of course, then the percpu case doesn't care about the SMP ordering, but it should still use an UP barrier to make sure things don't get re-ordered. Relying on our "percpu_read()" ordering other reads around it is *wrong*. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 17:04 ` Linus Torvalds @ 2023-10-19 18:13 ` Peter Zijlstra 2023-10-19 18:22 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Peter Zijlstra @ 2023-10-19 18:13 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 19, 2023 at 10:04:56AM -0700, Linus Torvalds wrote: > So if you do > > seq = load_acquire(orig_seq); > load-data > > then that acquire actually makes that first 'rmb' pointless. Acquire > already guarantees that all subsequent memory operations are ordered > wrt that read. > > And 'acquire' is likely faster than 'rmb' on sane modern architectures. > > On x86 it doesn't matter (rmb is a no-op, and all loads are acquires). > > But on arm64, for example, you can do a 'ld.acq' in one instruction > and you're done - while a rmb then ends up being a barrier (ok, the > asm mnemonics are horrible: it's not "ld.acq", it's "ldar", but > whatever - I like arm64 as an architecture, but I think they made the > standard assembly syntax pointlessly and actively hostile to humans). > > Of course then microarchitectures may end up doing basically the same > thing, but at least technically the 'load acquire' is likely more > targeted and more optimized. Sure, acquire should work fine here. > The second rmb is then harder to change, and that is going to stay an > rmb ( you could say "do an acquire on the last data load, but that > doesn't fit the sane locking semantics of a sequence lock). Wouldn't even work, acquire allows an earlier load to pass it. It only constraints later loads to not happen before. > Of course, then the percpu case doesn't care about the SMP ordering, > but it should still use an UP barrier to make sure things don't get > re-ordered. 
> Relying on our "percpu_read()" ordering other reads around > it is *wrong*. I'm happy to put barrier() in there if it makes you feel better. But are you really saying this_cpu_read() should not imply READ_ONCE()? If so, we should probably go audit a ton of code :/ ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 18:13 ` Peter Zijlstra @ 2023-10-19 18:22 ` Linus Torvalds 2023-10-19 18:37 ` Uros Bizjak 0 siblings, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 18:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Uros Bizjak, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 11:14, Peter Zijlstra <peterz@infradead.org> wrote: > > But are you really saying this_cpu_read() should not imply READ_ONCE()? Well, Uros is saying that we may be *forced* to have that implication, much as I really hate it (and wonder at the competence of a compiler that forces the code-pessimizing 'volatile'). And the "it's not volatile" is actually our historical behavior. The volatile really is new, and didn't exist before your commit b59167ac7baf ("x86/percpu: Fix this_cpu_read()"). So the whole "implies READ_ONCE()" really seems to be due to that *one* mistake in our percpu sequence locking code. Yes, it's been that way for 5 years now, but it was the other way around for the preceding decade.... Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 18:22 ` Linus Torvalds @ 2023-10-19 18:37 ` Uros Bizjak 0 siblings, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-19 18:37 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 19, 2023 at 8:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 19 Oct 2023 at 11:14, Peter Zijlstra <peterz@infradead.org> wrote: > > > > But are you really saying this_cpu_read() should not imply READ_ONCE()? > > Well, Uros is saying that we may be *forced* to have that implication, > much as I really hate it (and wonder at the competence of a compiler > that forces the code-pessimizing 'volatile'). Please note that my patch mitigates exactly this. The propagation of volatile(!) arguments allows huge instruction and code savings. By using non-volatile asm, a very limited BB CSE can perhaps remove a few asms. However, if there is no READ_ONCE requirement, then we can simply remove "volatile" qualification for this_cpu_read from the memory-ops patch. It will be like a field trip for the compiler, because *then* it will be able to optimize everything without limitations. Uros. > And the "it's not volatile" is actually our historical behavior. The > volatile really is new, and didn't exist before your commit > b59167ac7baf ("x86/percpu: Fix this_cpu_read()"). > > So the whole "implies READ_ONCE()" really seems to be due to that > *one* mistake in our percpu sequence locking code. > > Yes, it's been that way for 5 years now, but it was the other way > around for the preceding decade.... > > Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 22:40 ` Linus Torvalds ` (3 preceding siblings ...) 2023-10-19 8:54 ` Peter Zijlstra @ 2023-10-19 9:07 ` Peter Zijlstra 2023-10-19 9:23 ` Uros Bizjak 4 siblings, 1 reply; 106+ messages in thread From: Peter Zijlstra @ 2023-10-19 9:07 UTC (permalink / raw) To: Linus Torvalds Cc: Uros Bizjak, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote: > On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > The ones in "raw" form are not IRQ safe and these are implemented > > without volatile qualifier. > > You are misreading it. > > Both *are* irq safe - on x86. Stronger, x86 arch code very much relies on them being NMI-safe. Which makes the generic implementation insufficient. They *must* be single RmW instructions on x86. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 9:07 ` Peter Zijlstra @ 2023-10-19 9:23 ` Uros Bizjak 0 siblings, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-19 9:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, Oct 19, 2023 at 11:09 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote: > > On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > The ones in "raw" form are not IRQ safe and these are implemented > > > without volatile qualifier. > > > > You are misreading it. > > > > Both *are* irq safe - on x86. > > Stronger, x86 arch code very much relies on them being NMI-safe. Which > makes the generic implementation insufficient. > > They *must* be single RmW instructions on x86. Maybe I should rephrase my quoted claim above: "raw" versions do not need to be IRQ safe [*]. [*] Memory arguments need to be stable, so IRQ and NMI handlers cannot change them outside of the critical section where the "raw" version operates. When memory arguments are stable, the compiler can omit reads of (i.e. cache) the arguments, or re-read them (rematerialize) from memory. The atomicity of the operation is irrelevant in the "raw" context, so the implementation of raw_percpu_xchg_op using raw_cpu_read/write is OK in this context. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 20:22 ` Linus Torvalds 2023-10-18 20:34 ` Linus Torvalds @ 2023-10-18 20:42 ` Uros Bizjak 2023-10-19 16:32 ` Uros Bizjak 2 siblings, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 20:42 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 10:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > This pach works for me: > > Looks fine. > > But you actually bring up another issue: > > > BTW: I also don't understand the comment from include/linux/smp.h: > > > > /* > > * Allow the architecture to differentiate between a stable and unstable read. > > * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a > > * regular asm read for the stable. > > I think the comment is badly worded, but I think the issue may actually be real. > > One word: rematerialization. > > The thing is, turning inline asm accesses to regular compiler loads > has a *very* bad semantic problem: the compiler may now feel like it > can not only combine the loads (ok), but also possibly rematerialize > values by re-doing the loads (NOT OK!). > > IOW, the kernel often has very strict requirements of "at most once" > behavior, because doing two loads might give different results. > > The cpu number is a good example of this. > > And yes, sometimes we use actual volatile accesses for them > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general, > and are much too strict. 
> Not only does gcc generally lose its mind > when it sees volatile (ie it stops doing various sane combinations > that would actually be perfectly valid), but it obviously also stops > doing CSE on the loads (as it has to). > > So the "non-volatile asm" has been a great way to get the "at most > one" behavior: it's safe wrt interrupts changing the value, because > you will see *one* value, not two. As far as we know, gcc never > rematerializes the output of an inline asm. So when you use an inline > asm, you may have the result CSE'd, but you'll never see it generate > more than *one* copy of the inline asm. > > (Of course, as with so much about inline asm, that "knowledge" is not > necessarily explicitly spelled out anywhere, and it's just "that's how > it has always worked"). > > IOW, look at code like the one in swiotlb_pool_find_slots(), which does this: > > int start = raw_smp_processor_id() & (pool->nareas - 1); > > and the use of 'start' really is meant to be just a good heuristic, in > that different concurrent CPU's will start looking in different pools. > So that code is basically "cpu-local by default", but it's purely > about locality, it's not some kind of correctness issue, and it's not > necessarily run when the code is *tied* to a particular CPU. > > But what *is* important is that 'start' have *one* value, and one > value only. So look at that loop, which basically does
>
>	do {
>		.. use the 'i' based on 'start' ..
>		if (++i >= pool->nareas)
>			i = 0;
>	} while (i != start);
>
> and it is very important indeed that the compiler does *not* think > "Oh, I can rematerialize the 'start' value". > > See what I'm saying? Using 'volatile' for loading the current CPU > value would be bad for performance for no good reason. But loading it > multiple times would be a *bug*.
> > Using inline asm is basically perfect here: the compiler can *combine* > two inline asms into one, but once we have a value for 'start', it > won't change, because the compiler is not going to decide "I can drop > this value, and just re-do the inline asm to rematerialize it". > > This all makes me worried about the __seg_fs thing. Please note that there is a difference between this_* and raw_* accessors. this_* has "volatile" qualification, and for sure it won't ever be rematerialized (this would defeat the purpose of "volatile"). Previously, this_* was defined as asm-volatile, and now it is defined as a volatile memory access. GCC will disable almost all optimizations when volatile memory is accessed. IIRC, volatile-asm also won't be combined, since the compiler does not know about secondary effects of asm. Side note: The raw_smp_processor_id() uses this_, so it has volatile qualification. Perhaps we can do +#define __smp_processor_id() raw_cpu_read(pcpu_hot.cpu_number) as this seems like the relaxed version of smp_processor_id(). So, guarantees of asm-volatile are the same as guarantees of volatile memory access. Uros. > > For 'current', this is all perfect. Rematerializing current is > actually better than spilling and reloading the value. > > But for something like raw_smp_processor_id(), rematerializing would > be a correctness problem, and a really horrible one (because in > practice, the code would work 99.9999% of the time, and then once in a > blue moon, it would rematerialize a different value). > > See the problem? > > I guess we could use the stdatomics to try to explain these issues to > the compiler, but I don't even know what the C interfaces look like or > whether they are stable and usable across the range of compilers we > use. > > Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 20:22 ` Linus Torvalds 2023-10-18 20:34 ` Linus Torvalds 2023-10-18 20:42 ` Uros Bizjak @ 2023-10-19 16:32 ` Uros Bizjak 2023-10-19 17:08 ` Linus Torvalds 2 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-19 16:32 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, Oct 18, 2023 at 10:22 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > This pach works for me: > > Looks fine. > > But you actually bring up another issue: > > > BTW: I also don't understand the comment from include/linux/smp.h: > > > > /* > > * Allow the architecture to differentiate between a stable and unstable read. > > * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a > > * regular asm read for the stable. > > I think the comment is badly worded, but I think the issue may actually be real. > > One word: rematerialization. > > The thing is, turning inline asm accesses to regular compiler loads > has a *very* bad semantic problem: the compiler may now feel like it > can not only combine the loads (ok), but also possibly rematerialize > values by re-doing the loads (NOT OK!). > > IOW, the kernel often has very strict requirements of "at most once" > behavior, because doing two loads might give different results. > > The cpu number is a good example of this. > > And yes, sometimes we use actual volatile accesses for them > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general, > and are much too strict. 
> Not only does gcc generally lose its mind > when it sees volatile (ie it stops doing various sane combinations > that would actually be perfectly valid), but it obviously also stops > doing CSE on the loads (as it has to). > > So the "non-volatile asm" has been a great way to get the "at most > one" behavior: it's safe wrt interrupts changing the value, because > you will see *one* value, not two. As far as we know, gcc never > rematerializes the output of an inline asm. So when you use an inline > asm, you may have the result CSE'd, but you'll never see it generate > more than *one* copy of the inline asm. > > (Of course, as with so much about inline asm, that "knowledge" is not > necessarily explicitly spelled out anywhere, and it's just "that's how > it has always worked"). Perhaps you will be interested in chapter 6.47.2.1, "Volatile" of the GCC manual, which says:

   "Under certain circumstances, GCC may duplicate (or remove
    duplicates of) your assembly code when optimizing."

The compiler may re-materialize non-volatile asm in the same way it may re-materialize arguments from non-volatile memory. To avoid this, volatile asm is necessary when unstable memory arguments are accessed using this_* variants. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-19 16:32 ` Uros Bizjak @ 2023-10-19 17:08 ` Linus Torvalds 0 siblings, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-19 17:08 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Thu, 19 Oct 2023 at 09:32, Uros Bizjak <ubizjak@gmail.com> wrote: > > Perhaps you will be interested in chapter 6.47.2.1, "Volatile" of GCC > manual that says: > > " Under certain circumstances, GCC may duplicate (or remove duplicates > of) your assembly code when optimizing." > > The compiler may re-materialize non-volatile asm in the same way it > may re-materialize arguments from non-volatile memory. To avoid this, > volatile asm is necessary when unstable memory arguments are accessed > using this_* variants. That's disgusting. The whole (and only) point of rematerialization is as an optimization. When gcc doesn't know how expensive an inline asm is (and they can be very expensive indeed), doing remat on it would be an obvious mistake. I think what you say is "we're technically allowed to do it, but we'd be crazy to *actually* do it". The kernel does require a sane compiler. We will turn off options that make it do insane things - even when said insane things are allowed by a bad standard. Do you actually have gcc code that rematerializes an inline asm? Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 18:08 ` Uros Bizjak 2023-10-18 18:15 ` Linus Torvalds @ 2023-10-18 18:29 ` Nadav Amit 1 sibling, 0 replies; 106+ messages in thread From: Nadav Amit @ 2023-10-18 18:29 UTC (permalink / raw) To: Uros Bizjak Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers > On Oct 18, 2023, at 9:08 PM, Uros Bizjak <ubizjak@gmail.com> wrote: > > We can use alias to __seg_gs non-const memory, so the value can be > accessed without asm. __barrier() will then force reload. Please note > that any memory clobber, hidden inside asm will also force reload. For the record, at the time I tried to find a creative solution to have fine granularity barrier control. I looked into address-spaces, pure and const function attributes, and use of the restrict keyword. I did not find a better solution. The behavior of most of these mechanisms was non-intuitive for me and inconsistent across compilers. You may succeed where I have failed as you seem more familiar with the compiler code, but be aware it is a rabbit hole... ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 13:15 ` Uros Bizjak 2023-10-18 14:46 ` Nadav Amit @ 2023-10-18 16:12 ` Linus Torvalds 2023-10-18 17:07 ` Uros Bizjak 1 sibling, 1 reply; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 16:12 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 06:15, Uros Bizjak <ubizjak@gmail.com> wrote: > > Attached is the prototype patch that works for me (together with > Linus' FPU switching patch). That looks reasonable, but I think the separate compilation unit is unnecessary, and I still absolutely hate how that const_pcpu_hot thing is declared twice (your patch does it in both current.h and in percpu-hot.c). How about we just do the whole alias as a linker thing instead? So the same way that we just do jiffies = jiffies_64; in our arch/x86/kernel/vmlinux.lds.S file, we could just do const_pcpu_hot = pcpu_hot; in there. Then, as far as the compiler is concerned, we just have DECLARE_PER_CPU_ALIGNED( const struct pcpu_hot __percpu_seg_override, const_pcpu_hot) and the compiler doesn't know that it's aliased to anything else. And please do that declaration in just *one* place. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 16:12 ` Linus Torvalds @ 2023-10-18 17:07 ` Uros Bizjak 2023-10-18 18:01 ` Linus Torvalds 0 siblings, 1 reply; 106+ messages in thread From: Uros Bizjak @ 2023-10-18 17:07 UTC (permalink / raw) To: Linus Torvalds Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers [-- Attachment #1: Type: text/plain, Size: 1334 bytes --] On Wed, Oct 18, 2023 at 6:13 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Oct 2023 at 06:15, Uros Bizjak <ubizjak@gmail.com> wrote: > > > > Attached is the prototype patch that works for me (together with > > Linus' FPU switching patch). > > That looks reasonable, but I think the separate compilation unit is > unnecessary, and I still absolutely hate how that const_pcpu_hot thing > is declared twice (your patch does it in both current.h and in > percpu-hot.c). > > How about we just do the whole alias as a linker thing instead? So the > same way that we just do > > jiffies = jiffies_64; > > in our arch/x86/kernel/vmlinux.lds.S file, we could just do > > const_pcpu_hot = pcpu_hot; > > in there. > > Then, as far as the compiler is concerned, we just have > > DECLARE_PER_CPU_ALIGNED( > const struct pcpu_hot __percpu_seg_override, > const_pcpu_hot) > > and the compiler doesn't know that it's aliased to anything else. Oh... ... this works, too! Please see the attached patch. (Note to self: Leave linking stuff to linker) ;) > And please do that declaration in just *one* place. Sure. Now the patch looks quite slim, but works as expected, reducing the number of current_task accesses from 3841 to 3220. Thanks, Uros. 
[-- Attachment #2: current-2.diff.txt --] [-- Type: text/plain, Size: 1502 bytes --] diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index a1168e7b69e5..36902d61fd55 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -36,8 +36,17 @@ static_assert(sizeof(struct pcpu_hot) == 64); DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot); +/* + * + */ +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override, + const_pcpu_hot); + static __always_inline struct task_struct *get_current(void) { + if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT)) + return const_pcpu_hot.current_task; + return this_cpu_read_stable(pcpu_hot.current_task); } diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S index 9cdb1a7332c4..7cbc42619e16 100644 --- a/arch/x86/kernel/vmlinux.lds.S +++ b/arch/x86/kernel/vmlinux.lds.S @@ -46,6 +46,7 @@ ENTRY(phys_startup_64) #endif jiffies = jiffies_64; +const_pcpu_hot = pcpu_hot; #if defined(CONFIG_X86_64) /* diff --git a/include/linux/compiler.h b/include/linux/compiler.h index d7779a18b24f..bf9815eaf4aa 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -212,7 +212,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, */ #define ___ADDRESSABLE(sym, __attrs) \ static void * __used __attrs \ - __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)&sym; + __UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)(uintptr_t)&sym; #define __ADDRESSABLE(sym) \ ___ADDRESSABLE(sym, __section(".discard.addressable")) ^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-18 17:07 ` Uros Bizjak @ 2023-10-18 18:01 ` Linus Torvalds 0 siblings, 0 replies; 106+ messages in thread From: Linus Torvalds @ 2023-10-18 18:01 UTC (permalink / raw) To: Uros Bizjak Cc: Nadav Amit, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Wed, 18 Oct 2023 at 10:08, Uros Bizjak <ubizjak@gmail.com> wrote: > > Sure. Now the patch looks quite slim, but works as expected, reducing > the number of current_task accesses from 3841 to 3220. Thanks, that patch looks lovely to me. Since you've done all the hard lifting and the testing, I'd suggest you submit this all to the x86, including my fpu patch. Take my sign-off, and the commit message might be something along the lines of x86: clean up fpu switching in the middle of task switching It happens to work, but it's very very wrong, because our 'current' macro is magic that is supposedly loading a stable value. It just happens to be not quite stable enough and the compilers re-load the value enough for this code to work. But it's wrong. It also generates worse code. So fix it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> or add any verbiage you feel appropriate. Linus ^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr() 2023-10-16 18:52 ` Uros Bizjak 2023-10-16 19:24 ` Linus Torvalds @ 2023-10-16 21:09 ` Uros Bizjak 1 sibling, 0 replies; 106+ messages in thread From: Uros Bizjak @ 2023-10-16 21:09 UTC (permalink / raw) To: Nadav Amit Cc: Linus Torvalds, the arch/x86 maintainers, Linux Kernel Mailing List, Andy Lutomirski, Brian Gerst, Denys Vlasenko, H . Peter Anvin, Peter Zijlstra, Thomas Gleixner, Josh Poimboeuf, Nick Desaulniers On Mon, Oct 16, 2023 at 8:52 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <namit@vmware.com> wrote: > > > > > > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > !! External Email > > > > > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <ubizjak@gmail.com> wrote: > > >> > > >> Reading the above, it looks to me that we don't want to play games > > >> with "const aliased" versions of current_task [1], as proposed by > > >> Nadav in his patch series. > > > > > > Well, maybe I'd like it if I saw what the effect of it was, but that > > > patch mentions "sync_mm_rss()" which doesn't actually exist > > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and > > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats > > > into percpu_counter") > > > > So I added a new version of the current aliasing (well, actually pcpu_hot > > in the new version) on top of Uros’s patches, and the effect can be seen > > in many functions. 
I don’t want to bother with many examples so here is > > a common and simple one: > > > > Currently syscall_exit_work() that starts with: > > > > 0xffffffff8111e120 <+0>: push %rbp > > 0xffffffff8111e121 <+1>: mov %rdi,%rbp > > 0xffffffff8111e124 <+4>: push %rbx > > 0xffffffff8111e125 <+5>: mov %rsi,%rbx > > 0xffffffff8111e128 <+8>: and $0x20,%esi > > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35> > > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax > > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax) > > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266> > > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax > > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax) > > > > Using the const-alias changes the beginning of syscall_exit_work to: > > > > 0xffffffff8111cb80 <+0>: push %r12 > > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot> > > 0xffffffff8111cb8a <+10>: push %rbp > > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp > > 0xffffffff8111cb8e <+14>: push %rbx > > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx > > 0xffffffff8111cb92 <+18>: and $0x20,%esi > > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38> > > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12) > > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250> > > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12) > > > > So we both see RIP-relative addressing is being used (hence the instruction is > > one byte shorter) and the reload going away. > > > > Now, I am not a compiler expert as for the rationale, but it googling around > > I can see Nick explaining the rationale [1] - if you use “p” your read memory. > > BTW: It is related to discussion you had [2], in which you encountered an issue > > I also encountered before [3]. My bad for pushing it in. > > > > Anyhow, I created a similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 ) > > to show this behavior - how compiler barriers cause reload. 
It seems that this > > behavior happens on GCC and CLANG on various versions. > > > > The idea behind the patch is that it communicates - in the compilation unit > > granularity - that current is fixed. There is an issue of whether it works with > > LTO, which I have never checked. > > > > > > [1] https://reviews.llvm.org/D145416 > > [2] https://lore.kernel.org/lkml/20230306120106.GE1267364@hirez.programming.kicks-ass.net/ > > [3] https://lore.kernel.org/all/20190823224424.15296-5-namit@vmware.com/ > > > > -- > > > > Here’s the updated patch - but I didn’t really boot a machine with it so new > > issues might have come since my last patch-set: > > Unfortunately, it does not work and dies early in the boot with: > > [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [ 4.940090] #PF: supervisor write access in kernel mode > [ 4.940090] #PF: error_code(0x0002) - not-present page > [ 4.940090] PGD 0 P4D 0 > [ 4.940090] Oops: 0002 [#1] PREEMPT SMP PTI > [ 4.940090] CPU: 1 PID: 52 Comm: kworker/u4:1 Not tainted > 6.6.0-rc6-00365-g0c09c1d70838-dirty #7 > [ 4.940090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.16.2-1.fc37 04/01/2014 > [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30 > [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89 > ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea > fa ff ff <f0> 41 ff 0c 24 0f > 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb > [ 4.940090] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246 > [ 4.940090] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000 > [ 4.940090] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718 > [ 4.940090] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff > [ 4.940090] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000 > [ 4.940090] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000 > [ 4.940090] FS: 0000000000000000(0000) GS:ffff994763f00000(0000) > 
knlGS:0000000000000000 > [ 4.940090] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 4.940090] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0 > [ 4.940090] Call Trace: > [ 4.940090] <TASK> > [ 4.940090] ? __die+0x1e/0x60 > [ 4.940090] ? page_fault_oops+0x17b/0x470 > [ 4.940090] ? search_module_extables+0x14/0x50 > [ 4.940090] ? exc_page_fault+0x66/0x140 > [ 4.940090] ? asm_exc_page_fault+0x26/0x30 > [ 4.940090] ? begin_new_exec+0x8f2/0xa30 > [ 4.940090] ? begin_new_exec+0x3ce/0xa30 > [ 4.940090] ? load_elf_phdrs+0x67/0xb0 > [ 4.940090] load_elf_binary+0x2bb/0x1770 > [ 4.940090] ? __kernel_read+0x136/0x2d0 > [ 4.940090] bprm_execve+0x277/0x630 > [ 4.940090] kernel_execve+0x145/0x1a0 > [ 4.940090] call_usermodehelper_exec_async+0xcb/0x180 > [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 > [ 4.940090] ret_from_fork+0x2f/0x50 > [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10 > [ 4.940090] ret_from_fork_asm+0x1b/0x30 > [ 4.940090] </TASK> > [ 4.940090] Modules linked in: > [ 4.940090] CR2: 0000000000000000 > [ 5.017606] ---[ end trace 0000000000000000 ]--- > [ 5.018957] RIP: 0010:begin_new_exec+0x8f2/0xa30 > [ 5.020299] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89 > ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea > fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 > 48 fb > [ 5.024765] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246 > [ 5.026150] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000 > [ 5.027916] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718 > [ 5.029714] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff > [ 5.031461] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000 > [ 5.033186] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000 > [ 5.034908] FS: 0000000000000000(0000) GS:ffff994763f00000(0000) > knlGS:0000000000000000 > [ 5.036907] CS: 0010 DS: 0000 ES: 0000 CR0: 
0000000080050033 > [ 5.038341] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0 > [ 5.040044] Kernel panic - not syncing: Fatal exception > [ 5.040647] Kernel Offset: 0x22e00000 from 0xffffffff81000000 > (relocation range: 0xffffffff80000000-0xffffffffbfffffff) > > It looks that aliasing a structure from another namespace is a no-go, > since the patch (attached, slightly changed your patch) without > __percpu_seg_override decorations bootstraps OK. The working patch > (without __percpu_seg_override) is not effective (no effect in > syscall_exit_work) and increases the number of current_task reads from > 3841 to 4711. Forgot to say that the "nonworking" patch reduces the number of current_task reads to 3221. +#ifdef CONFIG_USE_X86_SEG_SUPPORT +static __always_inline struct task_struct *get_current(void) +{ + return this_cpu_read_const(const_pcpu_hot.current_task); +} FWIW, const_pcpu_hot is in __seg_gs space when decorated with __percpu_seg_override, so plain "return const_pcpu_hot.current_task;" would work here, too. Uros. ^ permalink raw reply [flat|nested] 106+ messages in thread