* Candidate Linux ABI for Intel AMX and hypothetical new related features @ 2021-03-26 23:12 Andy Lutomirski 2021-03-26 23:18 ` Andy Lutomirski ` (2 more replies) 0 siblings, 3 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-03-26 23:12 UTC (permalink / raw) To: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer

Hi all-

After some discussion on IRC, I have a proposal for a Linux ABI for using Intel AMX and other similar features. It works like this:

First, we make XCR0 dynamic. This looks a lot like Keno's patch but with a different API, outlined below. Different tasks can have different XCR0 values. The default XCR0 for new tasks does not include big features like AMX. XMM and YMM are still there. The AVX2 states are debatable -- see below.

To detect features and control XCR0, we add some new arch_prctls:

arch_prctl(ARCH_GET_XCR0_SUPPORT, 0, ...);

returns the set of XCR0 bits supported on the current kernel.

arch_prctl(ARCH_GET_XCR0_LAZY_SUPPORT, 0, ...);

returns 0. See below.

arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states, sigclear_states, 0);

Sets xcr0. All states are preallocated except that states in lazy_states may be unallocated in the kernel until used. (Not supported at all in v1. lazy_states & ~xcr0 != 0 is illegal.) States in sigsave_states are saved in the signal frame. States in sigclear_states are reset to the init state on signal delivery. States in sigsave_states are restored by sigreturn, and states not in sigsave_states are left alone by sigreturn.

Optionally we do not support PKRU at all in XCR0 -- it doesn't make that much sense as an XSAVE feature, and I'm not convinced that trying to correctly context switch XINUSE[PKRU] is worthwhile. I doubt we get it right today.

Optionally we come up with a new format for new features in the signal frame, since the current format is showing its age. Taking 8kB for a signal with AMX is one thing. Taking another 8kB for a nested signal if AMX is not in use is worse.

Optionally we make AVX-512 also default off, which fixes what is arguably a serious ABI break with AVX-512: lots of programs, following POSIX (!), seem to think that they know how much space to allocate for sigaltstack(). AVX-512 is too big.

Thoughts?

--Andy

^ permalink raw reply [flat|nested] 130+ messages in thread
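The argument rules in the proposal above can be sketched as a plain C check. This is only an illustration, not a real system call: the bit positions are the architectural XCR0 ones, but the macro names, `struct set_xcr0_args` and `check_set_xcr0_args()` are hypothetical. The only rule taken from the proposal is that `lazy_states & ~xcr0` must be zero; requiring the two signal masks to be subsets of `xcr0` is an assumption added here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural XCR0 state-component bits; the macro names are illustrative. */
#define XFEATURE_X87    (1ULL << 0)
#define XFEATURE_SSE    (1ULL << 1)
#define XFEATURE_AVX    (1ULL << 2)
#define XFEATURE_AVX512 (7ULL << 5)   /* opmask, ZMM_Hi256, Hi16_ZMM */
#define XFEATURE_PKRU   (1ULL << 9)
#define XFEATURE_AMX    (3ULL << 17)  /* XTILECFG, XTILEDATA */

/* Hypothetical bundle of the arguments to the proposed ARCH_SET_XCR0. */
struct set_xcr0_args {
    uint64_t xcr0;
    uint64_t lazy_states;
    uint64_t sigsave_states;
    uint64_t sigclear_states;
};

/* Returns true if the argument combination looks acceptable. */
static bool check_set_xcr0_args(const struct set_xcr0_args *a)
{
    if (!(a->xcr0 & XFEATURE_X87))
        return false;               /* hardware requires XCR0 bit 0 to be set */
    if (a->lazy_states & ~a->xcr0)
        return false;               /* rule stated in the proposal */
    if (a->sigsave_states & ~a->xcr0)
        return false;               /* assumption: cannot save a disabled state */
    if (a->sigclear_states & ~a->xcr0)
        return false;               /* assumption: cannot clear a disabled state */
    return true;
}
```

A caller that wants AMX lazily allocated, cleared on signal delivery and kept out of the signal frame would pass `XFEATURE_AMX` in `lazy_states` and `sigclear_states` and leave it out of `sigsave_states`.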
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-26 23:12 Candidate Linux ABI for Intel AMX and hypothetical new related features Andy Lutomirski @ 2021-03-26 23:18 ` Andy Lutomirski 2021-03-27 3:39 ` Len Brown 2021-03-28 0:53 ` Thomas Gleixner 2021-03-31 8:24 ` Borislav Petkov [not found] ` <87lf9nk2ku.fsf@oldenburg.str.redhat.com> 2 siblings, 2 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-03-26 23:18 UTC (permalink / raw) To: Andy Lutomirski Cc: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API Sigh, cc linux-api, not linux-abi. On Fri, Mar 26, 2021 at 4:12 PM Andy Lutomirski <luto@kernel.org> wrote: > > Hi all- > > After some discussion on IRC, I have a proposal for a Linux ABI for > using Intel AMX and other similar features. It works like this: > > First, we make XCR0 dynamic. This looks a lot like Keno's patch but > with a different API, outlined below. Different tasks can have > different XCR0 values. The default XCR0 for new tasks does not > include big features like AMX. XMM and YMM are still there. The AVX2 > states are debatable -- see below. > > To detect features and control XCR0, we add some new arch_prctls: > > arch_prctl(ARCH_GET_XCR0_SUPPORT, 0, ...); > > returns the set of XCR0 bits supported on the current kernel. > > arch_prctl(ARCH_GET_XCR0_LAZY_SUPPORT, 0, ...); > > returns 0. See below. > > arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states, > sigclear_states, 0); > > Sets xcr0. All states are preallocated except that states in > lazy_states may be unallocated in the kernel until used. (Not > supported at all in v1. lazy_states & ~xcr0 != 0 is illegal.) States > in sigsave_states are saved in the signal frame. States in > sigclear_states are reset to the init state on signal delivery. > States in sigsave_states are restored by sigreturn, and states not in > sigsave_states are left alone by sigreturn. 
> > Optionally we do not support PKRU at all in XCR0 -- it doesn't make > that much sense as an XSAVE feature, and I'm not convinced that trying > to correctly context switch XINUSE[PKRU] is worthwhile. I doubt we > get it right today. > > Optionally we come up with a new format for new features in the signal > frame, since the current format is showing its age. Taking 8kB for a > signal with AMX is one thing. Taking another 8kB for a nested signal > if AMX is not in use is worse. > > Optionally we make AVX-512 also default off, which fixes what is > arguably a serious ABI break with AVX-512: lots of programs, following > POSIX (!), seem to think that they know much much space to allocate > for sigaltstack(). AVX-512 is too big. > > Thoughts? > > --Andy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-26 23:18 ` Andy Lutomirski @ 2021-03-27 3:39 ` Len Brown 2021-03-27 9:14 ` Borislav Petkov 2021-03-27 9:58 ` Greg KH 2021-03-28 0:53 ` Thomas Gleixner 1 sibling, 2 replies; 130+ messages in thread From: Len Brown @ 2021-03-27 3:39 UTC (permalink / raw) To: Andy Lutomirski Cc: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API Hi Andy, Say a mainline links with a math library that uses AMX without the knowledge of the mainline. Say the mainline is also linked with a userspace threading library that thinks it has a concept of XSAVE area size. Wouldn't the change in XCR0, resulting in XSAVE size change, risk confusing the threading library? thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-27 3:39 ` Len Brown @ 2021-03-27 9:14 ` Borislav Petkov 2021-03-27 9:58 ` Greg KH 1 sibling, 0 replies; 130+ messages in thread From: Borislav Petkov @ 2021-03-27 9:14 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote: > Say a mainline links with a math library that uses AMX without the > knowledge of the mainline. What is a "mainline"? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-27 3:39 ` Len Brown 2021-03-27 9:14 ` Borislav Petkov @ 2021-03-27 9:58 ` Greg KH 2021-03-29 15:47 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Greg KH @ 2021-03-27 9:58 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote: > Hi Andy, > > Say a mainline links with a math library that uses AMX without the > knowledge of the mainline. What does this mean? What happened to the context here? > Say the mainline is also linked with a userspace threading library > that thinks it has a concept of XSAVE area size. How can the kernel (what I think you mean by "mainline" here) be linked with a userspace library at all? > Wouldn't the change in XCR0, resulting in XSAVE size change, risk > confusing the threading library? Shouldn't that be the job of the kernel and not userspace? totally confused, greg k-h ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-27 9:58 ` Greg KH @ 2021-03-29 15:47 ` Len Brown 2021-03-29 16:38 ` Len Brown 2021-03-29 18:16 ` Andy Lutomirski 0 siblings, 2 replies; 130+ messages in thread From: Len Brown @ 2021-03-29 15:47 UTC (permalink / raw) To: Greg KH Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Sat, Mar 27, 2021 at 5:58 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote: > > Hi Andy, > > > > Say a mainline links with a math library that uses AMX without the > > knowledge of the mainline. sorry for the confusion. mainline = main(). ie. the part of the program written by you, and not the library you linked with. In particular, the library may use instructions that main() doesn't know exist. -- Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-29 15:47 ` Len Brown @ 2021-03-29 16:38 ` Len Brown 2021-03-29 16:48 ` Florian Weimer 2021-03-29 18:14 ` Andy Lutomirski 2021-03-29 18:16 ` Andy Lutomirski 1 sibling, 2 replies; 130+ messages in thread From: Len Brown @ 2021-03-29 16:38 UTC (permalink / raw) To: Greg KH Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

> In particular, the library may use instructions that main() doesn't know exist.

And so I'll ask my question another way.

How is it okay to change the value of XCR0 during the run time of a program?

I submit that it is not, and that is a deal-killer for a request/release API.

eg. main() doesn't know that the math library wants to use AMX, and neither does the threading library. So main() doesn't know to call the API before either library is invoked. The threading library starts up and creates user-space threads based on the initial value from XCR0. Then the math library calls the API, which adds bits to XCR0, and then the user-space context switch in the threading library corrupts data because the new XCR0 size doesn't match the initial size.

-Len

^ permalink raw reply [flat|nested] 130+ messages in thread
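The size mismatch described above can be illustrated with a small C sketch that computes the uncompacted XSAVE area size for a given XCR0 value. The offsets and sizes in the table are the values commonly reported by CPUID leaf 0xD on Intel parts and are an assumption here; real code must query CPUID rather than hard-code them. The names `typical_layout` and `uncompacted_xsave_size()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One state component in the uncompacted XSAVE layout. */
struct xcomp {
    unsigned bit;      /* XCR0 bit number */
    uint32_t offset;   /* byte offset in the uncompacted area */
    uint32_t size;     /* byte size of the component */
};

/* Assumed typical values; the authoritative source is CPUID leaf 0xD. */
static const struct xcomp typical_layout[] = {
    {  2,  576,  256 },  /* AVX: upper halves of YMM0-15 */
    {  3,  960,   64 },  /* MPX BNDREGS */
    {  4, 1024,   64 },  /* MPX BNDCSR */
    {  5, 1088,   64 },  /* AVX-512 opmask */
    {  6, 1152,  512 },  /* AVX-512 ZMM_Hi256 */
    {  7, 1664, 1024 },  /* AVX-512 Hi16_ZMM */
    {  9, 2688,    8 },  /* PKRU */
    { 17, 2752,   64 },  /* AMX XTILECFG */
    { 18, 2816, 8192 },  /* AMX XTILEDATA */
};

/* Size of the uncompacted area: it is set by the highest enabled
 * component in XCR0, not by which registers are actually in use. */
static uint32_t uncompacted_xsave_size(uint64_t xcr0)
{
    uint32_t size = 512 + 64;  /* legacy x87/SSE region plus XSAVE header */
    size_t n = sizeof(typical_layout) / sizeof(typical_layout[0]);

    for (size_t i = 0; i < n; i++) {
        const struct xcomp *c = &typical_layout[i];
        if ((xcr0 & (1ULL << c->bit)) && c->offset + c->size > size)
            size = c->offset + c->size;
    }
    return size;
}
```

Under these assumed values, a buffer sized for an XCR0 of 0xe7 (x87, SSE, AVX, AVX-512) is far smaller than what XSAVE writes once the two AMX bits are added, which is the corruption scenario in the message above.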
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-29 16:38 ` Len Brown @ 2021-03-29 16:48 ` Florian Weimer 2021-03-29 18:14 ` Andy Lutomirski 1 sibling, 0 replies; 130+ messages in thread From: Florian Weimer @ 2021-03-29 16:48 UTC (permalink / raw) To: Len Brown via Libc-alpha Cc: Greg KH, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Dave Hansen, Kyle Huey, Andy Lutomirski, Keno Fischer * Len Brown via Libc-alpha: >> In particular, the library may use instructions that main() doesn't know exist. > > And so I'll ask my question another way. > > How is it okay to change the value of XCR0 during the run time of a > program? > > I submit that it is not, and that is a deal-killer for a > request/release API. > > eg. main() doesn't know that the math library wants to use AMX, and > neither does the threading library. So main() doesn't know to call > the API before either library is invoked. The threading library > starts up and creates user-space threads based on the initial value > from XCR0. Then the math library calls the API, which adds bits to > XCRO, and then the user-space context switch in the threading > library corrupts data because the new XCR0 size doesn't match the > initial size. I agree that this doesn't quite work. (Today, it's not the thread library, but the glibc dynamic loader trampoline.) I disagree that CPU feature enablement has been a failure. I think we are pretty good at enabling new CPU features on older operating systems, not just bleeding edge mainline kernels. Part of that is that anything but the kernel stays out of the way, and most features are available directly via inline assembly (you can even use .byte hacks if you want). There is no need to switch to new userspace libraries, compile out-of-tree kernel drivers that have specific firmware requirements, and so on. 
If the operations that need a huge context can be made idempotent, with periodic checkpoints, it might be possible to avoid saving the context completely by some rseq-like construct. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-29 16:38 ` Len Brown 2021-03-29 16:48 ` Florian Weimer @ 2021-03-29 18:14 ` Andy Lutomirski 1 sibling, 0 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-03-29 18:14 UTC (permalink / raw) To: Len Brown Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API > On Mar 29, 2021, at 9:39 AM, Len Brown <lenb@kernel.org> wrote: > > >> >> In particular, the library may use instructions that main() doesn't know exist. > > And so I'll ask my question another way. > > How is it okay to change the value of XCR0 during the run time of a program? > > I submit that it is not, and that is a deal-killer for a request/release API. > > eg. main() doesn't know that the math library wants to use AMX, > and neither does the threading library. So main() doesn't know to > call the API before either library is invoked. The threading library starts up > and creates user-space threads based on the initial value from XCR0. > Then the math library calls the API, which adds bits to XCRO, > and then the user-space context switch in the threading library corrupts data > because the new XCR0 size doesn't match the initial size. > In the most extreme case, userspace could require that every loaded DSO be tagged with a new ELF note indicating support for dynamic XCR0 before changing XCR0. I would like to remind everyone that kernel enablement of AVX512 *already* broke old userspace. AMX will further break something. At least with dynamic XCR0 we can make the breakage opt-in. The ISA could have helped here by allowing the non-compacted XSTATE format to be frozen even in the face of changing XCR0. But it didn’t. At the end of the day, we are faced with the fact that XSTATE is a poor design, and we have to make the best of it. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-29 15:47 ` Len Brown 2021-03-29 16:38 ` Len Brown @ 2021-03-29 18:16 ` Andy Lutomirski 2021-03-29 22:38 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-03-29 18:16 UTC (permalink / raw) To: Len Brown Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API > On Mar 29, 2021, at 8:47 AM, Len Brown <lenb@kernel.org> wrote: > > On Sat, Mar 27, 2021 at 5:58 AM Greg KH <gregkh@linuxfoundation.org> wrote: >>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote: >>> Hi Andy, >>> Say a mainline links with a math library that uses AMX without the >>> knowledge of the mainline. > > sorry for the confusion. > > mainline = main(). > > ie. the part of the program written by you, and not the library you linked with. > > In particular, the library may use instructions that main() doesn't know exist. If we pretend for a bit that AMX were a separate device instead of a part of the CPU, this would be a no brainer: something would be responsible for opening a device node or otherwise requesting access to the device. Real AMX isn’t so different. Programs acquire access either by syscall or by a fault, they use it, and (hopefully) they release it again using TILERELEASE. The only thing special about it is that, supposedly, acquiring and releasing access (at least after the first time) is quite fast. But holding access is *not* free — despite all your assertions to the contrary, the kernel *will* correctly context switch it to avoid blowing up power consumption, and this will have overhead. We’ve seen the pattern of programs thinking that, just because something is a CPU insn, it’s free and no thought is needed before using it. This happened with AVX and AVX512, and it will happen again with AMX. 
We *still* have a significant performance regression in the kernel due to screwing up the AVX state machine, and the only way I know about any of the details is that I wrote silly test programs to try to reverse engineer the nonsensical behavior of the CPUs.

I might believe that Intel has figured out how to make a well behaved XSTATE feature after Intel demonstrates at least once that it’s possible. That means full documentation of all the weird issues, no new special cases, and the feature actually making sense in the context of XSTATE. This has not happened. Let’s list all of them:

- SSE. Look for all the MXCSR special cases in the pseudocode and tell me with a straight face that this one works sensibly.

- AVX. Also has special cases in the pseudocode. And has transition issues that are still problems and still not fully documented.

- AVX2. Horrible undocumented performance issues. Otherwise maybe okay?

- MPX: maybe the best example, but the compat mode part got flubbed and it’s MPX.

- PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was a major mistake, now unfixable, that seriously limits the usefulness of the whole feature. I suppose Intel could release PKRU2 with a better ISA and deprecate the original PKRU, but I’m not holding my breath.)

- AVX512: Yet more uarch-dependent horrible performance issues, and Intel has still not responded about documentation. The web is full of people speculating differently about when, exactly, using AVX512 breaks performance. This is NAKked in kernel until docs arrive. Also, it broke old user programs. If we had noticed a few years ago, AVX512 enablement would have been reverted.

- AMX: This mess.

The current system of automatic user enablement does not work. We need something better.

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-29 18:16 ` Andy Lutomirski @ 2021-03-29 22:38 ` Len Brown 2021-03-30 5:08 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-03-29 22:38 UTC (permalink / raw) To: Andy Lutomirski Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > > On Mar 29, 2021, at 8:47 AM, Len Brown <lenb@kernel.org> wrote: > > > > On Sat, Mar 27, 2021 at 5:58 AM Greg KH <gregkh@linuxfoundation.org> wrote: > >>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote: > >>> Hi Andy, > >>> Say a mainline links with a math library that uses AMX without the > >>> knowledge of the mainline. > > > > sorry for the confusion. > > > > mainline = main(). > > > > ie. the part of the program written by you, and not the library you linked with. > > > > In particular, the library may use instructions that main() doesn't know exist. > > If we pretend for a bit that AMX were a separate device instead of a part of the CPU, this would be a no brainer: something would be responsible for opening a device node or otherwise requesting access to the device. > > Real AMX isn’t so different. Programs acquire access either by syscall or by a fault, they use it, and (hopefully) they release it again using TILERELEASE. The only thing special about it is that, supposedly, acquiring and releasing access (at least after the first time) is quite fast. But holding access is *not* free — despite all your assertions to the contrary, the kernel *will* correctly context switch it to avoid blowing up power consumption, and this will have overhead. > > We’ve seen the pattern of programs thinking that, just because something is a CPU insn, it’s free and no thought is needed before using it. 
This happened with AVX and AVX512, and it will happen again with AMX. We *still* have a significant performance regression in the kernel due to screwing up the AVX state machine, and the only way I know about any of the details is that I wrote silly test programs to try to reverse engineer the nonsensical behavior of the CPUs. > > I might believe that Intel has figured out how to make a well behaved XSTATE feature after Intel demonstrates at least once that it’s possible. That means full documentation of all the weird issues, no new special cases, and the feature actually making sense in the context of XSTATE. This has not happened. Let’s list all of them: > > - SSE. Look for all the MXCSR special cases in the pseudocode and tell me with a straight face that this one works sensibly. > > - AVX. Also has special cases in the pseudocode. And has transition issues that are still problems and still not fully documented. L > > - AVX2. Horrible undocumented performance issues. Otherwise maybe okay? > > - MPX: maybe the best example, but the compat mode part got flubbed and it’s MPX. > > - PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was a major mistake, now unfixable, that seriously limits the usefulness of the whole feature. I suppose Intel could release PKRU2 with a better ISA and deprecate the original PKRU, but I’m not holding my breath.) > > - AVX512: Yet more uarch-dependent horrible performance issues, and Intel has still not responded about documentation. The web is full of people speculating differently about when, exactly, using AVX512 breaks performance. This is NAKked in kernel until docs arrive. Also, it broke old user programs. If we had noticed a few years ago, AVX512 enablement would have been reverted. > > - AMX: This mess. > > The current system of automatic user enablement does not work. We need something better.

Hi Andy,

Can you provide a concise definition of the exact problem(s) this thread is attempting to address?

Thanks ahead of time for excluding "blow up power consumption", since that paranoia is not grounded in fact.

thanks,
-Len

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-29 22:38 ` Len Brown @ 2021-03-30 5:08 ` Andy Lutomirski 2021-03-30 5:50 ` Noah Goldstein 2021-03-30 17:01 ` Len Brown 0 siblings, 2 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-03-30 5:08 UTC (permalink / raw) To: Len Brown Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Mon, Mar 29, 2021 at 3:38 PM Len Brown <lenb@kernel.org> wrote:
>
> On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> Hi Andy,
>
> Can you provide a concise definition of the exact problemI(s) this thread
> is attempting to address?

The AVX-512 state, all by itself, is more than 2048 bytes. Quoting the POSIX sigaltstack page (man 3p sigaltstack):

    The value SIGSTKSZ is a system default specifying the number of bytes
    that would be used to cover the usual case when manually allocating an
    alternate stack area. The value MINSIGSTKSZ is defined to be the minimum
    stack size for a signal handler. In computing an alternate stack size, a
    program should add that amount to its stack requirements to allow for
    the system implementation overhead. The constants SS_ONSTACK,
    SS_DISABLE, SIGSTKSZ, and MINSIGSTKSZ are defined in <signal.h>.

arch/x86/include/uapi/asm/signal.h:#define MINSIGSTKSZ 2048
arch/x86/include/uapi/asm/signal.h:#define SIGSTKSZ 8192

Regrettably, the Linux signal frame format is the uncompacted format and, also regrettably, the uncompacted format has the nasty property that its format depends on XCR0 but not on the set of registers that are actually used or wanted, so, with the current ABI, the signal frame is stuck being quite large for all programs on a machine that supports avx512 and has it enabled by the kernel. And it's even larger for AMX and violates SIGSTKSZ as well as MINSIGSTKSZ.

There are apparently real programs that break as a result.
We need to find a way to handle new, large extended states without breaking user ABI. We should also find a way to handle them without consuming silly amounts of stack space for programs that don't use them. Sadly, if the solution we settle on involves context switching XCR0, performance on first-generation hardware will suffer because VMX does not have any way to allow guests to write XCR0 without exiting. I don't consider this to be a showstopper -- if we end up having this problem, fixing it in subsequent CPUs is straightforward. > > Thank ahead-of-time for excluding "blow up power consumption", > since that paranoia is not grounded in fact. > I will gladly exclude power consumption from this discussion, since that's a separate issue that has nothing to do with the user<->kernel ABI. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 5:08 ` Andy Lutomirski @ 2021-03-30 5:50 ` Noah Goldstein 2021-03-30 17:01 ` Len Brown 1 sibling, 0 replies; 130+ messages in thread From: Noah Goldstein @ 2021-03-30 5:50 UTC (permalink / raw) To: Andy Lutomirski Cc: Len Brown, Florian Weimer, Rich Felker, libc-alpha, Greg KH, Bae, Chang Seok, X86 ML, LKML, Dave Hansen, Kyle Huey, Linux API, Keno Fischer

Forgive me if this is silly, but would it be possible to do something similar to rseq, where the user can register a set of features for a program counter region and then, on interrupt, the kernel checks that to determine what needs to be saved? For example, if a user doesn't use any AMX but loads a library that does, then for every ip in the user's code AMX state won't be saved, but an interrupt in the ip range of the library will save AMX state.

One advantage is that it would be pretty easy to do this right silently with compiler support, and to preserve old code, the "ip not found in table" case could default to the worst case the CPU supports.

On Tue, Mar 30, 2021 at 1:09 AM Andy Lutomirski via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Mar 29, 2021 at 3:38 PM Len Brown <lenb@kernel.org> wrote: > > > > On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > > > > Hi Andy, > > > > Can you provide a concise definition of the exact problemI(s) this thread > > is attempting to address? > > The AVX-512 state, all by itself, is more than 2048 bytes. Quoting > the POSIX sigaltstack page (man 3p sigaltstack): > > The value SIGSTKSZ is a system default specifying the number of bytes > that would be used to cover the usual case when manually allocating an > alternate stack area. The value MINSIGSTKSZ is defined to be the mini‐ > mum stack size for a signal handler. In computing an alternate stack > size, a program should add that amount to its stack requirements to al‐ > low for the system implementation overhead.
The constants SS_ONSTACK, > SS_DISABLE, SIGSTKSZ, and MINSIGSTKSZ are defined in <signal.h>. > > arch/x86/include/uapi/asm/signal.h:#define MINSIGSTKSZ 2048 > arch/x86/include/uapi/asm/signal.h:#define SIGSTKSZ 8192 > > Regrettably, the Linux signal frame format is the uncompacted format > and, also regrettably, the uncompacted format has the nasty property > that its format depends on XCR0 but not on the set of registers that > are actually used or wanted, so, with the current ABI, the signal > frame is stuck being quite large for all programs on a machine that > supports avx512 and has it enabled by the kernel. And it's even > larger for AMX and violates SIGSTKSZ as well as MINSTKSZ. > > There are apparently real programs that break as a result. We need to > find a way to handle new, large extended states without breaking user > ABI. We should also find a way to handle them without consuming silly > amounts of stack space for programs that don't use them. > > Sadly, if the solution we settle on involves context switching XCR0, > performance on first-generation hardware will suffer because VMX does > not have any way to allow guests to write XCR0 without exiting. I > don't consider this to be a showstopper -- if we end up having this > problem, fixing it in subsequent CPUs is straightforward. > > > > > Thank ahead-of-time for excluding "blow up power consumption", > > since that paranoia is not grounded in fact. > > > > I will gladly exclude power consumption from this discussion, since > that's a separate issue that has nothing to do with the user<->kernel > ABI. ^ permalink raw reply [flat|nested] 130+ messages in thread
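The lookup in the suggestion above can be sketched in C as a table of program-counter ranges, each tagged with the set of state components that must be saved when an interrupt lands inside it. Everything here is hypothetical: `struct ip_region`, `features_for_ip()` and the registration model are illustrations of the idea, not an existing kernel or rseq interface.

```c
#include <stddef.h>
#include <stdint.h>

/* One registered code region, covering addresses in [start, end). */
struct ip_region {
    uintptr_t start;
    uintptr_t end;
    uint64_t features;   /* XCR0-style mask of states to save */
};

/* Returns the feature mask to save for an interrupted ip.
 * An ip outside every registered region gets the worst case,
 * which keeps old, unregistered code safe. */
static uint64_t features_for_ip(const struct ip_region *table, size_t n,
                                uintptr_t ip, uint64_t worst_case)
{
    for (size_t i = 0; i < n; i++) {
        if (ip >= table[i].start && ip < table[i].end)
            return table[i].features;
    }
    return worst_case;
}
```

A linear scan keeps the sketch short; a sorted table with binary search would be the natural choice if many regions were registered.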
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 5:08 ` Andy Lutomirski 2021-03-30 5:50 ` Noah Goldstein @ 2021-03-30 17:01 ` Len Brown 2021-03-30 17:05 ` Andy Lutomirski 1 sibling, 1 reply; 130+ messages in thread From: Len Brown @ 2021-03-30 17:01 UTC (permalink / raw) To: Andy Lutomirski Cc: Greg KH, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

Andy,

I agree, completely, with your description of the challenge, thank you for focusing the discussion on that problem statement.

Question:

Is it required (by the "ABI") that a user program has everything on the stack for user-space XSAVE/XRSTOR to get back to the state of the program just before receiving the signal?

My understanding is that there are programs that do this. However, if it is not guaranteed to work, that could greatly simplify what we are required to put on the signal stack.

thanks,
-Len

^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 17:01 ` Len Brown @ 2021-03-30 17:05 ` Andy Lutomirski 2021-03-30 17:56 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-03-30 17:05 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Greg KH, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API > On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote: > > Andy, > > I agree, completely, with your description of the challenge, > thank you for focusing the discussion on that problem statement. > > Question: > > Is it required (by the "ABI") that a user program has everything > on the stack for user-space XSAVE/XRESTOR to get back > to the state of the program just before receiving the signal? The current Linux signal frame format has XSTATE in uncompacted format, so everything has to be there. Maybe we could have an opt in new signal frame format, but the details would need to be worked out. It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code, and return, without corrupting register contents. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 17:05 ` Andy Lutomirski @ 2021-03-30 17:56 ` Len Brown 2021-03-30 19:12 ` Dave Hansen 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-03-30 17:56 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Greg KH, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote: > > On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote: > > Is it required (by the "ABI") that a user program has everything > > on the stack for user-space XSAVE/XRESTOR to get back > > to the state of the program just before receiving the signal? > > The current Linux signal frame format has XSTATE in uncompacted format, > so everything has to be there. > Maybe we could have an opt in new signal frame format, but the details would need to be worked out. > > It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code, > and return, without corrupting register contents.

And so an acknowledgement:

We can't change the legacy signal stack format without breaking existing programs. The legacy format is uncompacted XSTATE. It is a complete set of architectural state -- everything necessary to XRSTOR. Further, the sigreturn flow allows the signal handler to *change* any of that state, so that it becomes active upon return from signal.

And a proposal:

Future programs, which know that they don't need the full-blown legacy signal stack format, can opt in to a new format. That new format can be minimal (fast) by default. Perhaps, as Noah suggests, it could have some sort of mechanism where the program can explicitly select which state components they would want included on their signal stack, and restored by sigreturn.
If the new fast-signal format is successful, then in a number of years
it will have spread and taken over the world.

Thoughts?

Len Brown, Intel Open Source Technology Center
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 17:56 ` Len Brown @ 2021-03-30 19:12 ` Dave Hansen 2021-03-30 20:20 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Dave Hansen @ 2021-03-30 19:12 UTC (permalink / raw) To: Len Brown, Andy Lutomirski Cc: Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On 3/30/21 10:56 AM, Len Brown wrote: > On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote: >>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote: >>> Is it required (by the "ABI") that a user program has everything >>> on the stack for user-space XSAVE/XRESTOR to get back >>> to the state of the program just before receiving the signal? >> The current Linux signal frame format has XSTATE in uncompacted format, >> so everything has to be there. >> Maybe we could have an opt in new signal frame format, but the details would need to be worked out. >> >> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code, >> and return, without corrupting register contents. > And so an an acknowledgement: > > We can't change the legacy signal stack format without breaking > existing programs. The legacy is uncompressed XSTATE. It is a > complete set of architectural state -- everything necessary to > XRESTOR. Further, the sigreturn flow allows the signal handler to > *change* any of that state, so that it becomes active upon return from > signal. One nit with this: XRSTOR itself can work with the compacted format or uncompacted format. Unlike the XSAVE/XSAVEC side where compaction is explicit from the instruction itself, XRSTOR changes its behavior by reading XCOMP_BV. There's no XRSTORC. The issue with using the compacted format is when legacy software in the signal handler needs to go access the state. 
*That* is what can't handle a change in the XSAVE buffer format (either optimized/XSAVEOPT, or compacted/XSAVEC). ^ permalink raw reply [flat|nested] 130+ messages in thread
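To illustrate the point above that XRSTOR picks its behavior by reading XCOMP_BV, here is a minimal sketch of how software could tell the two layouts apart before touching an XSAVE image. It assumes the layout documented in the Intel SDM (512-byte legacy region, then a 64-byte header with XCOMP_BV in header bytes 8-15, bit 63 marking the compacted format). The macro and function names are invented for this example and are not kernel or libc API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Layout per the Intel SDM: a 512-byte legacy region, then a 64-byte
 * XSAVE header holding XSTATE_BV (bytes 0-7) and XCOMP_BV (bytes 8-15). */
#define XSAVE_HDR_OFFSET   512u
#define XCOMP_BV_OFFSET    (XSAVE_HDR_OFFSET + 8u)
#define XCOMP_BV_COMPACTED (UINT64_C(1) << 63)

/* Hypothetical helper: true if the image uses the compacted format,
 * i.e. the format written by XSAVEC rather than plain XSAVE. */
static bool xsave_image_is_compacted(const unsigned char *xsave_area)
{
    uint64_t xcomp_bv;

    assert(xsave_area != NULL);
    /* memcpy avoids alignment and aliasing problems on a raw byte buffer. */
    memcpy(&xcomp_bv, xsave_area + XCOMP_BV_OFFSET, sizeof(xcomp_bv));
    return (xcomp_bv & XCOMP_BV_COMPACTED) != 0;
}
```

Legacy code that indexes into the buffer at fixed, uncompacted offsets performs no such check, which is why a silent change of format in the signal frame would break it.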
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 19:12 ` Dave Hansen @ 2021-03-30 20:20 ` Andy Lutomirski 2021-03-30 20:42 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-03-30 20:20 UTC (permalink / raw) To: Dave Hansen Cc: Len Brown, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API > On Mar 30, 2021, at 12:12 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 3/30/21 10:56 AM, Len Brown wrote: >> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote: >>>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote: >>>> Is it required (by the "ABI") that a user program has everything >>>> on the stack for user-space XSAVE/XRESTOR to get back >>>> to the state of the program just before receiving the signal? >>> The current Linux signal frame format has XSTATE in uncompacted format, >>> so everything has to be there. >>> Maybe we could have an opt in new signal frame format, but the details would need to be worked out. >>> >>> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code, >>> and return, without corrupting register contents. >> And so an an acknowledgement: >> >> We can't change the legacy signal stack format without breaking >> existing programs. The legacy is uncompressed XSTATE. It is a >> complete set of architectural state -- everything necessary to >> XRESTOR. Further, the sigreturn flow allows the signal handler to >> *change* any of that state, so that it becomes active upon return from >> signal. > > One nit with this: XRSTOR itself can work with the compacted format or > uncompacted format. Unlike the XSAVE/XSAVEC side where compaction is > explicit from the instruction itself, XRSTOR changes its behavior by > reading XCOMP_BV. There's no XRSTORC. 
> > The issue with using the compacted format is when legacy software in the > signal handler needs to go access the state. *That* is what can't > handle a change in the XSAVE buffer format (either optimized/XSAVEOPT, > or compacted/XSAVEC). The compacted format isn’t compact enough anyway. If we want to keep AMX and AVX512 enabled in XCR0 then we need to further muck with the format to omit the not-in-use features. I *think* we can pull this off in a way that still does the right thing wrt XRSTOR. If we go this route, I think we want a way for sigreturn to understand a pointer to the state instead of inline state to allow programs to change the state. Or maybe just to have a way to ask sigreturn to skip the restore entirely. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 20:20 ` Andy Lutomirski @ 2021-03-30 20:42 ` Len Brown 2021-03-30 22:01 ` David Laight 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-03-30 20:42 UTC (permalink / raw) To: Andy Lutomirski Cc: Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Tue, Mar 30, 2021 at 4:20 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > > On Mar 30, 2021, at 12:12 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > > > On 3/30/21 10:56 AM, Len Brown wrote: > >> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote: > >>>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote: > >>>> Is it required (by the "ABI") that a user program has everything > >>>> on the stack for user-space XSAVE/XRESTOR to get back > >>>> to the state of the program just before receiving the signal? > >>> The current Linux signal frame format has XSTATE in uncompacted format, > >>> so everything has to be there. > >>> Maybe we could have an opt in new signal frame format, but the details would need to be worked out. > >>> > >>> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code, > >>> and return, without corrupting register contents. > >> And so an an acknowledgement: > >> > >> We can't change the legacy signal stack format without breaking > >> existing programs. The legacy is uncompressed XSTATE. It is a > >> complete set of architectural state -- everything necessary to > >> XRESTOR. Further, the sigreturn flow allows the signal handler to > >> *change* any of that state, so that it becomes active upon return from > >> signal. > > > > One nit with this: XRSTOR itself can work with the compacted format or > > uncompacted format. 
Unlike the XSAVE/XSAVEC side where compaction is > > explicit from the instruction itself, XRSTOR changes its behavior by > > reading XCOMP_BV. There's no XRSTORC. > > > > The issue with using the compacted format is when legacy software in the > > signal handler needs to go access the state. *That* is what can't > > handle a change in the XSAVE buffer format (either optimized/XSAVEOPT, > > or compacted/XSAVEC). > > The compacted format isn’t compact enough anyway. If we want to keep AMX and AVX512 enabled in XCR0 then we need to further muck with the format to omit the not-in-use features. I *think* we can pull this off in a way that still does the right thing wrt XRSTOR. Agreed. Compacted format doesn't save any space when INIT=0, so it is only a half-step forward. > If we go this route, I think we want a way for sigreturn to understand a pointer to the state instead of inline state to allow programs to change the state. Or maybe just to have a way to ask sigreturn to skip the restore entirely. The legacy approach puts all architectural state on the signal stack in XSTATE format. If we make the signal stack smaller with a new fast-signal scheme, we need to find another place for that state to live. It can't live in the task context switch buffer. If we put it there and then take an interrupt while running the signal handler, then we'd overwrite the signaled thread's state with the signal handler's state. Can we leave it in live registers? That would be the speed-of-light signal handler approach. But we'd need to teach the signal handler to not clobber it. Perhaps that could be part of the contract that a fast signal handler signs? INIT=0 AMX state could simply sit patiently in the AMX registers for the duration of the signal handler. 
You can't get any faster than doing nothing :-)

Of course, part of the contract for the fast signal handler is that it
knows it cannot rely on XRSTOR of the data on the stack to get back to
the state of the signaled thread. (Even assuming we used XSTATE format
on the fast signal handler stack, that image would omit the contents of
the AMX registers in this example.)

Len Brown, Intel Open Source Technology Center
* RE: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 20:42 ` Len Brown @ 2021-03-30 22:01 ` David Laight 2021-03-31 16:31 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: David Laight @ 2021-03-30 22:01 UTC (permalink / raw) To: 'Len Brown', Andy Lutomirski Cc: Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API From: Len Brown > Sent: 30 March 2021 21:42 > > On Tue, Mar 30, 2021 at 4:20 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > > > > > On Mar 30, 2021, at 12:12 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > > > > > On 3/30/21 10:56 AM, Len Brown wrote: > > >> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote: > > >>>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote: > > >>>> Is it required (by the "ABI") that a user program has everything > > >>>> on the stack for user-space XSAVE/XRESTOR to get back > > >>>> to the state of the program just before receiving the signal? > > >>> The current Linux signal frame format has XSTATE in uncompacted format, > > >>> so everything has to be there. > > >>> Maybe we could have an opt in new signal frame format, but the details would need to be worked > out. > > >>> > > >>> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” > code, > > >>> and return, without corrupting register contents. > > >> And so an an acknowledgement: > > >> > > >> We can't change the legacy signal stack format without breaking > > >> existing programs. The legacy is uncompressed XSTATE. It is a > > >> complete set of architectural state -- everything necessary to > > >> XRESTOR. Further, the sigreturn flow allows the signal handler to > > >> *change* any of that state, so that it becomes active upon return from > > >> signal. 
> > > > > > One nit with this: XRSTOR itself can work with the compacted format or > > > uncompacted format. Unlike the XSAVE/XSAVEC side where compaction is > > > explicit from the instruction itself, XRSTOR changes its behavior by > > > reading XCOMP_BV. There's no XRSTORC. > > > > > > The issue with using the compacted format is when legacy software in the > > > signal handler needs to go access the state. *That* is what can't > > > handle a change in the XSAVE buffer format (either optimized/XSAVEOPT, > > > or compacted/XSAVEC). > > > > The compacted format isn’t compact enough anyway. If we want to keep AMX and AVX512 enabled in XCR0 > then we need to further muck with the format to omit the not-in-use features. I *think* we can pull > this off in a way that still does the right thing wrt XRSTOR. > > Agreed. Compacted format doesn't save any space when INIT=0, so it is > only a half-step forward. > > > If we go this route, I think we want a way for sigreturn to understand a pointer to the state > instead of inline state to allow programs to change the state. Or maybe just to have a way to ask > sigreturn to skip the restore entirely. > > The legacy approach puts all architectural state on the signal stack > in XSTATE format. > > If we make the signal stack smaller with a new fast-signal scheme, we > need to find another place for that state to live. > > It can't live in the task context switch buffer. If we put it there > and then take an interrupt while running the signal handler, then we'd > overwrite the signaled thread's state with the signal handler's state. > > Can we leave it in live registers? That would be the speed-of-light > signal handler approach. But we'd need to teach the signal handler to > not clobber it. Perhaps that could be part of the contract that a > fast signal handler signs? INIT=0 AMX state could simply sit > patiently in the AMX registers for the duration of the signal handler. 
> You can't get any faster than doing nothing :-) > > Of course part of the contract for the fast signal handler is that it > knows that it can't possibly use XRESTOR of the stuff on the stack to > necessarily get back to the state of the signaled thread (assuming we > even used XSTATE format on the fast signal handler stack, it would > forget the contents of the AMX registers, in this example) gcc will just use the AVX registers for 'normal' code within the signal handler. So it has to have its own copy of all the registers. (Well, maybe you could make the TMX instructions fault, but that would need a nested signal delivered.) There is also the register save buffer that you need in order to long-jump out of a signal handler. Unfortunately that is required to work. I'm pretty sure the original setjmp/longjmp just saved the stack pointer - but that really doesn't work any more. OTOH most signal handlers don't care - but there isn't a flag to sigset() (etc) so ask for a specific register layout. I did have 'fun' changing the x86 segment registers so that the 'return to user' faulted in kernel during the last bit of the 'return to user' path - and then fixing the fallout. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-30 22:01 ` David Laight @ 2021-03-31 16:31 ` Len Brown 2021-03-31 16:53 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-03-31 16:31 UTC (permalink / raw) To: David Laight Cc: Andy Lutomirski, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Tue, Mar 30, 2021 at 6:01 PM David Laight <David.Laight@aculab.com> wrote: > > Can we leave it in live registers? That would be the speed-of-light > > signal handler approach. But we'd need to teach the signal handler to > > not clobber it. Perhaps that could be part of the contract that a > > fast signal handler signs? INIT=0 AMX state could simply sit > > patiently in the AMX registers for the duration of the signal handler. > > You can't get any faster than doing nothing :-) > > > > Of course part of the contract for the fast signal handler is that it > > knows that it can't possibly use XRESTOR of the stuff on the stack to > > necessarily get back to the state of the signaled thread (assuming we > > even used XSTATE format on the fast signal handler stack, it would > > forget the contents of the AMX registers, in this example) > > gcc will just use the AVX registers for 'normal' code within > the signal handler. > So it has to have its own copy of all the registers. > (Well, maybe you could make the TMX instructions fault, > but that would need a nested signal delivered.) This is true, by default, but it doesn't have to be true. Today, gcc has an annotation for user-level interrupts https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes An analogous annotation could be created for fast signals. gcc can be told exactly what registers and instructions it can use for that routine. 
Of course, this raises the question of what routines that handler calls,
and those would need to be constrained too.

Today signal-safety(7) advises programmers to limit what legacy signal handlers
can call. There is no reason that a fast-signal-safety(7) could not be created
for the fast path.

> There is also the register save buffer that you need in order
> to long-jump out of a signal handler.
> Unfortunately that is required to work.
> I'm pretty sure the original setjmp/longjmp just saved the stack
> pointer - but that really doesn't work any more.
>
> OTOH most signal handlers don't care - but there isn't a flag
> to sigset() (etc) to ask for a specific register layout.

Right, the idea is to optimize for *most* signal handlers,
since making any changes to *all* signal handlers is intractable.

So the idea is that opting in to a fast signal handler would opt out
of some legacy signal capabilities. Complete state is one of them,
and thus long-jump is not supported, because the complete state
may not automatically be available.

thanks,
Len Brown, Intel Open Source Technology Center
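For reference, the kind of handler that signal-safety(7) already steers programmers toward looks roughly like the sketch below: it calls nothing and only stores to a `volatile sig_atomic_t` flag. This is an illustration of today's legacy rules, not of any proposed fast-signal contract. The names `got_signal`, `on_signal` and `install_flag_handler` are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* The only state the handler touches. sig_atomic_t is the one object
 * type the C standard guarantees can be written from a handler. */
static volatile sig_atomic_t got_signal = 0;

/* Calls no functions at all, so it trivially stays within the
 * async-signal-safe set listed in signal-safety(7). */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Hypothetical helper: install on_signal for signo.
 * Returns 0 on success, -1 on failure (as sigaction does). */
static int install_flag_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

Nothing in the source of such a handler says which registers the compiler will use for it. That gap is what an annotation or a dedicated contract for fast handlers would have to close.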
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-31 16:31 ` Len Brown @ 2021-03-31 16:53 ` Andy Lutomirski 2021-03-31 21:42 ` Robert O'Callahan 2021-03-31 22:28 ` Len Brown 0 siblings, 2 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-03-31 16:53 UTC (permalink / raw) To: Len Brown Cc: David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API > On Mar 31, 2021, at 9:31 AM, Len Brown <lenb@kernel.org> wrote: > > On Tue, Mar 30, 2021 at 6:01 PM David Laight <David.Laight@aculab.com> wrote: > >>> Can we leave it in live registers? That would be the speed-of-light >>> signal handler approach. But we'd need to teach the signal handler to >>> not clobber it. Perhaps that could be part of the contract that a >>> fast signal handler signs? INIT=0 AMX state could simply sit >>> patiently in the AMX registers for the duration of the signal handler. >>> You can't get any faster than doing nothing :-) >>> >>> Of course part of the contract for the fast signal handler is that it >>> knows that it can't possibly use XRESTOR of the stuff on the stack to >>> necessarily get back to the state of the signaled thread (assuming we >>> even used XSTATE format on the fast signal handler stack, it would >>> forget the contents of the AMX registers, in this example) >> >> gcc will just use the AVX registers for 'normal' code within >> the signal handler. >> So it has to have its own copy of all the registers. >> (Well, maybe you could make the TMX instructions fault, >> but that would need a nested signal delivered.) > > This is true, by default, but it doesn't have to be true. > > Today, gcc has an annotation for user-level interrupts > https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes > > An analogous annotation could be created for fast signals. 
> gcc can be told exactly what registers and instructions it can use for > that routine. > > Of course, this begs the question about what routines that handler calls, > and that would need to be constrained too. > > Today signal-safety(7) advises programmers to limit what legacy signal handlers > can call. There is no reason that a fast-signal-safety(7) could not be created > for the fast path. > >> There is also the register save buffer that you need in order >> to long-jump out of a signal handler. >> Unfortunately that is required to work. >> I'm pretty sure the original setjmp/longjmp just saved the stack >> pointer - but that really doesn't work any more. >> >> OTOH most signal handlers don't care - but there isn't a flag >> to sigset() (etc) so ask for a specific register layout. > > Right, the idea is to optimize for *most* signal handlers, > since making any changes to *all* signal handlers is intractable. > > So the idea is that opting-in to a fast signal handler would opt-out > of some legacy signal capibilities. Complete state is one of them, > and thus long-jump is not supported, because the complete state > may not automatically be available. Long jump is probably the easiest problem of all: sigsetjmp() is a *function*, following ABI, so sigsetjmp() is expected to clobber most or all of the extended state. But this whole annotation thing will require serious compiler support. We already have problems with compilers inlining functions and getting confused about attributes. An API like: if (get_amx()) { use AMX; } else { don’t; } Avoids this problem. And making XCR0 dynamic, for all its faults, at least helps force a degree of discipline on user code. > > thanks, > Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-31 16:53 ` Andy Lutomirski @ 2021-03-31 21:42 ` Robert O'Callahan 2021-03-31 22:11 ` Len Brown 2021-03-31 22:28 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Robert O'Callahan @ 2021-03-31 21:42 UTC (permalink / raw) To: Andy Lutomirski Cc: Len Brown, David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API For the record, the benefits of dynamic XCR0 for rr recording portability still apply. I guess it'd be useful for CRIU too. We would also benefit from anything that incentivizes increased support for CPUID faulting. Rob -- Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw, draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 21:42 ` Robert O'Callahan
@ 2021-03-31 22:11 ` Len Brown
  0 siblings, 0 replies; 130+ messages in thread
From: Len Brown @ 2021-03-31 22:11 UTC
To: robert
Cc: Andy Lutomirski, David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Wed, Mar 31, 2021 at 5:42 PM Robert O'Callahan <robert@ocallahan.org> wrote:
>
> For the record, the benefits of dynamic XCR0 for rr recording
> portability still apply. I guess it'd be useful for CRIU too. We would
> also benefit from anything that incentivizes increased support for
> CPUID faulting.

As previously mentioned, today we don't have an architectural way to
trap a user into the kernel on CPUID, even though we can do this for a VMM.

But spoofing CPUID isn't a solution to all problems.
The feature really needs to be OFF to prevent users from using it,
even if the supported mechanisms of discovering that feature say "NOT PRESENT".

Today there are plenty of users who will opportunistically try everything
in the cloud and choose the machine that allows them to do something
that other machines will not -- even if it is not officially supported.

If something is not enumerated, it really needs to also be turned off.

cheers,
--Len Brown, Intel Open Source Technology Center
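The distinction drawn above between a feature being enumerated and a feature being turned on can be made concrete with a small sketch. It reads XCR0 with XGETBV and tests the two AMX state components, XTILECFG (bit 17) and XTILEDATA (bit 18), as numbered in the Intel SDM. The macro and function names are local to this example. On non-x86 builds the helpers simply report that XCR0 is unavailable. How a per-task XCR0 would behave under the proposal in this thread is an assumption, not something this code establishes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* XCR0 bit positions per the Intel SDM; the names are local to this sketch. */
#define XFEATURE_MASK_XTILECFG  (UINT64_C(1) << 17)
#define XFEATURE_MASK_XTILEDATA (UINT64_C(1) << 18)
#define XFEATURE_MASK_AMX \
    (XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)

/* Returns 0 and stores XCR0 in *xcr0, or -1 if XCR0 cannot be read. */
static int read_xcr0(uint64_t *xcr0)
{
    assert(xcr0 != NULL);
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    uint32_t lo, hi;

    /* CPUID.1:ECX.OSXSAVE (bit 27) says the OS has enabled XGETBV. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return -1;
    /* XGETBV with ECX = 0 returns XCR0 in EDX:EAX. */
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    *xcr0 = ((uint64_t)hi << 32) | lo;
    return 0;
#else
    return -1;
#endif
}

/* 1 if both AMX state components are enabled in XCR0, else 0. */
static int amx_enabled_in_xcr0(void)
{
    uint64_t xcr0;

    if (read_xcr0(&xcr0) != 0)
        return 0;
    return (xcr0 & XFEATURE_MASK_AMX) == XFEATURE_MASK_AMX;
}
```

If the bits are clear in XCR0, AMX instructions fault regardless of what CPUID reports, which is the "really OFF" property asked for above.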
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-31 16:53 ` Andy Lutomirski 2021-03-31 21:42 ` Robert O'Callahan @ 2021-03-31 22:28 ` Len Brown 2021-03-31 22:45 ` Andy Lutomirski 2021-03-31 22:52 ` Borislav Petkov 1 sibling, 2 replies; 130+ messages in thread From: Len Brown @ 2021-03-31 22:28 UTC (permalink / raw) To: Andy Lutomirski Cc: David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Wed, Mar 31, 2021 at 12:53 PM Andy Lutomirski <luto@amacapital.net> wrote: > But this whole annotation thing will require serious compiler support. > We already have problems with compilers inlining functions and getting confused about attributes. We added compiler annotation for user-level interrupt handlers. I'm not aware of it failing, or otherwise being confused. Why would compiler support for fast-signals be any more "serious"? > An API like: > > if (get_amx()) { > use AMX; > } else { > don’t; > } > > Avoids this problem. And making XCR0 dynamic, for all its faults, at least helps force a degree of discipline on user code. dynamic XCR0 breaks the installed base, I thought we had established that. We've also established that when running in a VMM, every update to XCR0 causes a VMEXIT. I thought the goal was to allow new programs to have fast signal handlers. By default, those fast signal handlers would have a stable state image, and would not inherit large architectural state on their stacks, and could thus have minimal overhead on all hardware. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-31 22:28 ` Len Brown @ 2021-03-31 22:45 ` Andy Lutomirski 2021-04-09 20:52 ` Len Brown 2021-03-31 22:52 ` Borislav Petkov 1 sibling, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-03-31 22:45 UTC (permalink / raw) To: Len Brown Cc: David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote: > > On Wed, Mar 31, 2021 at 12:53 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > But this whole annotation thing will require serious compiler support. > > We already have problems with compilers inlining functions and getting confused about attributes. > > We added compiler annotation for user-level interrupt handlers. > I'm not aware of it failing, or otherwise being confused. I followed your link and found nothing. Can you elaborate? In the kernel, we have noinstr, and gcc gives approximately no help toward catching problems. > > Why would compiler support for fast-signals be any more "serious"? > > > An API like: > > > > if (get_amx()) { > > use AMX; > > } else { > > don’t; > > } > > > > Avoids this problem. And making XCR0 dynamic, for all its faults, at least helps force a degree of discipline on user code. > > dynamic XCR0 breaks the installed base, I thought we had established that. I don't think this is at all established. If some code thinks it knows the uncompacted XSTATE size and XCR0 changes, it crashes. This is not necessarily a showstopper. > > We've also established that when running in a VMM, every update to > XCR0 causes a VMEXIT. This is true, it sucks, and Intel could fix it going forward. > > I thought the goal was to allow new programs to have fast signal handlers. 
> By default, those fast signal handlers would have a stable state > image, and would > not inherit large architectural state on their stacks, and could thus > have minimal overhead on all hardware. That is *a* goal, but not necessarily the only goal. --Andy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-31 22:45 ` Andy Lutomirski @ 2021-04-09 20:52 ` Len Brown 2021-04-09 21:44 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-09 20:52 UTC (permalink / raw) To: Andy Lutomirski Cc: David Laight, Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote: > > > We added compiler annotation for user-level interrupt handlers. > > I'm not aware of it failing, or otherwise being confused. > > I followed your link and found nothing. Can you elaborate? In the > kernel, we have noinstr, and gcc gives approximately no help toward > catching problems. A search for the word "interrupt" on this page https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes comes to the description of this attribute: __attribute__ ((interrupt)) > > dynamic XCR0 breaks the installed base, I thought we had established that. > > I don't think this is at all established. If some code thinks it > knows the uncompacted XSTATE size and XCR0 changes, it crashes. This > is not necessarily a showstopper. My working assumption is that crashing applications actually *is* a showstopper. Please clarify. > > We've also established that when running in a VMM, every update to > > XCR0 causes a VMEXIT. > > This is true, it sucks, and Intel could fix it going forward. What hardware fix do you suggest? If a guest is permitted to set XCR0 bits without notifying the VMM, what happens when it sets bits that the VMM doesn't know about? > > I thought the goal was to allow new programs to have fast signal handlers. 
> > By default, those fast signal handlers would have a stable state > > image, and would > > not inherit large architectural state on their stacks, and could thus > > have minimal overhead on all hardware. > > That is *a* goal, but not necessarily the only goal. I fully support coming up with a scheme for fast future-proof signal handlers, and I'm willing to back that up by putting work into it. I don't see any other goals articulated in this thread. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-09 20:52 ` Len Brown @ 2021-04-09 21:44 ` Andy Lutomirski 2021-04-11 19:07 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-04-09 21:44 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, David Laight, Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Fri, Apr 9, 2021 at 1:53 PM Len Brown <lenb@kernel.org> wrote: > > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote: > > > > > We added compiler annotation for user-level interrupt handlers. > > > I'm not aware of it failing, or otherwise being confused. > > > > I followed your link and found nothing. Can you elaborate? In the > > kernel, we have noinstr, and gcc gives approximately no help toward > > catching problems. > > A search for the word "interrupt" on this page > https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes > comes to the description of this attribute: > > __attribute__ ((interrupt)) > I read that and I see no mention of anything saying "this will generate code that does not touch extended state". Instead I see, paraphrasing, "this will generate code with an ABI that is completely inappropriate for use in a user space signal handler". Am I missing something? > > > dynamic XCR0 breaks the installed base, I thought we had established that. > > > > I don't think this is at all established. If some code thinks it > > knows the uncompacted XSTATE size and XCR0 changes, it crashes. This > > is not necessarily a showstopper. > > My working assumption is that crashing applications actually *is* a showstopper. > Please clarify. I think you're presuming that some program actually does this. If no program does this, it's not an ABI break. 
More relevantly, this can only happen in a process that uses XSAVE and thinks it knows the size that *also* does the prctl to change XCR0. By construction, existing programs can't break unless they load new dynamic libraries that break them. > > > > We've also established that when running in a VMM, every update to > > > XCR0 causes a VMEXIT. > > > > This is true, it sucks, and Intel could fix it going forward. > > What hardware fix do you suggest? > If a guest is permitted to set XCR0 bits without notifying the VMM, > what happens when it sets bits that the VMM doesn't know about? The VM could have a mask of allowed XCR0 bits that don't exit. TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on TDX. Surely plain VMX could fix it too. > > > > I thought the goal was to allow new programs to have fast signal handlers. > > > By default, those fast signal handlers would have a stable state > > > image, and would > > > not inherit large architectural state on their stacks, and could thus > > > have minimal overhead on all hardware. > > > > That is *a* goal, but not necessarily the only goal. > > I fully support coming up with a scheme for fast future-proof signal handlers, > and I'm willing to back that up by putting work into it. > > I don't see any other goals articulated in this thread. Before we get too carried away with *fast* signal handlers, something that works with existing programs is also a pretty strong goal. Right now AVX-512 breaks existing programs, even if they don't use AVX-512. ^ permalink raw reply [flat|nested] 130+ messages in thread
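To make the "thinks it knows the uncompacted XSTATE size" failure mode concrete, here is a minimal sketch of how the standard-format XSAVE area size follows from XCR0. The offset/size table is an assumption for illustration: it uses the values commonly reported by CPUID leaf 0xD on Intel parts, and real code should query CPUID rather than hard-code them. The function name `xsave_size_for` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One XSAVE state component in the standard (uncompacted) layout. */
struct xcomp {
    unsigned bit;    /* XCR0 bit number */
    size_t   offset; /* byte offset inside the XSAVE area */
    size_t   size;   /* byte size of the component */
};

/* ASSUMPTION: illustrative values as typically enumerated by CPUID leaf 0xD
 * on Intel hardware. They are not architectural constants; query CPUID. */
static const struct xcomp xcomps[] = {
    {  2,  576,  256 },  /* AVX: upper halves of YMM0-15 */
    {  5, 1088,   64 },  /* AVX-512 opmask registers */
    {  6, 1152,  512 },  /* AVX-512 ZMM_Hi256 */
    {  7, 1664, 1024 },  /* AVX-512 Hi16_ZMM */
    {  9, 2688,    8 },  /* PKRU */
    { 17, 2752,   64 },  /* AMX XTILECFG */
    { 18, 2816, 8192 },  /* AMX XTILEDATA */
};

/* Size of the uncompacted XSAVE area needed for a given XCR0 value:
 * the end of the highest enabled component, never less than the
 * 512-byte legacy region plus the 64-byte XSAVE header. */
size_t xsave_size_for(uint64_t xcr0)
{
    size_t size = 512 + 64;
    for (size_t i = 0; i < sizeof(xcomps) / sizeof(xcomps[0]); i++) {
        if (xcr0 & (1ULL << xcomps[i].bit)) {
            size_t end = xcomps[i].offset + xcomps[i].size;
            if (end > size)
                size = end;
        }
    }
    return size;
}
```

Under these assumed values, code that sized its buffer for XCR0 = 0xe7 (2688 bytes) and later ran XSAVE with the AMX bits also enabled would need 11008 bytes, which is the overrun being argued about here.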
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-09 21:44 ` Andy Lutomirski @ 2021-04-11 19:07 ` Len Brown 2021-04-12 7:59 ` David Laight ` (2 more replies) 0 siblings, 3 replies; 130+ messages in thread From: Len Brown @ 2021-04-11 19:07 UTC (permalink / raw) To: Andy Lutomirski Cc: David Laight, Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Fri, Apr 9, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Fri, Apr 9, 2021 at 1:53 PM Len Brown <lenb@kernel.org> wrote: > > > > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote: > > > > > > > We added compiler annotation for user-level interrupt handlers. > > > > I'm not aware of it failing, or otherwise being confused. > > > > > > I followed your link and found nothing. Can you elaborate? In the > > > kernel, we have noinstr, and gcc gives approximately no help toward > > > catching problems. > > > > A search for the word "interrupt" on this page > > https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes > > comes to the description of this attribute: > > > > __attribute__ ((interrupt)) > > > > I read that and I see no mention of anything saying "this will > generate code that does not touch extended state". Instead I see, > paraphrasing, "this will generate code with an ABI that is completely > inappropriate for use in a user space signal handler". Am I missing > something? Again... An analogous annotation could be created for fast signals. gcc can be told exactly what registers and instructions it can use for that routine. If somebody can suggest a way to make fast signal handlers faster than saving only the state that they themselves actually use, I'm all ears. > > > > dynamic XCR0 breaks the installed base, I thought we had established that.
> > > > > > I don't think this is at all established. If some code thinks it > > > knows the uncompacted XSTATE size and XCR0 changes, it crashes. This > > > is not necessarily a showstopper. > > > > My working assumption is that crashing applications actually *is* a showstopper. > > Please clarify. > > I think you're presuming that some program actually does this. If no > program does this, it's not an ABI break. So you agree that for a program that uses xgetbv to read XCR0 and compute XSTATE size for user-space use of XSAVE can break if XCR0 changes during its lifetime. But you don't believe such software exists? > More relevantly, this can only happen in a process that uses XSAVE and > thinks it knows the size that *also* does the prctl to change XCR0. > By construction, existing programs can't break unless they load new > dynamic libraries that break them. Let's say that a program does math. It calls a library to do that math. It doesn't know or care what instructions the library uses to do math. eg. the library uses SSE on an Atom, and uses AVX512 on a Xeon. Who calls the new prctl, the program, or the library? If it is the program, how does it know that the library wants to use what instructions? If it is the library, then you have just changed XCR0 at run-time and you expose breakage of the thread library that has computed XSAVE size. > > > > We've also established that when running in a VMM, every update to > > > > XCR0 causes a VMEXIT. > > > > > > This is true, it sucks, and Intel could fix it going forward. > > > > What hardware fix do you suggest? > > If a guest is permitted to set XCR0 bits without notifying the VMM, > > what happens when it sets bits that the VMM doesn't know about? > > The VM could have a mask of allowed XCR0 bits that don't exist. > > TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on > TDX. Surely plain VMX could fix it too. There are two cases. 1. Hardware that exists today and in the foreseeable future. 
VM modification of XCR0 results in VMEXIT to VMM. The VMM sees bits set by the guest, and so it can accept what it supports, or send the VM a fault for non-support. Here it is not possible for the VM to change XCR0 without the VMM knowing. 2. Future Hardware that allows guests to write XCR0 w/o VMEXIT. Not sure I follow your proposal. Yes, the VM effectively has a mask of what is supported, because it can issue CPUID. The VMM virtualizes CPUID, and needs to know it must not expose to the VM any state features it doesn't support. Also, the VMM needs to audit XCR0 before it uses XSAVE, else the guest could attack or crash the VMM through buffer overrun. Is this what you suggest? If yes, what do you suggest in the years between now and when that future hardware and VMM exist? > > > > I thought the goal was to allow new programs to have fast signal handlers. > > > > By default, those fast signal handlers would have a stable state > > > > image, and would > > > > not inherit large architectural state on their stacks, and could thus > > > > have minimal overhead on all hardware. > > > > > > That is *a* goal, but not necessarily the only goal. > > > > I fully support coming up with a scheme for fast future-proof signal handlers, > > and I'm willing to back that up by putting work into it. > > > > I don't see any other goals articulated in this thread. > > Before we get too carried away with *fast* signal handlers, something > that works with existing programs is also a pretty strong goal. Right > now AVX-512 breaks existing programs, even if they don't use AVX-512. Re: "AVX-512 breaks existing programs, even if they don't use AVX-512" Perhaps it would be useful to review how that breakage can happen, recognize when it is a problem, when it is not a problem, and what we are doing to address it today. The "ABI" here is the signal.h definition of MINSIGSTKSZ and SIGSTKSZ as 2KB and 8KB (on all architectures).
These hard coded constants may be used by programs that choose to manually allocate and register alternative signal stacks. The signal delivery ABI we use today, where all x86 architecture state is XSAVED onto the signal stack, will exceed 2KB when running on hardware that supports AVX-512. This issue is real. There do exist programs that use alternative stacks, and of those, there do exist programs that use these constants, and if they do take a signal on that size stack on that hardware, they do fail. However, as evidenced by the fact that AVX-512 shipped several years ago and the world didn't stop, there are not a lot of programs with this exposure. That said, adding 8KB to the architecture state on systems that support AMX/TMUL makes this existing issue significantly more acute. Glibc 2.34, to be released in July, re-defines these constants into run-time values. It uses CPUID to compute values that work, and so a program that uses this ABI and is compiled with glibc 2.34 or later will not fail. Further, Chang's kernel patch series does two important things. First, it inspects the destination stack and computes the stack frame size and it refuses to write onto a stack that will overflow. We should have always been making that check. Second, it exports the kernel's notion of how big the signal stack needs to be via the auxiliary vector (auxv), and glibc 2.34 picks this up and uses it in preference over its own CPUID calculation, above. So in a perfect world, you have AMX hardware, and the OS that supports your AMX hardware has a kernel and glibc that support it. Everything that comes with that OS, or is built on that OS, uses that new library. This mechanism similarly addresses the AVX-512 stack issue. Granted, if you have an application that is statically linked and run on new hardware and new OS, it can still fail. Granted, if you have an application that creates small signal stacks without using the ABI, even a re-compile with the new library will not help it.
Granted, signal stacks -- whether they be normal or these alternative signal stacks, are bigger on hardware that has more hardware architecture state. But applications that use the ABI do not need to be modified. I believe that this plan is sane. I acknowledge that it doesn't address the desire for minimum size fast signal handlers that are minimal and fast on all hardware. I think we can address that with a NEW ABI, but not the old one. thanks, Len Brown, Intel Open Source Technology Center -- Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
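The run-time sizing described in this message can be sketched as follows. This is a minimal illustration, assuming the kernel exports its minimum signal-frame size through the auxiliary vector entry `AT_MINSIGSTKSZ` (the key value 51 is defined here only as a fallback for older headers), and falling back to the header constant when the entry is absent. The helper names `min_sigstack_size` and `install_altstack` and the `headroom` parameter are hypothetical.

```c
#define _GNU_SOURCE          /* exposes stack_t/sigaltstack under strict -std modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51    /* Linux auxv key; missing from older headers */
#endif

/* Smallest alternate-stack size that can hold one signal frame on this
 * machine: the larger of the header value and what the kernel reports. */
size_t min_sigstack_size(void)
{
    size_t from_headers = MINSIGSTKSZ;
    unsigned long from_kernel = getauxval(AT_MINSIGSTKSZ); /* 0 if not exported */
    return from_kernel > from_headers ? (size_t)from_kernel : from_headers;
}

/* Allocate and register an alternate signal stack. `headroom` is the space
 * the handler itself needs on top of the frame. Returns 0 on success.
 * The stack is intentionally never freed while it stays registered. */
int install_altstack(size_t headroom)
{
    stack_t ss;

    ss.ss_size = min_sigstack_size() + headroom;
    ss.ss_sp = malloc(ss.ss_size);
    ss.ss_flags = 0;
    if (ss.ss_sp == NULL)
        return -1;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

A program that sizes its stack this way does not depend on the compile-time 2KB/8KB constants, which is the property that matters for binaries built against old headers.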
* RE: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-11 19:07 ` Len Brown @ 2021-04-12 7:59 ` David Laight 2021-04-12 12:19 ` Borislav Petkov 2021-04-12 17:14 ` Sean Christopherson 2 siblings, 0 replies; 130+ messages in thread From: David Laight @ 2021-04-12 7:59 UTC (permalink / raw) To: 'Len Brown', Andy Lutomirski Cc: Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API From: Len Brown > Sent: 11 April 2021 20:07 ... > Granted, if you have an application that is statically linked and run > on new hardware > and new OS, it can still fail. That also includes anything compiled and released as a program binary that must run on older Linux installations. Such programs have to be compiled with old copies of the system headers (and probably with an old version of gcc). While such programs themselves won't use AVX without checking for OS support, the glibc code on the installed system might. Such programs can be modified to run-time detect the required signal stack size - but cannot rely on glibc to convert SIGSTKSZ into a function call. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-11 19:07 ` Len Brown 2021-04-12 7:59 ` David Laight @ 2021-04-12 12:19 ` Borislav Petkov 2021-04-12 17:14 ` Sean Christopherson 2 siblings, 0 replies; 130+ messages in thread From: Borislav Petkov @ 2021-04-12 12:19 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, David Laight, Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Sun, Apr 11, 2021 at 03:07:29PM -0400, Len Brown wrote: > If it is the program, how does it know that the library wants to use > what instructions? > > If it is the library, then you have just changed XCR0 at run-time and > you expose breakage of the thread library that has computed XSAVE size. So, when old programs which cannot possibly know about the arch_prctl() extension we're proposing here, link against that library, then that library should not be allowed to go use "fat" states. Unless the library can "tell" the process which links to it, that it has dynamically enlarged the save state. If it can and the process can handle that, then all is fine, save state gets enlarged dynamically and it all continues merrily. Also, in order for the library to use fat states, it needs to ask the kernel for such support - not CPUID - because the kernel is doing the state handling for everybody and doing all the CR4.OSXSAVE setup etc. Which also means that the kernel can help here by telling the library: - No, you cannot use fat states with this process because it hasn't called arch_prctl() so it cannot handle them properly. - Yes, this process allows me to handle fat states for it so you can use those states and thus those instructions when doing operations for it. So the kernel becomes the arbiter in all this - as it should be - and then all should work fine. -- Regards/Gruss, Boris.
https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-11 19:07 ` Len Brown 2021-04-12 7:59 ` David Laight 2021-04-12 12:19 ` Borislav Petkov @ 2021-04-12 17:14 ` Sean Christopherson 2 siblings, 0 replies; 130+ messages in thread From: Sean Christopherson @ 2021-04-12 17:14 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, David Laight, Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Sun, Apr 11, 2021, Len Brown wrote: > On Fri, Apr 9, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Fri, Apr 9, 2021 at 1:53 PM Len Brown <lenb@kernel.org> wrote: > > > > > > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote: > > > > > We've also established that when running in a VMM, every update to > > > > > XCR0 causes a VMEXIT. > > > > > > > > This is true, it sucks, and Intel could fix it going forward. > > > > > > What hardware fix do you suggest? > > > If a guest is permitted to set XCR0 bits without notifying the VMM, > > > what happens when it sets bits that the VMM doesn't know about? > > > > The VM could have a mask of allowed XCR0 bits that don't exist. > > > > TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on > > TDX. Surely plain VMX could fix it too. > > There are two cases. > > 1. Hardware that exists today and in the foreseeable future. > > VM modification of XCR0 results in VMEXIT to VMM. > The VMM sees bits set by the guest, and so it can accept what > it supports, or send the VM a fault for non-support. > > Here it is not possible for the VMM to change XCR0 without the VMM knowing. > > 2. Future Hardware that allows guests to write XCR0 w/o VMEXIT. > > Not sure I follow your proposal. > > Yes, the VM effectively has a mask of what is supported, > because it can issue CPUID. 
> > The VMM virtualizes CPUID, and needs to know it must not > expose to the VM any state features it doesn't support. > Also, the VMM needs to audit XCR0 before it uses XSAVE, > else the guest could attack or crash the VMM through > buffer overrun. The VMM already needs to context switch XCR0 and XSS, so this is a non-issue. > Is this what you suggest? Yar. In TDX, XSETBV exits, but only to the TDX module. I.e. TDX solves the problem in software by letting the VMM tell the TDX module what features the guest can set in XCR0/XSS via the XFAM (Extended Features Allowed Mask). But, that software "fix" can also be pushed into ucode, e.g. add an XFAM VMCS field, the guest can set any XCR0 bits that are '1' in VMCS.XFAM without exiting. Note, SGX has similar functionality in the form of XFRM (XSAVE-Feature Request Mask). The enclave author can specify what features will be enabled in XCR0 when the enclave is running. Not that relevant, other than to reinforce that this is a solvable problem. > If yes, what do you suggest in the years between now and when > that future hardware and VMM exist? Burn some patch space? :-) ^ permalink raw reply [flat|nested] 130+ messages in thread
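The allowed-mask idea in this message reduces to a single bit test. The sketch below models it in plain C, assuming a hypothetical per-guest `xfam` mask as described above: a guest XSETBV write proceeds without a VM exit only if every requested XCR0 bit is set in the mask. The function name `xsetbv_must_exit` is made up for illustration and does not correspond to any existing VMX control.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a guest write of `new_xcr0` would have to trap to the
 * VMM, i.e. it asks for at least one bit outside the allowed mask `xfam`
 * that the VMM programmed for this guest. */
bool xsetbv_must_exit(uint64_t new_xcr0, uint64_t xfam)
{
    return (new_xcr0 & ~xfam) != 0;
}
```

With such a mask in place, the common case of a guest kernel toggling XCR0 among permitted features never leaves the guest, while an attempt to enable a state the VMM does not handle is still intercepted.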
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-31 22:28 ` Len Brown 2021-03-31 22:45 ` Andy Lutomirski @ 2021-03-31 22:52 ` Borislav Petkov 2021-04-09 20:55 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-03-31 22:52 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Wed, Mar 31, 2021 at 06:28:27PM -0400, Len Brown wrote: > dynamic XCR0 breaks the installed base, I thought we had established > that. We should do a clear cut and have legacy stuff which has its legacy expectations on the XSTATE layout and not touch those at all. And then all new apps which will use these new APIs can go and request whatever fancy new state constellations we support. Including how they want their signals handled, etc. Fat states like avx512, amx etc will be off by default and apps explicitly requesting those, can get them. That's it. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-31 22:52 ` Borislav Petkov @ 2021-04-09 20:55 ` Len Brown 0 siblings, 0 replies; 130+ messages in thread From: Len Brown @ 2021-04-09 20:55 UTC (permalink / raw) To: Borislav Petkov Cc: Andy Lutomirski, David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Wed, Mar 31, 2021 at 6:54 PM Borislav Petkov <bp@alien8.de> wrote: > > On Wed, Mar 31, 2021 at 06:28:27PM -0400, Len Brown wrote: > > dynamic XCR0 breaks the installed base, I thought we had established > > that. > > We should do a clear cut and have legacy stuff which has its legacy > expectations on the XSTATE layout and not touch those at all. > > And then all new apps which will use these new APIs can go and request > whatever fancy new state constellations we support. Including how they > want their signals handled, etc. > > Fat states like avx512, amx etc will be off by default and apps > explicitly requesting those, can get them. > > That's it. 100% agreement from me! (does anybody disagree?) thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-26 23:18 ` Andy Lutomirski 2021-03-27 3:39 ` Len Brown @ 2021-03-28 0:53 ` Thomas Gleixner 2021-03-29 7:27 ` Peter Zijlstra 2021-03-29 15:06 ` Dave Hansen 1 sibling, 2 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-03-28 0:53 UTC (permalink / raw) To: Andy Lutomirski Cc: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API Andy, On Fri, Mar 26 2021 at 16:18, Andy Lutomirski wrote: > arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states, > sigclear_states, 0); > > Sets xcr0. All states are preallocated except that states in > lazy_states may be unallocated in the kernel until used. (Not > supported at all in v1. lazy_states & ~xcr0 != 0 is illegal.) States > in sigsave_states are saved in the signal frame. States in > sigclear_states are reset to the init state on signal delivery. > States in sigsave_states are restored by sigreturn, and states not in > sigsave_states are left alone by sigreturn. I like the idea in principle. > Optionally we come up with a new format for new features in the signal > frame, since the current format is showing its age. Taking 8kB for a > signal with AMX is one thing. Taking another 8kB for a nested signal > if AMX is not in use is worse. I don't think that we should make that optional to begin with. Sizing sigaltstack is a lottery as of today and making it more so does not help at all. > Optionally we make AVX-512 also default off, which fixes what is > arguably a serious ABI break with AVX-512: lots of programs, following > POSIX (!), seem to think that they know how much space to allocate > for sigaltstack(). AVX-512 is too big. I really wish we could do that. That AVX512 disaster is not trivial to sort. Let's focus on AMX first. That ship at least has not sailed yet, but if it does without a proper resolution then it's going to sail deep south.
Maybe we end up with some ideas about the AVX512 issue as well that way. The main problem I see is simply historical. Every other part of the user stack space from libraries to applications tries to be "smart" about utilizing the assumed best instruction set, feature extensions which are detected when something is initialized. I can sing a song of that because I was casually involved porting debian to an unsupported architecture. Magic all over the place. Now add the whole pile of proprietary software stacks, libraries on top of that picture and things get completely out of control. Why? Simply because user space has absolutely no concept about orchestrating these things at all. That worked for a while by some definition of works and this model is still proliferated today even by players who should know better. Even if you expected that some not so distant events and the experience with fleet consistency would have stopped the 'performance first, features first' chorus in some way, that's not what reality is. Linux is not necessarily innocent. For years we just crammed features into the kernel without thinking too hard about the big picture. But, yes we realized the hard way that there is a problem and just adding yet another magic 'make it work' hack for AMX is definitely the wrong approach. What are the possible problems when we make it a hard requirement for AMX to be requested by an application/task in order to use it? For the kernel itself. Not really any consequence I can think of aside of unhappy campers in user space. For user space this is disruptive and we have at least to come up with some reasonable model how all involved components with different ideas of how to best utilize a given CPU can be handled. That starts at the very simple problem of feature enumeration. Up to now CPUID is non-privileged and a large amount of user space just takes that as the ultimate reference.
We can change that when CPUID faulting in CPL3 is supported by the CPU which we can't depend on because it is not architectural. Though the little devil in my head tells me, that making AMX support depend on the CPUID faulting capability might be not the worst thing. Then we actually enforce CPUID faulting (finally) on CPUs which support it, which would be a first step into the right direction simply because then random library X has to go to the kernel and ask for it explicitly or just shrug and use whatever the kernel is willing to hand out in CPUID. Now take that one step further. When the first part of some user space application asks for it, then you can register that with the process and make sane decisions for all other requesters which come after it, which is an important step into the direction of having a common orchestration for this. Sure you can do that via XCR0 as well to some extent, but that CPUID fault would solve a whole class of other problems which people who care about feature consistency face today at least to some extent. And contrary to XCR0, which is orthogonal and obviously still required for the AMX (and hint AVX512) problem, CPUID faulting would just hand out the feature bits which the kernel wants to hand out. If the app, library or whatever still tries to use them, then they get the #UD, #GP or whatever penalty is associated to that particular XCR0 disabled piece. It's not there, you tried, keep the pieces. Making it solely depend on XCR0 and fault if not requested upfront is bringing you into the situation that you broke 'legacy code' which relied on the CPUID bit and that worked until now which gets you in the no-regression trap. I haven't thought this through obviously, but depending solely on XCR0 faults did not really sum up, so I thought I share that evil idea for broader discussion. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-28 0:53 ` Thomas Gleixner @ 2021-03-29 7:27 ` Peter Zijlstra 2021-03-29 15:06 ` Dave Hansen 1 sibling, 0 replies; 130+ messages in thread From: Peter Zijlstra @ 2021-03-29 7:27 UTC (permalink / raw) To: Thomas Gleixner Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On Sun, Mar 28, 2021 at 01:53:15AM +0100, Thomas Gleixner wrote: > Though the little devil in my head tells me, that making AMX support > depend on the CPUID faulting capability might be not the worst thing. > > Then we actually enforce CPUID faulting (finally) on CPUs which support > it, which would be a first step into the right direction simply because > then random library X has to go to the kernel and ask for it explicitly > or just shrug and use whatever the kernel is willing to hand out in > CPUID. > > Now take that one step further. When the first part of some user space > application asks for it, then you can register that with the process and > make sane decisions for all other requesters which come after it, which > is an important step into the direction of having a common orchestration > for this. I wrote something like that at least once... https://lore.kernel.org/lkml/20190212164833.GK32494@hirez.programming.kicks-ass.net/ we just need to make sure AMD implements that before it ships a chip with AVX512 on. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-28 0:53 ` Thomas Gleixner 2021-03-29 7:27 ` Peter Zijlstra @ 2021-03-29 15:06 ` Dave Hansen 1 sibling, 0 replies; 130+ messages in thread From: Dave Hansen @ 2021-03-29 15:06 UTC (permalink / raw) To: Thomas Gleixner, Andy Lutomirski Cc: Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API On 3/27/21 5:53 PM, Thomas Gleixner wrote: > Making it solely depend on XCR0 and fault if not requested upfront is > bringing you into the situation that you broke 'legacy code' which > relied on the CPUID bit and that worked until now which gets you > in the no-regression trap. Trying to find the right place to jump into this thread... :) I don't know what apps do in practice. But, the enumeration of the features in the SDM describes three steps: 1. Check for XGETBV support 2. Use XGETBV[0] to check that the OS is aware of the feature and is context-switching it 3. Detect the feature itself So, apps *are* supposed to be checking XCR0 via XGETBV. If they don't, they run the risk of a feature being supported by the CPU and the registers "working" but not being context-switched. Zeroing out bits in XCR0 will have the effect of telling the app that the OS isn't context-switching the state. I think this means that apps will see the same thing in both situations: 1. If they run an old (say pre-AVX-512) kernel on new AVX-512-enabled hardware, or 2. They run a new kernel with this fancy proposed XCR0-switching mechanism I _think_ that gets us off the hook for an ABI break, at least for AVX-512. ^ permalink raw reply [flat|nested] 130+ messages in thread
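The three enumeration steps listed in this message can be written as a small decision function. This is a sketch only: the CPUID and XGETBV results are passed in as arguments so the logic stays portable, whereas a real library would obtain `osxsave` from CPUID.1:ECX bit 27, `xcr0` from XGETBV with ECX = 0, and `cpuid_avx512f` from CPUID.(EAX=7,ECX=0):EBX bit 16. The function name `avx512f_usable` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* XCR0 state-component bits that AVX-512 depends on. */
#define XCR0_SSE        (1ULL << 1)
#define XCR0_AVX        (1ULL << 2)
#define XCR0_OPMASK     (1ULL << 5)
#define XCR0_ZMM_HI256  (1ULL << 6)
#define XCR0_HI16_ZMM   (1ULL << 7)
#define XCR0_AVX512     (XCR0_SSE | XCR0_AVX | XCR0_OPMASK | \
                         XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* Decide whether AVX-512F may be used, following the three steps. */
bool avx512f_usable(bool osxsave, uint64_t xcr0, bool cpuid_avx512f)
{
    /* Step 1: XGETBV is only available if the OS enabled XSAVE. */
    if (!osxsave)
        return false;
    /* Step 2: the OS must be context-switching all AVX-512 state. */
    if ((xcr0 & XCR0_AVX512) != XCR0_AVX512)
        return false;
    /* Step 3: the CPU must actually implement the feature. */
    return cpuid_avx512f;
}
```

A library that follows this sequence sees a kernel that clears the AVX-512 bits in XCR0 exactly as it would see an old kernel that never enabled them, which is the argument for why zeroing those bits is not an ABI break for well-behaved code.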
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-03-26 23:12 Candidate Linux ABI for Intel AMX and hypothetical new related features Andy Lutomirski 2021-03-26 23:18 ` Andy Lutomirski @ 2021-03-31 8:24 ` Borislav Petkov [not found] ` <87lf9nk2ku.fsf@oldenburg.str.redhat.com> 2 siblings, 0 replies; 130+ messages in thread From: Borislav Petkov @ 2021-03-31 8:24 UTC (permalink / raw) To: Andy Lutomirski Cc: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer On Fri, Mar 26, 2021 at 04:12:25PM -0700, Andy Lutomirski wrote: > To detect features and control XCR0, we add some new arch_prctls: > > arch_prctl(ARCH_GET_XCR0_SUPPORT, 0, ...); > > returns the set of XCR0 bits supported on the current kernel. > > arch_prctl(ARCH_GET_XCR0_LAZY_SUPPORT, 0, ...); > > returns 0. See below. > > arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states, > sigclear_states, 0); Right, but I'd simply replace that "XCR0" arch detail, more or less, with "XSTATE": ARCH_GET_XSTATE_SUPPORT ARCH_GET_XSTATE_LAZY_SUPPORT ARCH_SET_XSTATE or ARCH_ENABLE_XSTATE or so to denote that this is controlling XSTATE handling while the XCR0 thing is the control register and when in the future something else does control that (XCR0[63] is one provision for that) then we're still on-point with the naming. > Thoughts? FTR, I really like the aspect of apps *requesting* handling of non-legacy, fat states and latter remaining off otherwise in order to keep the sanity of everyone involved. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
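For readers trying to picture the argument checking behind the proposed ARCH_SET_XCR0 (or ARCH_SET_XSTATE) call, here is a minimal model in plain C. Only one rule comes from the proposal itself: `lazy_states & ~xcr0 != 0` is illegal. The remaining checks (requested bits must be supported, lazy states must be within what the lazy-support query reports, signal masks must be subsets of the requested XCR0) are assumptions added for illustration, and the function name `set_xstate_args_valid` is hypothetical. Nothing here is an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of argument validation for the proposed call.
 * `supported` and `lazy_supported` stand for the values the two proposed
 * query calls would return (the latter is 0 in the v1 proposal). */
bool set_xstate_args_valid(uint64_t supported, uint64_t lazy_supported,
                           uint64_t xcr0, uint64_t lazy_states,
                           uint64_t sigsave_states, uint64_t sigclear_states)
{
    /* ASSUMPTION: cannot enable states the kernel does not support. */
    if (xcr0 & ~supported)
        return false;
    /* From the proposal: lazy states must be a subset of xcr0. */
    if (lazy_states & ~xcr0)
        return false;
    /* ASSUMPTION: lazy states must be ones the kernel can allocate lazily;
     * with lazy support reported as 0, any lazy request is rejected. */
    if (lazy_states & ~lazy_supported)
        return false;
    /* ASSUMPTION: signal handling masks only refer to enabled states. */
    if ((sigsave_states | sigclear_states) & ~xcr0)
        return false;
    return true;
}
```

Whatever the final naming, keeping the checks as pure mask comparisons like these would make the call cheap to validate and easy to extend when new state components appear.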
[parent not found: <87lf9nk2ku.fsf@oldenburg.str.redhat.com>]
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features [not found] ` <87lf9nk2ku.fsf@oldenburg.str.redhat.com> @ 2021-04-12 14:31 ` Borislav Petkov 2021-04-12 14:38 ` Florian Weimer 2021-04-12 15:21 ` Andy Lutomirski 1 sibling, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-04-12 14:31 UTC (permalink / raw) To: Florian Weimer Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12, 2021 at 04:19:29PM +0200, Florian Weimer wrote: > Maybe we could have done this in 2016 when I reported this for the first > time. Now it is too late, as more and more software is using > CPUID-based detection for AVX-512. So as I said on another mail today, I don't think a library should rely solely on CPUID-based detection of features especially if those features need kernel support too. IOW, it should ask whether the kernel can handle those too, first. And the CPUID-faulting thing would solve stuff like that because then the kernel can *actually* get involved into answering something where it has a say in, too. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-12 14:31 ` Borislav Petkov @ 2021-04-12 14:38 ` Florian Weimer 2021-04-12 15:08 ` Borislav Petkov 2021-04-12 15:10 ` Andy Lutomirski 0 siblings, 2 replies; 130+ messages in thread From: Florian Weimer @ 2021-04-12 14:38 UTC (permalink / raw) To: Borislav Petkov Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer * Borislav Petkov: > On Mon, Apr 12, 2021 at 04:19:29PM +0200, Florian Weimer wrote: >> Maybe we could have done this in 2016 when I reported this for the first >> time. Now it is too late, as more and more software is using >> CPUID-based detection for AVX-512. > > So as I said on another mail today, I don't think a library should rely > solely on CPUID-based detection of features especially if those features > need kernel support too. IOW, it should ask whether the kernel can > handle those too, first. Yes, that's why we have the XGETBV handshake. I was imprecise. It's CPUID + XGETBV of course. Or even AT_HWCAP2 (for FSGSBASE). > And the CPUID-faulting thing would solve stuff like that because then > the kernel can *actually* get involved into answering something where it > has a say in, too. But why wouldn't we use a syscall or an entry in the auxiliary vector for that? Why fault a potentially performance-critical instruction? Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-12 14:38 ` Florian Weimer @ 2021-04-12 15:08 ` Borislav Petkov 2021-04-12 15:10 ` Andy Lutomirski 1 sibling, 0 replies; 130+ messages in thread From: Borislav Petkov @ 2021-04-12 15:08 UTC (permalink / raw) To: Florian Weimer Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12, 2021 at 04:38:15PM +0200, Florian Weimer wrote: > Yes, that's why we have the XGETBV handshake. I was imprecise. It's > CPUID + XGETBV of course. Or even AT_HWCAP2 (for FSGSBASE). Ok, that sounds better. So looking at glibc sources, I see something like this: init_cpu_features |-> update_usable |-> CPU_FEATURE_SET_USABLE (cpu_features, XGETBV_ECX_1); so I'm guessing this happens when the library gets loaded per process, right? Which means, once the detection has taken place and the library has gotten XCR0, it is going to use it and won't re-ask the kernel or so? I.e., I'm trying to imagine how a per-process thing would work at all. If at all. And this sounds especially "fun": > Code that installs a signal handler often does not have control on > which thread an asynchronous signal is delivered, or which code it > interrupts. In my simplistic approach I'm thinking about something along the lines of: Library: hey kernel, can you handle AVX512? Kernel: yes Library: ok, I will use that in the signal handler And since kernel has said yes, kernel is going to take care of handling AVX512 state and library can assume that. All those old processes which cannot be recompiled, for them I guess the kernel should have to say no. Dunno how much sense that makes... > > And the CPUID-faulting thing would solve stuff like that because then > > the kernel can *actually* get involved into answering something where it > > has a say in, too. > > But why wouldn't we use a syscall or an entry in the auxiliary vector > for that? 
Why fault a potentially performance-critical instruction? Oh sure, CPUID faulting was just an example. I think the intent is to have this important aspect of userspace asking the kernel first what kind of features it can handle and then do accordingly. IOW, legacy stuff can work unchanged and new libraries and kernels can support fancier features and bigger buffers. Methinks. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-12 14:38 ` Florian Weimer 2021-04-12 15:08 ` Borislav Petkov @ 2021-04-12 15:10 ` Andy Lutomirski 1 sibling, 0 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-04-12 15:10 UTC (permalink / raw) To: Florian Weimer Cc: Borislav Petkov, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer > On Apr 12, 2021, at 7:38 AM, Florian Weimer <fweimer@redhat.com> wrote: > > * Borislav Petkov: > >>> On Mon, Apr 12, 2021 at 04:19:29PM +0200, Florian Weimer wrote: >>> Maybe we could have done this in 2016 when I reported this for the first >>> time. Now it is too late, as more and more software is using >>> CPUID-based detection for AVX-512. >> >> So as I said on another mail today, I don't think a library should rely >> solely on CPUID-based detection of features especially if those features >> need kernel support too. IOW, it should ask whether the kernel can >> handle those too, first. > > Yes, that's why we have the XGETBV handshake. I was imprecise. It's > CPUID + XGETBV of course. Or even AT_HWCAP2 (for FSGSBASE). > >> And the CPUID-faulting thing would solve stuff like that because then >> the kernel can *actually* get involved into answering something where it >> has a say in, too. > > But why wouldn't we use a syscall or an entry in the auxiliary vector > for that? Why fault a potentially performance-critical instruction? > CPUID is horrifically slow in various virt scenarios. If user code needs to serialize, use IRET or SERIALIZE. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features [not found] ` <87lf9nk2ku.fsf@oldenburg.str.redhat.com> 2021-04-12 14:31 ` Borislav Petkov @ 2021-04-12 15:21 ` Andy Lutomirski 2021-04-12 23:46 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-04-12 15:21 UTC (permalink / raw) To: Florian Weimer Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12, 2021 at 7:19 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Andy Lutomirski: > > Maybe we could have done this in 2016 when I reported this for the first > time. Now it is too late, as more and more software is using > CPUID-based detection for AVX-512. Our users have been using AVX-512 > hardware for quite some time now, and I haven't seen *that* many issues > resulting from the context size. That isn't to say that problems do not > exist, but they are more of the kind that the increased stack usage > means that areas of the stack that used to be zero no longer are, so > users encounter different side effects from uninitialized-variable bugs. > > How much software depends on the signal handler data layout? The %zmm > state does not appear to be exposed today, so perhaps some savings could > be had there. The fact that including <asm/signal.h> is barely functional in glibc probably helps keep software from touching the state. :) > > The suggestion to make CPUID trap doesn't sound workable to me. At > least in the past, it's been suggested as a serializing instruction to > be used alongside RDTSC, which makes it rather time-critical for some > applications. > > Even today, signal handlers do not really compose well in the sense that > multiple libraries can use them and collaborate without being aware of > each other (like they can divide up TLS memory with the help of the > dynamic linker, or carve out address space using mmap). 
Proposals to > set special process-wide flags only make that situation worse. Code > that installs a signal handler often does not have control on which > thread an asynchronous signal is delivered, or which code it interrupts. > A single process-wide flag cannot capture that accurately, even if it is > per signal number. I think this would want to be a per-signal-handler flag, not per process. It's entirely possible to write a signal handler callback that doesn't touch AVX512 or AMX state, even if the toolchain may make it annoying. That specific handler could set the "make me fast" flag. > > The rseq extension might work conceptually, but it requires to make > operations idempotent, with periodic checkpoint, and of course > inline/flatten all calls. And it requires compiler work, the present > model based on inline asm hacks doesn't look workable. Maybe that works > for AMX. I have not checked if there is yet any public documentation of > the programming model. I tend to think that the rseq model will be unworkable. People trying to use the new instructions will hate it. > > I think someone expressed the sentiment (maybe on another thread) that > the current CPU feature enablement process does not work. I do not > agree. Currently it is only necessary to upgrade the kernel and maybe > glibc (but not in all cases), and then you are good to go. You can keep > using your old libraries, old compilers, and even old assemblers if you > are okay with .byte hacks. You do not need special userspace libraries, > new compilers for different languages, special firmware or binary blobs. > Overall, it just works. > > On x86, we are really bad about actually using CPU features pervasively, > but that is a different story. > "Just works" is different from "is a good idea", though. With SSE2 and other non-VEX xmm extensions, just using them in userspace seems quite reasonable. If a function could run faster using xmm, then it might as well use xmm. 
But this model starts to break down with newer features: VEX: ymm (AFAIK) performs just fine, at least on most CPUs, except that mixing VEX and non-VEX code has big penalties. Copying that 64-bit data structure using ymm is not necessarily wise even if it microbenchmarks well. Heck, mixing translation units using normal C floating point code that were compiled with different flags can be quite slow. AVX-512: Intel has still not responded to my requests for detailed documentation of the performance issues. The internet is full of various reports and various voodoo ideas. VZEROALL does not do what one would naively expect, and the implications are unclear. AVX-512 code, even used just once, is likely to permanently bloat the signal state. Even ignoring the unknowns here, on most current non-Xeon-phi parts AFAICT, using small bits of AVX-512 code has *huge* performance impacts. Libraries automatically using AVX-512 just because it's there is not necessarily a good idea, even if it microbenchmarks well. AMX: Multiplying a 4x4 matrix probably looks *great* in a microbenchmark. Do it once and you permanently allocate 8kB (is that even a constant? can it grow in newer parts?), potentially hurts all future context switches, and does who-knows-what to Turbo licenses and such. Even putting aside all kernel and ABI issues, is it actually a good idea for user libraries to transparently use these new features? I'm not really convinced. I think that serious discussion among userspace people is needed. --Andy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-12 15:21 ` Andy Lutomirski @ 2021-04-12 23:46 ` Len Brown 2021-04-13 0:17 ` Thomas Gleixner ` (2 more replies) 0 siblings, 3 replies; 130+ messages in thread From: Len Brown @ 2021-04-12 23:46 UTC (permalink / raw) To: Andy Lutomirski Cc: Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote: > AMX: Multiplying a 4x4 matrix probably looks *great* in a > microbenchmark. Do it once and you permanently allocate 8kB (is that > even a constant? can it grow in newer parts?), potentially hurts all > future context switches, and does who-knows-what to Turbo licenses and > such. Intel expects that AMX will be extremely valuable to key workloads. It is true that you may never run that kind of workload on the machine in front of you, and so you have every right to be doubtful about the value of AMX. The AMX architectural state size is not expected to change. Rather, if a "new AMX" has a different state size, it is expected to use a new feature bit, different from AMX. The AMX context switch buffer is allocated only if and when a task touches AMX registers. Yes, there will be data transfer to and from that buffer when three things all happen. 1. the data is valid 2. hardware interrupts the application 3. the kernel decides to context switch. There will be no data transfer of AMX architectural state when it is in INIT state. As AMX registers are volatile, correct software will always have them in INIT state before calls, including system calls. I've addressed turbo licenses already. > Even putting aside all kernel and ABI issues, is it actually a good > idea for user libraries to transparently use these new features? I'm > not really convinced. I think that serious discussion among userspace > people is needed. At the risk of stating the obvious... 
Intel's view is that libraries that deliver the most value from the hardware are a "good thing", and that anything preventing libraries from getting the most value from the hardware is a "bad thing":-) cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-12 23:46 ` Len Brown @ 2021-04-13 0:17 ` Thomas Gleixner 2021-04-13 1:25 ` Len Brown 2021-04-13 3:43 ` Willy Tarreau 2021-04-13 20:16 ` Andy Lutomirski 2 siblings, 1 reply; 130+ messages in thread From: Thomas Gleixner @ 2021-04-13 0:17 UTC (permalink / raw) To: Len Brown, Andy Lutomirski Cc: Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12 2021 at 19:46, Len Brown wrote: > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote: >> Even putting aside all kernel and ABI issues, is it actually a good >> idea for user libraries to transparently use these new features? I'm >> not really convinced. I think that serious discussion among userspace >> people is needed. > > At the risk of stating the obvious... > Intel's view is that libraries that deliver the most value from the > hardware are a "good thing", > and that anything preventing libraries from getting the most value > from the hardware is a "bad thing":-) Sure, and as a consequence the kernel is the problem when creative libraries cause wreckage along the way. I'm fine with that as long as the kernel has a way to detect that and can kill the offending application/library combo with an explicit -SIG_NICE_TRY. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-13 0:17 ` Thomas Gleixner @ 2021-04-13 1:25 ` Len Brown 0 siblings, 0 replies; 130+ messages in thread From: Len Brown @ 2021-04-13 1:25 UTC (permalink / raw) To: Thomas Gleixner Cc: Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12, 2021 at 8:17 PM Thomas Gleixner <tglx@linutronix.de> wrote: > I'm fine with that as long as the kernel has a way to detect that and can > kill the offending application/library combo with an explicit > -SIG_NICE_TRY. Agreed. The new run-time check for sigaltstack overflow is one place we can do that. Let me know if you think of others, thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-12 23:46 ` Len Brown 2021-04-13 0:17 ` Thomas Gleixner @ 2021-04-13 3:43 ` Willy Tarreau 2021-04-13 19:51 ` Len Brown 2021-04-13 20:16 ` Andy Lutomirski 2 siblings, 1 reply; 130+ messages in thread From: Willy Tarreau @ 2021-04-13 3:43 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12, 2021 at 07:46:06PM -0400, Len Brown wrote: > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote: > > > AMX: Multiplying a 4x4 matrix probably looks *great* in a > > microbenchmark. Do it once and you permanently allocate 8kB (is that > > even a constant? can it grow in newer parts?), potentially hurts all > > future context switches, and does who-knows-what to Turbo licenses and > > such. > > Intel expects that AMX will be extremely valuable to key workloads. > It is true that you may never run that kind of workload on the machine > in front of you, > and so you have every right to be doubtful about the value of AMX. > > The AMX architectural state size is not expected to change. > Rather, if a "new AMX" has a different state size, it is expected to > use a new feature bit, different from AMX. > > The AMX context switch buffer is allocated only if and when a task > touches AMX registers. > > Yes, there will be data transfer to and from that buffer when three > things all happen. > 1. the data is valid > 2. hardware interrupts the application > 3. the kernel decides to context switch. As a userspace developer of a proxy, my code is extremely sensitive to syscall cost and works in environments where 1 million interrupts/s is not uncommon. 
Additionally the data I process are small HTTP headers and I already had to reimplement my own byte-level memcmp because the overhead of some libc implementations deciding which variant to use to compare 5 bytes was higher than the time to iterate over them. So I'm among those userspace developers who grumble each time new technology is automatically adopted by the compiler and libs, because that tends to force me to figure out what the impact is and how to work around it. I have no idea what AMX could bring me but reading the above makes me think that it has a great potential of significantly hurting the performance if one lib decides to occasionally make use of it. It would possibly be similar if a lib decided to use AVX-512 to copy data and if it resulted in the CPU quickly reaching its TDP and starting to throttle like crazy :-/ Thus I think that the first thing to think about before introducing possibly cost-sensitive optimizations is: how do I allow user-space to easily disable them for a task, and how do I allow an admin to easily disable them system-wide. "echo !foobar > cpuinfo" could be a nice way to mask a flag system-wide for example. prctl() would be nice for a task (as long as it's not too late already). Maybe the API should be surrounded by __amx_begin() / __amx_end(), with the calls having undefined behavior outside of these. These calls would set a flag somewhere asking to extend the stacks, or __amx_begin() could even point itself to the specific stack to be used. This way it could possibly allow some userspace libraries to use it for small stuff without permanently impacting the rest of the process. > At the risk of stating the obvious... > Intel's view is that libraries that deliver the most value from the > hardware are a "good thing", > and that anything preventing libraries from getting the most value > from the hardware is a "bad thing":-) As a developer I have a different view.
Anything that requires to build using different libraries depending on the systems is a real hassle, and I want to focus on the same code to run everywhere. I'm fine with some #ifdef in the code if I know that a specific part must run as fast as possible, and even some runtime detection at various points but do not want to have to deal with extra dependencies that further increase the test matrix and combinations in bug reports. Just my two cents, Willy ^ permalink raw reply [flat|nested] 130+ messages in thread
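The __amx_begin() / __amx_end() bracketing suggested above could look something like the following userspace-only model. Everything here is hypothetical: the function names (amx_begin_sketch, amx_end_sketch, amx_allowed_here) are invented, identifiers with a leading double underscore are avoided because they are reserved, and no kernel interface is involved. It only shows the shape of the idea: a per-thread region counter that a library consults before taking an AMX code path.

```c
#include <stdbool.h>

/* Per-thread nesting depth of active "AMX allowed" regions (hypothetical). */
static _Thread_local unsigned int amx_region_depth;

/* Enter a region in which AMX code paths may be taken. */
static inline void amx_begin_sketch(void)
{
    amx_region_depth++;
    /* A real implementation might request a larger signal stack here. */
}

/* Leave the region; unbalanced calls are ignored rather than underflowing. */
static inline void amx_end_sketch(void)
{
    if (amx_region_depth > 0)
        amx_region_depth--;
    /*
     * A real implementation would presumably return the tile registers to
     * their INIT state once the outermost region ends, so that later
     * context switches carry no AMX data.
     */
}

/* Libraries would check this before choosing an AMX implementation. */
static inline bool amx_allowed_here(void)
{
    return amx_region_depth > 0;
}
```

The design choice worth noting is that the default is "not allowed": code that never calls the begin function never pays for the feature, which matches the concern about libraries opting a whole process in behind its back.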
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-13 3:43 ` Willy Tarreau @ 2021-04-13 19:51 ` Len Brown 2021-04-14 9:58 ` Borislav Petkov 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-13 19:51 UTC (permalink / raw) To: Willy Tarreau Cc: Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer Thanks for sharing your perspective, Willy. I agree that if your application is so sensitive that you need to hand-code your own memcmp, then linking with (any) new version of (any) dynamic library is a risk you must consider carefully. AMX does the type of matrix multiplication that AI algorithms use. In the unlikely event that you or one of the libraries you call is doing the same, then you will be very happy with AMX. Otherwise, you'll probably not use it. I acknowledge the issue with the toolchain transparently using AVX-512 for copying data, and how that approach impacted systems with a poor AVX-512 hardware implementation. FWIW, I'm not aware of any plans to implicitly use AMX this way, and I'm not aware of any non-Xeon AMX implementations in the near future. cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-13 19:51 ` Len Brown @ 2021-04-14 9:58 ` Borislav Petkov 2021-04-14 10:06 ` Willy Tarreau 2021-04-14 21:57 ` Len Brown 0 siblings, 2 replies; 130+ messages in thread From: Borislav Petkov @ 2021-04-14 9:58 UTC (permalink / raw) To: Len Brown Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Tue, Apr 13, 2021 at 03:51:50PM -0400, Len Brown wrote: > AMX does the type of matrix multiplication that AI algorithms use. In > the unlikely event that you or one of the libraries you call are doing > the same, then you will be very happy with AMX. Otherwise, you'll > probably not use it. Which sounds to me like AMX is something which should not be enabled automatically but explicitly requested. I don't see the majority of the processes on the majority of the Linux machines out there doing AI with AMX - at least not anytime soon. If it becomes ubiquitous later, we can make it automatic then. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-14 9:58 ` Borislav Petkov @ 2021-04-14 10:06 ` Willy Tarreau 2021-04-14 10:08 ` Borislav Petkov 2021-04-14 21:57 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Willy Tarreau @ 2021-04-14 10:06 UTC (permalink / raw) To: Borislav Petkov Cc: Len Brown, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Wed, Apr 14, 2021 at 11:58:04AM +0200, Borislav Petkov wrote: > On Tue, Apr 13, 2021 at 03:51:50PM -0400, Len Brown wrote: > > AMX does the type of matrix multiplication that AI algorithms use. In > > the unlikely event that you or one of the libraries you call are doing > > the same, then you will be very happy with AMX. Otherwise, you'll > > probably not use it. > > Which sounds to me like AMX is something which should not be enabled > automatically but explicitly requested. I don't see the majority of the > processes on the majority of the Linux machines out there doing AI with > AMX - at least not anytime soon. If it becomes ubiquitous later, we can > make it automatic then. And change jobs :-) Willy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-14 10:06 ` Willy Tarreau @ 2021-04-14 10:08 ` Borislav Petkov 0 siblings, 0 replies; 130+ messages in thread From: Borislav Petkov @ 2021-04-14 10:08 UTC (permalink / raw) To: Willy Tarreau Cc: Len Brown, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Wed, Apr 14, 2021 at 12:06:39PM +0200, Willy Tarreau wrote: > And change jobs :-) I think by the time that happens, we'll be ready to go to the eternal vacation. Which means: not my problem. :-))) -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-14 9:58 ` Borislav Petkov 2021-04-14 10:06 ` Willy Tarreau @ 2021-04-14 21:57 ` Len Brown 2021-04-15 4:43 ` Borislav Petkov 1 sibling, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-14 21:57 UTC (permalink / raw) To: Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Wed, Apr 14, 2021 at 5:58 AM Borislav Petkov <bp@alien8.de> wrote: > > On Tue, Apr 13, 2021 at 03:51:50PM -0400, Len Brown wrote: > > AMX does the type of matrix multiplication that AI algorithms use. In > > the unlikely event that you or one of the libraries you call are doing > > the same, then you will be very happy with AMX. Otherwise, you'll > > probably not use it. > > Which sounds to me like AMX is something which should not be enabled > automatically but explicitly requested. I don't see the majority of the > processes on the majority of the Linux machines out there doing AI with > AMX - at least not anytime soon. If it becomes ubiquitous later, we can > make it automatic then. I'm pretty sure that the "it isn't my use case of interest, so it doesn't matter" line of reasoning has long been established as -EINVAL ;-) ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-14 21:57 ` Len Brown @ 2021-04-15 4:43 ` Borislav Petkov 2021-04-15 5:29 ` Willy Tarreau 0 siblings, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-04-15 4:43 UTC (permalink / raw) To: Len Brown Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Wed, Apr 14, 2021 at 05:57:22PM -0400, Len Brown wrote: > I'm pretty sure that the "it isn't my use case of interest, so it > doesn't matter" line of reasoning has long been established as -EINVAL > ;-) I have only a very faint idea what you're trying to say here. Please explain properly and more verbosely what exactly has been established where? Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-15 4:43 ` Borislav Petkov @ 2021-04-15 5:29 ` Willy Tarreau 2021-04-15 5:47 ` Borislav Petkov 0 siblings, 1 reply; 130+ messages in thread From: Willy Tarreau @ 2021-04-15 5:29 UTC (permalink / raw) To: Borislav Petkov Cc: Len Brown, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Thu, Apr 15, 2021 at 06:43:43AM +0200, Borislav Petkov wrote: > On Wed, Apr 14, 2021 at 05:57:22PM -0400, Len Brown wrote: > > I'm pretty sure that the "it isn't my use case of interest, so it > > doesn't matter" line of reasoning has long been established as -EINVAL > > ;-) > > I have only a very faint idea what you're trying to say here. Please > explain properly and more verbosely what exactly has been established > where? What Len is saying is that not being interested in a feature is not an argument for rejecting its adoption, which I'm perfectly fine with. But conversely not being interested in a feature is also an argument for insisting that its adoption doesn't harm other use cases (generally speaking, not this specific case here). Willy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-15 5:29 ` Willy Tarreau @ 2021-04-15 5:47 ` Borislav Petkov 2021-04-16 22:05 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-04-15 5:47 UTC (permalink / raw) To: Willy Tarreau Cc: Len Brown, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Thu, Apr 15, 2021 at 07:29:38AM +0200, Willy Tarreau wrote: > What Len is saying is that not being interested in a feature is not an > argument for rejecting its adoption, Oh, I'm not rejecting its adoption - no, don't mean that. > which I'm perfectly fine with. But conversely not being interested in > a feature is also an argument for insisting that its adoption doesn't > harm other use cases (generally speaking, not this specific case > here). Pretty much. What I'd like to see is 0-overhead for current use cases and only overhead for those who want to use it. If that can't be done automagically, then users should request it explicitly. So basically you blow up the xsave buffer only for processes which want to do AMX. And this brings the question about libraries which, if they start using AMX by default - which doesn't sound like they will want to because AMX reportedly will have only a limited? set of users - if libraries start using it by default, then it better be worth the handling of the 8kb buffer per process. If not, this should also be requestable per process so that a simple pipe in Linux: <process> | grep | awk | sed ... and so on is not penalized to allocate and handle by default 8kb for *each* process' buffer in that pipe just because each is linking against glibc which has detected AMX support in CPUID and is using it too for some weird reason like some microbenchmark saying so. All AFAIU, ofc. 
But my initial question was on the "establishing" part and was asking where we have established anything wrt AMX. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-15 5:47 ` Borislav Petkov @ 2021-04-16 22:05 ` Len Brown 2021-04-19 14:14 ` Borislav Petkov 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-16 22:05 UTC (permalink / raw) To: Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Thu, Apr 15, 2021 at 1:47 AM Borislav Petkov <bp@alien8.de> wrote: > What I'd like to see is 0-overhead for current use cases and only > overhead for those who want to use it. If that can't be done > automagically, then users should request it explicitly. So basically you > blow up the xsave buffer only for processes which want to do AMX. Indeed, expanding the xsave buffer happens only for tasks that touch AMX TILE registers. > And this brings the question about libraries which, if they start using > AMX by default - which doesn't sound like they will want to because AMX > reportedly will have only a limited? set of users - if libraries start > using it by default, then it better be worth the handling of the 8kb > buffer per process. I'm not aware of any intent to transparently use AMX for bcopy, like what happened with AVX-512. (didn't they undo that mistake?) > If not, this should also be requestable per process so that a simple > pipe in Linux: > > <process> | grep | awk | sed ... > > and so on is not penalized to allocate and handle by default 8kb for > *each* process' buffer in that pipe just because each is linking against > glibc which has detected AMX support in CPUID and is using it too for > some weird reason like some microbenchmark saying so. Tasks are created without an 8KB AMX buffer. Tasks have to actually touch the AMX TILE registers for us to allocate one for them. > But my initial question was on the "establishing" part and was asking > where we have established anything wrt AMX. 
The patch set on LKML establishes working AMX Linux support in public. I am thankful for your and other public review and feedback on that series. I can think of 3 actual bugs that were found in the process. thanks, Len Brown Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-16 22:05 ` Len Brown @ 2021-04-19 14:14 ` Borislav Petkov 2021-04-19 18:18 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-04-19 14:14 UTC (permalink / raw) To: Len Brown Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, Apr 16, 2021 at 06:05:10PM -0400, Len Brown wrote: > I'm not aware of any intent to transparently use AMX for bcopy, like > what happened > with AVX-512. (didn't they undo that mistake?) No clue, did they? > Tasks are created without an 8KB AMX buffer. > Tasks have to actually touch the AMX TILE registers for us to allocate > one for them. When tasks do that it doesn't matter too much - for the library it does! If the library does that by default and the processes which make up that pipe I mentioned earlier all get 8K buffers because the underlying library decided so, and swinging those buffers around when saving/restoring contexts turns out to be a performance penalty, then we have lost. Lost because if that thing goes upstream in such a way that use of AMX is allowed implicitly, there is no fixing it anymore once it becomes an ABI. So, that library should ask the kernel whether it supports AMX and only use it if it has gotten a positive answer. And by default that answer should be "no" because the majority of processes - that same pipe I keep mentioning - don't need it. I have no good idea yet how granular that should be - per process, per thread, whatever, but there should be a way for the kernel to control whether the library uses AMX, AVX512 or whatever fat state is out there available. Then, if a process wants the library to use AMX on its behalf, then it can say so and the library can do that but only after having been asked to explicitly. Thx. -- Regards/Gruss, Boris.
https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-19 14:14 ` Borislav Petkov @ 2021-04-19 18:18 ` Len Brown 2021-04-19 19:15 ` Borislav Petkov 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-19 18:18 UTC (permalink / raw) To: Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 19, 2021 at 10:15 AM Borislav Petkov <bp@alien8.de> wrote: > > Tasks are created without an 8KB AMX buffer. > > Tasks have to actually touch the AMX TILE registers for us to allocate > > one for them. > > When tasks do that it doesn't matter too much - for the library it does! > > If the library does that by default and the processes which comprise > of that pipe I mentioned earlier, get all 8K buffers because the > underlying library decided so and swinging those buffers around when > saving/restoring contexts turns out to be a performance penalty, then we > have lost. > > Lost because if that thing goes upstream in this way of use of AMX is > allowed implicitly, there ain't fixing it anymore once it becomes an > ABI. > > So, that library should ask the kernel whether it supports AMX and only > use it if has gotten a positive answer. Right, the library *does* ask the kernel whether it supports AMX (below). > And by default that answer > should be "no" because the majority of processes - that same pipe I keep > mentioning - don't need it. Indeed, the default is "no" because most libraries will *not* ask the system for AMX support (below). However, if they *did* probe for it, and they *did* use it, the kernel would not stand in the way of any of those requests. > I have no good idea yet how granulary that should be - per process, per > thread, whatever, but there should be a way for the kernel to control > whether the library uses AMX, AVX512 or whatever fat state is out there > available. 
> > Then, if a process wants the library to use AMX on its behalf, then it > can say so and the library can do that but only after having asked for > explicitly. The ABI works like this: 0. App or library author decides AMX is useful at build-time. 1. App checks CPUID for AMX CPU feature 2. App checks XCR0 for AMX OS support (if app touches AMX without these two being TRUE, it will suffer the consequence of a #UD when it touches an AMX instruction) This ABI is how AVX works today. What is new with AMX is the ability of the hardware and the OS to delay the allocation of the context switch buffer until if/when it is actually needed. This is transparent, and thus not part of the ABI, unless you count the absence of a mandated system call to be an ABI. 3. the application then touches an AMX register, triggering... 4. #NM handled by the kernel, which allocates a context switch buffer for that task, and dis-arms XFD. Yes, we could invent a new system call and mandate that it be called between #2 and #3. However, we'd still do #4 in response, so I don't see any value to that system call. Indeed, I would advocate that glibc replace it with a return statement. So back to the example: <process> | grep | awk | sed ... Sure, if grep grows support for some AI feature that we haven't imagined yet, then something in its code flow is fully empowered to probe for AMX and use AMX on AMX hardware. Sort of hard to imagine with the programs above that we know today, but future programs certainly could do this if they chose to. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
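[Editorial sketch, not part of the thread.] Steps 1 and 2 of the sequence Len describes (check CPUID, then check XCR0) might look roughly like this in C. The bit positions are the ones documented in Intel's SDM: CPUID.1:ECX bit 27 (OSXSAVE), CPUID.(EAX=7,ECX=0):EDX bit 24 (AMX-TILE), and XCR0 bits 17 and 18 (tile config and tile data). All function and macro names are hypothetical and chosen for illustration only. The decision logic is kept separate from the hardware reads so that it can be exercised with made-up register values.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions per Intel SDM; macro names are illustrative. */
#define CPUID1_ECX_OSXSAVE  (1u << 27)   /* OS has enabled XSAVE/XGETBV */
#define CPUID7_EDX_AMX_TILE (1u << 24)   /* CPU implements AMX tiles    */
#define XCR0_XTILECFG       (1ull << 17)
#define XCR0_XTILEDATA      (1ull << 18)
#define XCR0_AMX            (XCR0_XTILECFG | XCR0_XTILEDATA)

/* Step 1: does the CPU advertise AMX? */
int cpu_has_amx_tile(uint32_t cpuid7_edx)
{
    return (cpuid7_edx & CPUID7_EDX_AMX_TILE) != 0;
}

/* Step 2: has the OS enabled both AMX state components in XCR0? */
int os_enabled_amx(uint64_t xcr0)
{
    return (xcr0 & XCR0_AMX) == XCR0_AMX;
}

/* Both checks combined; without OSXSAVE, XCR0 cannot even be read. */
int amx_usable(uint32_t cpuid1_ecx, uint32_t cpuid7_edx, uint64_t xcr0)
{
    if (!(cpuid1_ecx & CPUID1_ECX_OSXSAVE))
        return 0;
    return cpu_has_amx_tile(cpuid7_edx) && os_enabled_amx(xcr0);
}

/* Self-check with made-up register values (no hardware access). */
void amx_selfcheck(void)
{
    assert(amx_usable(CPUID1_ECX_OSXSAVE, CPUID7_EDX_AMX_TILE, 0x7 | XCR0_AMX));
    assert(!amx_usable(CPUID1_ECX_OSXSAVE, CPUID7_EDX_AMX_TILE, 0x7));
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>

/* Read XCR0 with XGETBV (ECX = 0). Only call after checking OSXSAVE. */
uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* The same decision, fed from the real registers. */
int amx_usable_on_this_cpu(void)
{
    unsigned int a, b, c, d, c1;

    if (!__get_cpuid(1, &a, &b, &c1, &d))
        return 0;
    if (!(c1 & CPUID1_ECX_OSXSAVE))
        return 0;                       /* XGETBV would raise #UD */
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return 0;
    return amx_usable(c1, d, read_xcr0());
}
#endif
```

Nothing here asks the kernel for permission; that is exactly the point under dispute in the surrounding messages.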
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-19 18:18 ` Len Brown @ 2021-04-19 19:15 ` Borislav Petkov 2021-04-19 21:33 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-04-19 19:15 UTC (permalink / raw) To: Len Brown Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 19, 2021 at 02:18:51PM -0400, Len Brown wrote: > Yes, we could invent a new system call and mandate that it be called > between #2 and #3. However, we'd still do #4 in response, so I don't > see any value to that system call. Lemme refresh your memory: there was a time when the kernel did lazy FPU switching because tasks which really wanted to do that, would use FPU insns and from the first use onwards, the kernel would shuffle an FPU state buffer back'n'forth for the task, for the length of its lifetime. Then glibc decided to use FPU in memcpy or whatever, leading up to *every* task using the FPU which practically made us remove all that lazy FPU switching logic and do eager FPU. Back then that state was what, dunno, 1-2 KB tops. Now imagine the same lazy => eager switch but with AVX or AMX or <insert fat buffer feature here>. All of a sudden you have *every* thread sporting a fat 8K buffer because the library decided to use a fat buffer feature for it. Nope, I don't want that to happen. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-19 19:15 ` Borislav Petkov @ 2021-04-19 21:33 ` Len Brown 2021-04-19 21:58 ` Borislav Petkov 2021-04-19 23:52 ` Paul Eggert 0 siblings, 2 replies; 130+ messages in thread From: Len Brown @ 2021-04-19 21:33 UTC (permalink / raw) To: Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 19, 2021 at 3:15 PM Borislav Petkov <bp@alien8.de> wrote: > All of a sudden you have *every* thread sporting a fat 8K buffer because > the library decided to use a fat buffer feature for it. > > Nope, I don't want that to happen. For this to happen, every thread would not only have to include/link-with code that uses AMX, but that code would have to *run*. I'm sure that the AI guys are super excited about matrix multiplication, but I have a hard time imagining why grep(1) would find a use for it. Indeed, if anyone expected AMX to be used by every task, we would have never gone to the trouble of inventing the XFD hardware to support the kernel's lazy 8KB buffer allocation. cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-19 21:33 ` Len Brown @ 2021-04-19 21:58 ` Borislav Petkov 2021-04-23 19:35 ` Len Brown 2021-04-19 23:52 ` Paul Eggert 1 sibling, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-04-19 21:58 UTC (permalink / raw) To: Len Brown Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 19, 2021 at 05:33:03PM -0400, Len Brown wrote: > For this to happen, every thread would not only have to include/link-with > code that uses AMX, but that code would have to *run*. It looks like either I'm not expressing myself clearly enough or you're not reading my text: the *library* does that decision automatically! Which means *every* possible thread on the system. Which means, *every* thread has a fat 8K buffer attached to it because the library uses AMX on its behalf by *default*. > I'm sure that the AI guys are super excited about matrix multiplication, > but I have a hard time imagining why grep(1) would find a use for it. It doesn't matter if you're imagining it or not - what matters is if the decision whether the thread uses AMX or not is put in the hands of the thread and *NOT* in the hands of the library. Which means, majority of the threads should not allow AMX and only a handful who do, will have to explicitly state that. And the library will have to comply. Not the library decides for every thread itself because the feature's there. > Indeed, if anyone expected AMX to be used by every task, we would have > never gone to the trouble of inventing the XFD hardware to support the > kernel's lazy 8KB buffer allocation. If it gives me fat-buffers-off-by-default and on only for a handful of threads which really want it and *request* it *explicitly*, sure, whatever gets the job done. -- Regards/Gruss, Boris. 
https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-19 21:58 ` Borislav Petkov @ 2021-04-23 19:35 ` Len Brown 2021-04-23 19:57 ` Borislav Petkov 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-23 19:35 UTC (permalink / raw) To: Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 19, 2021 at 5:58 PM Borislav Petkov <bp@alien8.de> wrote: > > On Mon, Apr 19, 2021 at 05:33:03PM -0400, Len Brown wrote: > > For this to happen, every thread would not only have to include/link-with > > code that uses AMX, but that code would have to *run*. > > ...the *library* does that decision automatically! > > Which means *every* possible thread on the system. > > Which means, *every* thread has a fat 8K buffer attached to it because > the library uses AMX on its behalf by *default*. Yes. If a library decides to execute AMX instructions on behalf of a task, the kernel will allocate an 8KB context switch buffer on behalf of that task. True. Nothing prevents every user task in the system from executing AMX instructions, whether explicitly or in a library, and the kernel will transparently allocate an 8KB buffer for each one. I do not know anybody who predicts or expects that every task in the system, or a universally executed library routine, will find a reason to run AMX instructions. Again, if that were the expectation or the intent, the proposal would be to statically allocate an 8KB context switch buffer on AMX hardware, instead of dynamic allocation. Today, libraries routinely probe for what instructions are available and decide to use them, or not. I think that most customers consider that a desirable feature -- since it allows them to run the same application on multiple hardware generations and get the most out of each generation. I respect your right to dislike that feature.
Granted, if you find a reason to dislike AMX, the mechanisms to disable it today are on a system-wide basis, not on a process or task basis. Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-23 19:35 ` Len Brown @ 2021-04-23 19:57 ` Borislav Petkov 2021-05-02 15:27 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Borislav Petkov @ 2021-04-23 19:57 UTC (permalink / raw) To: Len Brown Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, Apr 23, 2021 at 03:35:30PM -0400, Len Brown wrote: > Yes. If a library decides to execute AMX instructions on behalf > of a task, the kernel will allocate an 8KB context switch buffer > on behalf of that task. Again, the library should ask the kernel first whether it supports AMX. And the process should decide whether to use AMX - not the library on its own, on behalf of the process. > Granted, if you find a reason to dislike AMX, the mechanisms to disable > it today are on a system-wide basis, not on a process or task basis. Again, I don't dislike the feature. I don't want libraries jumping on new features without asking the process or the kernel first especially when those features have performance implications and need kernel support. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-23 19:57 ` Borislav Petkov @ 2021-05-02 15:27 ` Len Brown 2021-05-03 5:18 ` Florian Weimer ` (2 more replies) 0 siblings, 3 replies; 130+ messages in thread From: Len Brown @ 2021-05-02 15:27 UTC (permalink / raw) To: Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, Apr 23, 2021 at 3:58 PM Borislav Petkov <bp@alien8.de> wrote: > > On Fri, Apr 23, 2021 at 03:35:30PM -0400, Len Brown wrote: > > Yes. If a library decides to execute AMX instructions on behalf > > of a task, the kernel will allocate an 8KB context switch buffer > > on behalf of that task. > > Again, the library should ask the kernel first whether it supports AMX. > And the process should decide whether to use AMX - not the library on > its own, on behalf of the process. > > > Granted, if you find a reason to dislike AMX, the mechanisms to disable > > it today are on a system-wide basis, not on a process or task basis. > > Again, I don't dislike the feature. I don't want libraries jumping on > new features without asking the process or the kernel first especially > when those features have performance implications and need kernel > support. Here is how it works: 1. The kernel boots and sees the feature in CPUID. 2. If the kernel supports that feature, it sets XCR0[feature]. For some features, there may be a bunch of kernel support, while simple features may require only state save/restore. 2a. If the kernel doesn't support the feature, XCR0[feature] remains cleared. 3. user-space sees the feature in CPUID 4. user-space sees for the feature via xgetbv[XCR0] 5. If the feature is enabled in XCR0, the user happily uses it. For AMX, Linux implements "transparent first use" so that it doesn't have to allocate 8KB context switch buffers for tasks that don't actually use AMX. 
It does this by arming XFD for all tasks, and taking a #NM to allocate a context switch buffer only for those tasks that actually execute AMX instructions. 5a. If the feature is not enabled in XCR0, and the task uses those instructions anyway, the hardware delivers a #UD and the kernel kills the process with a signal. So you already have what you want, WRT user-space being required to ask the kernel if the feature is supported. When the kernel sets XCR0[feature], it tells user-space that the kernel supports the feature; and there is no way that user space can use the feature if the kernel did not set that bit. System calls before (and after) using a feature are not necessary, and would only degrade performance over the existing ABI. Regarding performance implications... The system administrator is empowered to enable or disable a feature in BIOS (clears CPUID bit) or in the kernel (clears XCR0 bit) if they don't like it. Yes, this is a per-system decision, rather than a per-process or a per-thread decision. So the only thing you don't have that you asked for is a way for the main process to control what instructions are used by the libraries that it links with. That one is above my pay grade. Does the application have a *choice* of which libraries they link with? cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
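[Editorial sketch, not part of the thread.] The "transparent first use" behaviour described in step 5 can be modelled in a few lines of user-space C. This is a toy model only: `struct toy_task`, `toy_handle_nm()` and the other names are hypothetical, `calloc()` stands in for the kernel's allocation, and nothing here reflects the actual patch series. It only shows the state machine: XFD starts armed, the first AMX use traps, the trap handler allocates the 8 KB buffer and disarms XFD, and later uses do not trap.

```c
#include <assert.h>
#include <stdlib.h>

#define AMX_STATE_BYTES 8192    /* the 8 KB figure quoted in the thread */

/* Toy per-task state; field names are illustrative. */
struct toy_task {
    int   xfd_armed;    /* 1: the next AMX instruction traps (#NM)  */
    void *amx_buf;      /* context-switch buffer, NULL until needed */
};

/* New tasks start with XFD armed and no buffer. */
void toy_task_init(struct toy_task *t)
{
    t->xfd_armed = 1;
    t->amx_buf = NULL;
}

/* Models the #NM handler: allocate the buffer, then disarm XFD.
 * Returns 0 on success, -1 if the allocation fails. */
int toy_handle_nm(struct toy_task *t)
{
    assert(t->amx_buf == NULL);     /* we only trap before first use */
    t->amx_buf = calloc(1, AMX_STATE_BYTES);
    if (!t->amx_buf)
        return -1;
    t->xfd_armed = 0;
    return 0;
}

/* Models the task executing an AMX instruction.
 * Returns 1 if a trap was taken, 0 if not, -1 if the trap failed. */
int toy_use_amx(struct toy_task *t)
{
    if (!t->xfd_armed)
        return 0;
    return toy_handle_nm(t) == 0 ? 1 : -1;
}

/* Task teardown: release the buffer if one was ever allocated. */
void toy_task_exit(struct toy_task *t)
{
    free(t->amx_buf);
    t->amx_buf = NULL;
    t->xfd_armed = 1;
}
```

In this model a task that never calls `toy_use_amx()` never pays for the buffer, which is the property Len is describing; the -1 return path is the failure case the follow-up messages ask about.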
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-02 15:27 ` Len Brown @ 2021-05-03 5:18 ` Florian Weimer 2021-05-03 13:43 ` Dave Hansen 2021-05-08 9:45 ` Thomas Gleixner 2021-05-17 9:45 ` Thomas Gleixner 2 siblings, 1 reply; 130+ messages in thread From: Florian Weimer @ 2021-05-03 5:18 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer * Len Brown: > 5. If the feature is enabled in XCR0, the user happily uses it. > > For AMX, Linux implements "transparent first use" > so that it doesn't have to allocate 8KB context switch > buffers for tasks that don't actually use AMX. > It does this by arming XFD for all tasks, and taking a #NM > to allocate a context switch buffer only for those tasks > that actually execute AMX instructions. What happens if the kernel cannot allocate that additional context switch buffer? Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-03 5:18 ` Florian Weimer @ 2021-05-03 13:43 ` Dave Hansen 2021-05-03 13:47 ` Florian Weimer 2021-05-07 18:44 ` Thomas Gleixner 0 siblings, 2 replies; 130+ messages in thread From: Dave Hansen @ 2021-05-03 13:43 UTC (permalink / raw) To: Florian Weimer, Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On 5/2/21 10:18 PM, Florian Weimer wrote: >> 5. If the feature is enabled in XCR0, the user happily uses it. >> >> For AMX, Linux implements "transparent first use" >> so that it doesn't have to allocate 8KB context switch >> buffers for tasks that don't actually use AMX. >> It does this by arming XFD for all tasks, and taking a #NM >> to allocate a context switch buffer only for those tasks >> that actually execute AMX instructions. > What happens if the kernel cannot allocate that additional context > switch buffer? Well, it's vmalloc()'d and currently smaller than the kernel stack, which is also vmalloc()'d. While it can theoretically fail, if it happens you have bigger problems on your hands. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-03 13:43 ` Dave Hansen @ 2021-05-03 13:47 ` Florian Weimer 2021-05-03 14:14 ` Dave Hansen 2021-05-07 18:44 ` Thomas Gleixner 1 sibling, 1 reply; 130+ messages in thread From: Florian Weimer @ 2021-05-03 13:47 UTC (permalink / raw) To: Dave Hansen Cc: Len Brown, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer * Dave Hansen: > On 5/2/21 10:18 PM, Florian Weimer wrote: >>> 5. If the feature is enabled in XCR0, the user happily uses it. >>> >>> For AMX, Linux implements "transparent first use" >>> so that it doesn't have to allocate 8KB context switch >>> buffers for tasks that don't actually use AMX. >>> It does this by arming XFD for all tasks, and taking a #NM >>> to allocate a context switch buffer only for those tasks >>> that actually execute AMX instructions. >> What happens if the kernel cannot allocate that additional context >> switch buffer? > > Well, it's vmalloc()'d and currently smaller that the kernel stack, > which is also vmalloc()'d. While it can theoretically fail, if it > happens you have bigger problems on your hands. Not sure if I understand. Is your position that the kernel should terminate processes if it runs out of memory instead reporting proper errors, even if memory overcommit is disabled? Kernel stack allocation is different because it happens at system call, so errors can be properly reported. Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-03 13:47 ` Florian Weimer @ 2021-05-03 14:14 ` Dave Hansen 0 siblings, 0 replies; 130+ messages in thread From: Dave Hansen @ 2021-05-03 14:14 UTC (permalink / raw) To: Florian Weimer Cc: Len Brown, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On 5/3/21 6:47 AM, Florian Weimer wrote: > * Dave Hansen: > >> On 5/2/21 10:18 PM, Florian Weimer wrote: >>>> 5. If the feature is enabled in XCR0, the user happily uses it. >>>> >>>> For AMX, Linux implements "transparent first use" >>>> so that it doesn't have to allocate 8KB context switch >>>> buffers for tasks that don't actually use AMX. >>>> It does this by arming XFD for all tasks, and taking a #NM >>>> to allocate a context switch buffer only for those tasks >>>> that actually execute AMX instructions. >>> What happens if the kernel cannot allocate that additional context >>> switch buffer? >> Well, it's vmalloc()'d and currently smaller that the kernel stack, >> which is also vmalloc()'d. While it can theoretically fail, if it >> happens you have bigger problems on your hands. > Not sure if I understand. > > Is your position that the kernel should terminate processes if it runs > out of memory instead reporting proper errors, even if memory overcommit > is disabled? I assume you mean sysctl vm.overcommit=2 by "overcommit is disabled"? > When this flag is 2, the kernel uses a "never overcommit" > policy that attempts to prevent any overcommit of memory. > Note that user_reserve_kbytes affects this policy. Note the "attempts". So, no, the kernel should not be terminating processes when it runs out of memory. It *attempts* not to do that. What you are seeing here with a demand-based XSAVE buffer allocation driven by a #NM fault is the *addition* of a case where those attempts can fail, not the creation of the first one. 
The addition of this case doesn't bother me because I don't think it will ultimately be visible to end users. If I'm wrong, and our HPC friends who are so enamored with "vm.overcommit=2" end up seeing lots of SIGSEGV's where they would rather see syscall failures, there's an easy solution: disable first-use detection. Stop dynamically allocating XSAVE buffers on faults. Actually, if we don't have a tunable or boot parameter for that now, we should add one. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-03 13:43 ` Dave Hansen 2021-05-03 13:47 ` Florian Weimer @ 2021-05-07 18:44 ` Thomas Gleixner 2021-05-07 18:50 ` Andy Lutomirski 1 sibling, 1 reply; 130+ messages in thread From: Thomas Gleixner @ 2021-05-07 18:44 UTC (permalink / raw) To: Dave Hansen, Florian Weimer, Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, May 03 2021 at 06:43, Dave Hansen wrote: > On 5/2/21 10:18 PM, Florian Weimer wrote: >>> 5. If the feature is enabled in XCR0, the user happily uses it. >>> >>> For AMX, Linux implements "transparent first use" >>> so that it doesn't have to allocate 8KB context switch >>> buffers for tasks that don't actually use AMX. >>> It does this by arming XFD for all tasks, and taking a #NM >>> to allocate a context switch buffer only for those tasks >>> that actually execute AMX instructions. >> What happens if the kernel cannot allocate that additional context >> switch buffer? > > Well, it's vmalloc()'d and currently smaller that the kernel stack, > which is also vmalloc()'d. While it can theoretically fail, if it > happens you have bigger problems on your hands. Such a buffer allocation might also exceed a per process/cgroup limitation. Anything else which is accounted happens in syscall context which then returns an error on which the application can react. So what's the consequence when the allocation fails? Kill it right away from #NM? Kill it on the first signal? Do nothing and see what happens? Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-07 18:44 ` Thomas Gleixner @ 2021-05-07 18:50 ` Andy Lutomirski 2021-05-07 19:22 ` Thomas Gleixner 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-05-07 18:50 UTC (permalink / raw) To: Thomas Gleixner Cc: Dave Hansen, Florian Weimer, Len Brown, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, May 7, 2021 at 11:44 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Mon, May 03 2021 at 06:43, Dave Hansen wrote: > > On 5/2/21 10:18 PM, Florian Weimer wrote: > >>> 5. If the feature is enabled in XCR0, the user happily uses it. > >>> > >>> For AMX, Linux implements "transparent first use" > >>> so that it doesn't have to allocate 8KB context switch > >>> buffers for tasks that don't actually use AMX. > >>> It does this by arming XFD for all tasks, and taking a #NM > >>> to allocate a context switch buffer only for those tasks > >>> that actually execute AMX instructions. > >> What happens if the kernel cannot allocate that additional context > >> switch buffer? > > > > Well, it's vmalloc()'d and currently smaller that the kernel stack, > > which is also vmalloc()'d. While it can theoretically fail, if it > > happens you have bigger problems on your hands. > > Such a buffer allocation might also exceed a per process/cgroup > limitation. Anything else which is accounted happens in syscall context > which then returns an error on which the application can react. > > So what's the consequence when the allocation fails? Kill it right away > from #NM? Kill it on the first signal? Do nothing and see what happens? > It has to be an immediate signal or kill. A failure to load FPU state is somewhat tolerable (and has to be for CET), but a failure to *save* FPU state on a context switch would be a really nasty can of worms. 
At the very least we will want arch_prctl(ARCH_ALLOCTE_XSTATE, mask) to allow HPC workloads to manually allocate the state and get an error code if it fails. --Andy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-07 18:50 ` Andy Lutomirski @ 2021-05-07 19:22 ` Thomas Gleixner 0 siblings, 0 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-07 19:22 UTC (permalink / raw) To: Andy Lutomirski Cc: Dave Hansen, Florian Weimer, Len Brown, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, May 07 2021 at 11:50, Andy Lutomirski wrote: > On Fri, May 7, 2021 at 11:44 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> >> On Mon, May 03 2021 at 06:43, Dave Hansen wrote: >> > On 5/2/21 10:18 PM, Florian Weimer wrote: >> >>> 5. If the feature is enabled in XCR0, the user happily uses it. >> >>> >> >>> For AMX, Linux implements "transparent first use" >> >>> so that it doesn't have to allocate 8KB context switch >> >>> buffers for tasks that don't actually use AMX. >> >>> It does this by arming XFD for all tasks, and taking a #NM >> >>> to allocate a context switch buffer only for those tasks >> >>> that actually execute AMX instructions. >> >> What happens if the kernel cannot allocate that additional context >> >> switch buffer? >> > >> > Well, it's vmalloc()'d and currently smaller that the kernel stack, >> > which is also vmalloc()'d. While it can theoretically fail, if it >> > happens you have bigger problems on your hands. >> >> Such a buffer allocation might also exceed a per process/cgroup >> limitation. Anything else which is accounted happens in syscall context >> which then returns an error on which the application can react. >> >> So what's the consequence when the allocation fails? Kill it right away >> from #NM? Kill it on the first signal? Do nothing and see what happens? >> > It has to be an immediate signal or kill. Killing it right there is the only sensible thing to do. 
> A failure to load FPU state is somewhat tolerable (and has to be for > CET), but a failure to *save* FPU state on a context switch would be a > really nasty can of worms. :) > At the very least we will want arch_prctl(ARCH_ALLOCTE_XSTATE, mask) > to allow HPC workloads to manually allocate the state and get an error > code if it fails. Yes. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
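[Editorial sketch, not part of the thread.] The difference between the two failure modes agreed on above (an error code from an explicit allocation call versus a kill from the #NM path) can be illustrated with a toy accounting model. Everything here is hypothetical: `struct mem_limit` stands in for whatever per-process or cgroup limit applies, `acct_prealloc_xstate()` stands in for the proposed explicit arch_prctl, and `acct_first_use_fault()` stands in for the fault path. No real kernel interface is being described.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define AMX_STATE_BYTES 8192    /* the 8 KB figure quoted in the thread */

/* Stand-in for a per-process or cgroup memory limit. */
struct mem_limit {
    size_t limit;
    size_t used;
};

/* Toy task: which limit it is charged to, and what happened to it. */
struct acct_task {
    struct mem_limit *lim;
    int has_amx_buf;
    int killed;
};

/* Charge n bytes against the limit, or fail with -ENOMEM. */
static int acct_charge(struct mem_limit *l, size_t n)
{
    assert(l->used <= l->limit);
    if (l->limit - l->used < n)
        return -ENOMEM;
    l->used += n;
    return 0;
}

/* Explicit path (syscall-like): failure is an error code that the
 * caller can react to, for example by falling back to non-AMX code. */
int acct_prealloc_xstate(struct acct_task *t)
{
    int ret;

    if (t->has_amx_buf)
        return 0;               /* already allocated, nothing to do */
    ret = acct_charge(t->lim, AMX_STATE_BYTES);
    if (ret)
        return ret;
    t->has_amx_buf = 1;
    return 0;
}

/* Implicit path (first-use fault): there is no return value to hand
 * back to the faulting instruction, so failure means the task dies. */
void acct_first_use_fault(struct acct_task *t)
{
    if (t->has_amx_buf)
        return;
    if (acct_charge(t->lim, AMX_STATE_BYTES))
        t->killed = 1;
    else
        t->has_amx_buf = 1;
}
```

A task that preallocates successfully never hits the kill path, which is why the explicit call is attractive for workloads that want a reportable error.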
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-02 15:27 ` Len Brown 2021-05-03 5:18 ` Florian Weimer @ 2021-05-08 9:45 ` Thomas Gleixner 2021-05-18 20:39 ` Len Brown 2021-05-17 9:45 ` Thomas Gleixner 2 siblings, 1 reply; 130+ messages in thread From: Thomas Gleixner @ 2021-05-08 9:45 UTC (permalink / raw) To: Len Brown, Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Sun, May 02 2021 at 11:27, Len Brown wrote: > On Fri, Apr 23, 2021 at 3:58 PM Borislav Petkov <bp@alien8.de> wrote: > Here is how it works: > > 1. The kernel boots and sees the feature in CPUID. > > 2. If the kernel supports that feature, it sets XCR0[feature]. > > For some features, there may be a bunch of kernel support, > while simple features may require only state save/restore. > > 2a. If the kernel doesn't support the feature, XCR0[feature] remains cleared. > > 3. user-space sees the feature in CPUID > > 4. user-space sees for the feature via xgetbv[XCR0] > > 5. If the feature is enabled in XCR0, the user happily uses it. > > For AMX, Linux implements "transparent first use" > so that it doesn't have to allocate 8KB context switch > buffers for tasks that don't actually use AMX. > It does this by arming XFD for all tasks, and taking a #NM > to allocate a context switch buffer only for those tasks > that actually execute AMX instructions. > > 5a. If the feature is not enabled in XCR0, and the tasks uses > those instructions anway, the hardware delivers a #UD > and the kernel kills the process with a signal. Where is #6 which describes the signal interaction? Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-08 9:45 ` Thomas Gleixner @ 2021-05-18 20:39 ` Len Brown 2021-05-19 23:29 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-05-18 20:39 UTC (permalink / raw) To: Thomas Gleixner Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Sat, May 8, 2021 at 5:45 AM Thomas Gleixner <tglx@linutronix.de> wrote: > Where is #6 which describes the signal interaction? #6 Per the current ABI, Linux gives signal handlers access to all of the hardware architectural state. #6a Signal Stack is on User Stack The architectural state is pushed on the user stack in uncompressed XSTATE format. It is established that there exists application code that counts on this opaque state being complete so that it can do a user-space XRESTORE instead of a sigreturn(2). (My opinion is that not breaking that legacy code is a requirement, and I'm actually shocked this view is not unanimous) If a feature is enabled in XCR0 but is in INIT state, the XSAVE will transfer zeros. While this is established for AVX-512, we optimize for this case for AMX by checking for this scenario and not transferring any data. (this optimization, and the self-test for it, is in AMX patch series v5) The signal handler is empowered to alter everything in XSTATE on the signal stack. Upon sigreturn, the kernel will dutifully XRESTORE the XSTATE. #6b Applications that allocate and register a dedicated alternate signal stack Run-time is similar to above, except the user has allocated a dedicated signal stack. The problem is that the user had to decide this stack's size.
Unfortunately, the signal.h ABI contained the #define constants MINSIGSTKSZ and SIGSTKSZ (2k/8k), which were: a) constant b) not updated in decades The kernel, for its part, also failed to check that an altstack was big enough before writing to it. Indeed, AVX-512 made the 2k constant a lie, which Andy points out is ABI breakage. This is factual, and there were real programs that broke because of it. Were AMX to be deployed in this manner without repairing the broken ABI, the 8K state would exceed both of these constants, and that would be more severe breakage than AVX-512. glibc 2.34 addressed both the existing and future problem, by updating these constants to be calculated at run-time. The run-time calculation can be done entirely in glibc, or if glibc is running on an updated kernel, it will ask the kernel for the size via the auxiliary vector (auxv). Further, the kernel has been updated to check for alt-stack too-small at run-time. https://lore.kernel.org/lkml/20210518200320.17239-1-chang.seok.bae@intel.com/ I believe that all feedback has been addressed in that patch series, and that it is ready for linux-next. There are still two potential failures on systems that have AVX-512/AMX enabled: 1. program, re-compiled or not, that hard-codes its own too-small alt-stack 2. legacy static binary using old signal.h constants to allocate alt-stack. The kernel will not prohibit these programs from executing, but if they actually take a signal, the kernel will SIGSEGV them instead of overflowing their stack. Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-18 20:39 ` Len Brown @ 2021-05-19 23:29 ` Andy Lutomirski 2021-05-20 19:16 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-05-19 23:29 UTC (permalink / raw) To: Len Brown, Thomas Gleixner Cc: Borislav Petkov, Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On 5/18/21 1:39 PM, Len Brown wrote: > On Sat, May 8, 2021 at 5:45 AM Thomas Gleixner <tglx@linutronix.de> wrote: > >> Where is #6 which describes the signal interaction? > > #6 Per the current ABI, Linux gives signal handlers access to all of > the hardware architectural state. > > #6a Signal Stack is on User Stack > > The architectural state is pushed on the user stack in uncompressed > XSTATE format. > > It is established that there exists application code that counts on > this opaque state being complete so that it can do a user-space > XRESTORE instead of a sigreturn(2). Is this established? Note that the specific case of a user program doing XRSTOR will work just fine if we omit the allocation of non-in-use states from the buffer, at least by my reading of the pseudocode. The case that would break is if user code then assumes that it can XSAVE back to the same buffer. > (My opinion is that not breaking > that legacy code is a requirement, and I'm actually shocked this view > is not unanimous) > It's pretty unanimous. But the legacy code that's broken has to actually exist for this to apply. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-19 23:29 ` Andy Lutomirski @ 2021-05-20 19:16 ` Len Brown 0 siblings, 0 replies; 130+ messages in thread From: Len Brown @ 2021-05-20 19:16 UTC (permalink / raw) To: Andy Lutomirski Cc: Thomas Gleixner, Borislav Petkov, Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Wed, May 19, 2021 at 7:29 PM Andy Lutomirski <luto@kernel.org> wrote: > > It is established that there exists application code that counts on > > this opaque state being complete so that it can do a user-space > > XRESTORE instead of a sigreturn(2). > > Is this established? > > Note that the specific case of a user program doing XRSTOR will work > just fine if we omit the allocation of non-in-use states from the > buffer, at least by my reading of the pseudocode. Yes, your understanding is correct -- XRESTOR works as one would expect. > The case that would > break is if user code then assumes that it can XSAVE back to the same > buffer. The other case that would break is if the concept of what features were supported (eg. XCR0) changed between when the context was saved and when it was subsequently restored. Yes, if a feature appeared, you'd get INIT; but if a feature went away, you would fault. I've been told that user-space software exists that does this. If I can find specific examples, I'll share that. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-02 15:27 ` Len Brown 2021-05-03 5:18 ` Florian Weimer 2021-05-08 9:45 ` Thomas Gleixner @ 2021-05-17 9:45 ` Thomas Gleixner 2021-05-17 9:56 ` Florian Weimer ` (2 more replies) 2 siblings, 3 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-17 9:45 UTC (permalink / raw) To: Len Brown, Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-api, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven Len, On Sun, May 02 2021 at 11:27, Len Brown wrote: > Here is how it works: > > 1. The kernel boots and sees the feature in CPUID. > > 2. If the kernel supports that feature, it sets XCR0[feature]. > > For some features, there may be a bunch of kernel support, > while simple features may require only state save/restore. > > 2a. If the kernel doesn't support the feature, XCR0[feature] remains cleared. > > 3. user-space sees the feature in CPUID > > 4. user-space sees for the feature via xgetbv[XCR0] > > 5. If the feature is enabled in XCR0, the user happily uses it. > > For AMX, Linux implements "transparent first use" > so that it doesn't have to allocate 8KB context switch > buffers for tasks that don't actually use AMX. > It does this by arming XFD for all tasks, and taking a #NM > to allocate a context switch buffer only for those tasks > that actually execute AMX instructions. I thought more about this and it's absolutely the wrong way to go for several reasons. AMX (or whatever comes next) is nothing else than a device and it just should be treated as such. The fact that it is not exposed via a driver and a device node does not matter at all. Not doing so requires this awkward buffer allocation issue via #NM with all it's downsides; it's just wrong to force the kernel to manage resources of a user space task without being able to return a proper error code. 
It also prevents fine grained control over access to this functionality. As AMX is clearly a shared resource which is not per HT thread (maybe not even per core) and it has impact on power/frequency it is important to be able to restrict access on a per process/cgroup scope. Having a proper interface (syscall, prctl) which user space can use to ask for permission and allocation of the necessary buffer(s) is clearly avoiding the downsides and provides the necessary mechanisms for proper control and failure handling. It's not the end of the world if something which wants to utilize this has to issue a syscall during detection. It does not matter whether that's a library or just the application code itself. That's a one off operation and every involved entity can cache the result in TLS. AVX512 has already proven that XSTATE management is fragile and error prone, so we really have to stop this instead of creating yet another half baked solution. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-17 9:45 ` Thomas Gleixner @ 2021-05-17 9:56 ` Florian Weimer 2021-05-17 10:18 ` Thomas Gleixner 2021-05-21 16:29 ` Len Brown 2021-05-17 13:49 ` Arjan van de Ven 2021-05-20 15:35 ` Len Brown 2 siblings, 2 replies; 130+ messages in thread From: Florian Weimer @ 2021-05-17 9:56 UTC (permalink / raw) To: Thomas Gleixner Cc: Len Brown, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-api, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven * Thomas Gleixner: > Having a proper interface (syscall, prctl) which user space can use to > ask for permission and allocation of the necessary buffer(s) is clearly > avoiding the downsides and provides the necessary mechanisms for proper > control and failure handling. > > It's not the end of the world if something which wants to utilize this > has do issue a syscall during detection. It does not matter whether > that's a library or just the application code itself. > > That's a one off operation and every involved entity can cache the > result in TLS. I'm not sure if it's a good idea to have each AMX consumer to set up its own TLS cache. How expensive is checking XCR0 via XGETBV instead on the AMX path? Then AMX can be enabled on the thread via a system call. It also allows disabling of AMX. It would also need an AT_HWCAP2 feature flag telling user space that AMX support is available after that system call (switching on AMX to check whether AMX paths should enabled later seems potentially wasteful if the AMX paths are never taken after all). Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-17 9:56 ` Florian Weimer @ 2021-05-17 10:18 ` Thomas Gleixner 2021-05-21 16:29 ` Len Brown 1 sibling, 0 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-17 10:18 UTC (permalink / raw) To: Florian Weimer Cc: Len Brown, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-api, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven On Mon, May 17 2021 at 11:56, Florian Weimer wrote: > * Thomas Gleixner: > >> Having a proper interface (syscall, prctl) which user space can use to >> ask for permission and allocation of the necessary buffer(s) is clearly >> avoiding the downsides and provides the necessary mechanisms for proper >> control and failure handling. >> >> It's not the end of the world if something which wants to utilize this >> has do issue a syscall during detection. It does not matter whether >> that's a library or just the application code itself. >> >> That's a one off operation and every involved entity can cache the >> result in TLS. > > I'm not sure if it's a good idea to have each AMX consumer to set up its > own TLS cache. How expensive is checking XCR0 via XGETBV instead on the > AMX path? Then AMX can be enabled on the thread via a system call. Right, did not think about that. > It also allows disabling of AMX. That needs reference counting, but yes it's possible. > It would also need an AT_HWCAP2 feature flag telling user space that AMX > support is available after that system call (switching on AMX to check > whether AMX paths should enabled later seems potentially wasteful if the > AMX paths are never taken after all). Either that or just have: prctl(PR_QUERY_XSTATE_FEATURES,.... prctl(PR_ENABLE_XSTATE_FEATURES,.... prctl(PR_DISABLE_XSTATE_FEATURES,.... Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-17 9:56 ` Florian Weimer 2021-05-17 10:18 ` Thomas Gleixner @ 2021-05-21 16:29 ` Len Brown 1 sibling, 0 replies; 130+ messages in thread From: Len Brown @ 2021-05-21 16:29 UTC (permalink / raw) To: Florian Weimer Cc: Thomas Gleixner, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven On Mon, May 17, 2021 at 5:56 AM Florian Weimer <fweimer@redhat.com> wrote: > How expensive is checking XCR0 via XGETBV instead, on the AMX path? XGETBV takes about the same number of cycles as RDTSC (ie. it is relatively fast) Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
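The XGETBV check discussed in this sub-thread can be sketched in C as follows. This is an illustration only: it assumes a GCC or Clang toolchain on x86, that the AMX state components are XCR0 bits 17 (XTILECFG) and 18 (XTILEDATA), and that the ABI under discussion actually reflects per-thread AMX enablement in XCR0, which is a proposal in this thread and not an established guarantee. The function and macro names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XCR0_XTILECFG  (1ull << 17) /* AMX tile configuration state */
#define XCR0_XTILEDATA (1ull << 18) /* AMX tile data state */

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Read XCR0 with XGETBV. Returns false if the OS has not enabled XSAVE
 * (CPUID.1:ECX.OSXSAVE clear), in which case XGETBV would fault. */
static bool read_xcr0(uint64_t *xcr0)
{
    unsigned int eax, ebx, ecx, edx;
    uint32_t lo, hi;

    assert(xcr0 != NULL);
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    if (!(ecx & (1u << 27))) /* OSXSAVE */
        return false;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    *xcr0 = ((uint64_t)hi << 32) | lo;
    return true;
}
#else
/* Non-x86 build: there is no XCR0 to read. */
static bool read_xcr0(uint64_t *xcr0)
{
    assert(xcr0 != NULL);
    return false;
}
#endif

/* Convenience wrapper: XCR0, or 0 when it cannot be read. */
static uint64_t xcr0_or_zero(void)
{
    uint64_t v = 0;

    return read_xcr0(&v) ? v : 0;
}

/* True only if both AMX state components are enabled in the given XCR0. */
static bool amx_bits_enabled(uint64_t xcr0)
{
    const uint64_t mask = XCR0_XTILECFG | XCR0_XTILEDATA;

    return (xcr0 & mask) == mask;
}
```

A library would call amx_bits_enabled(xcr0_or_zero()) on its AMX path instead of keeping its own TLS flag, which is the trade-off Florian raises.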
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-17 9:45 ` Thomas Gleixner 2021-05-17 9:56 ` Florian Weimer @ 2021-05-17 13:49 ` Arjan van de Ven 2021-05-20 15:35 ` Len Brown 2 siblings, 0 replies; 130+ messages in thread From: Arjan van de Ven @ 2021-05-17 13:49 UTC (permalink / raw) To: Thomas Gleixner, Len Brown, Borislav Petkov Cc: Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-api, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer > Having a proper interface (syscall, prctl) which user space can use to > ask for permission and allocation of the necessary buffer(s) is clearly > avoiding the downsides and provides the necessary mechanisms for proper > control and failure handling. this would need to be a "get / put" interface, so a refcount; that way things nest nicely. For API symmetry I'd want to have the put there, even if we may decide to be infinitely lazy in cleaning up the state. it also would want it to take an argument that's a bitmask, so that this can be applied to future state as well. Eh actually I'd start with also adding AVX512 to this. Even though for obvious compat reasons that one is on by default (so at process start we might need to start with a count of 1) it's interesting to fold that into this same framework. (and who knows, dropping AVX512 state if you don't need it might improve context switches) Syscalls are relatively cheap (and I can imagine the C library doing a TLS cache of the count if it becomes an issue) so can be done on a relatively finegrained level. 
I've worked on OpenBLAS before, and that library basically has a global initialization function that ends up getting called on the first big math op (it may spawn threads as well etc) but which "stays around" for consecutive math functions; a get/put model would work quite well for such math library (since it's based on BLAS like almost all such math libraries, I expect this to be the common pattern) ^ permalink raw reply [flat|nested] 130+ messages in thread
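The get/put model with a feature bitmask described above can be modeled in user space roughly like this. Everything here is hypothetical: the struct, the function names and the request callback (a stand-in for whatever syscall or prctl the kernel would eventually provide) are invented for illustration, and the fake_kernel_request() stub only records what a real kernel call would have been asked to do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define XF_MAX 64 /* one counter per possible XSTATE feature bit */

/* Stand-in for the proposed kernel interface: enable != 0 requests the
 * features in the mask, enable == 0 releases them. Returns 0 or -errno. */
typedef int (*xstate_req_fn)(uint64_t features, int enable);

/* Per-thread reference counts, one per feature bit. A feature that is on
 * by default (the AVX-512 case above) would start with a count of 1. */
struct xstate_refs {
    unsigned int count[XF_MAX];
    xstate_req_fn request;
};

/* Take one reference on every feature in the mask. Only features whose
 * count goes from 0 to 1 are requested from the kernel. */
static int xstate_get(struct xstate_refs *r, uint64_t features)
{
    uint64_t need = 0;
    int i;

    assert(r != NULL && r->request != NULL);
    for (i = 0; i < XF_MAX; i++)
        if (((features >> i) & 1) && r->count[i] == 0)
            need |= 1ull << i;
    if (need) {
        int err = r->request(need, 1);
        if (err)
            return err; /* nothing was counted, the caller can fall back */
    }
    for (i = 0; i < XF_MAX; i++)
        if ((features >> i) & 1)
            r->count[i]++;
    return 0;
}

/* Drop one reference on every feature in the mask. The kernel is told only
 * when a count reaches 0, and it is free to treat that as a lazy no-op. */
static int xstate_put(struct xstate_refs *r, uint64_t features)
{
    uint64_t drop = 0;
    int i;

    assert(r != NULL && r->request != NULL);
    for (i = 0; i < XF_MAX; i++)
        if (((features >> i) & 1) && r->count[i] == 0)
            return -EINVAL; /* unbalanced put */
    for (i = 0; i < XF_MAX; i++)
        if (((features >> i) & 1) && --r->count[i] == 0)
            drop |= 1ull << i;
    if (drop)
        r->request(drop, 0);
    return 0;
}

/* Test double for the kernel side: remembers which features are enabled. */
static uint64_t fake_enabled;
static int fake_calls;

static int fake_kernel_request(uint64_t features, int enable)
{
    fake_calls++;
    if (enable)
        fake_enabled |= features;
    else
        fake_enabled &= ~features;
    return 0;
}
```

A BLAS-style library would call xstate_get() from its global initialization and xstate_put() on teardown; nested users (application plus library) simply stack their references.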
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-17 9:45 ` Thomas Gleixner 2021-05-17 9:56 ` Florian Weimer 2021-05-17 13:49 ` Arjan van de Ven @ 2021-05-20 15:35 ` Len Brown 2021-05-20 20:54 ` Thomas Gleixner 2 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-05-20 15:35 UTC (permalink / raw) To: Thomas Gleixner Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven Hi Thomas, On Mon, May 17, 2021 at 5:45 AM Thomas Gleixner <tglx@linutronix.de> wrote: > AMX (or whatever comes next) is nothing else than a device and it > just should be treated as such. The fact that it is not exposed > via a driver and a device node does not matter at all. TMM registers are part of the CPU architectural state. If TMM registers exist for one logical CPU, they exist for all CPUs -- including HT siblings. (Intel supports only homogeneous ISA) Ditto for the instructions that access and operate on TMM registers. One can reasonably predict, that like Intel has done for all other registers, there will be future instructions added to the ISA to operate on TMM registers, including in combination with non-TMM registers that are also part of the architectural state. It is an unfortunate word choice that some documentation calls the TMUL instruction an "accelerator". It isn't. It is part of the ISA, like any other instruction. I agree that a device interface may make sense for real accelerators that don't run x86 instructions, I don't see long term viability for attempting to carve a sub-set of x86 instructions into a device, particularly when the set of instructions will continue to evolve. 
> Not doing so requires this awkward buffer allocation issue via #NM with > all it's downsides; it's just wrong to force the kernel to manage > resources of a user space task without being able to return a proper > error code. The hardware #NM support for fault on first use is a feature to allow the OS to optimize space so that pages do not have to be dedicated to back registers unless/until they are actually used. There is absolutely no requirement that a particular OS take advantage of that feature. If you think that this optimization is awkward, we can easily delete/disable it and simply statically allocate buffers for all threads at initialization time. Though you'll have to convince me why the word "awkward" applies, rather than "elegant". Regarding error return for allocation failures. I'm not familiar with the use-case where vmalloc would be likely to fail today, and I'd be interested if anybody can detail that use-case. But even if there is none today, I grant that Linux could evolve to make vmalloc fail in the future, and so an interface to request pre-allocation of buffers is reasonable insurance. Chang has implemented this prctl in v5 of the TMUL patch series. > It also prevents fine grained control over access to this > functionality. As AMX is clearly a shared resource which is not per HT > thread (maybe not even per core) and it has impact on power/frequency it > is important to be able to restrict access on a per process/cgroup > scope. AMX is analogous to the multiplier used by AVX-512. The architectural state must exist on every CPU, including HT siblings. Today, the HT siblings share the same execution unit, and I have no reason to expect that will change. I thought we already addressed the FUD surrounding power/frequency. As with every kind of instruction -- those that use more power will leave less power for their peers, and there is a mechanism to track that power budget. 
I acknowledge that the mechanism was overly conservative and slow to recover in initial AVX-512 systems, and that issue persists even with the latest publicly available hardware today. I acknowledge that you do not trust that Intel has addressed this (for both AVX-512 and AMX) in the first hardware that supports AMX. > Having a proper interface (syscall, prctl) which user space can use to > ask for permission and allocation of the necessary buffer(s) is clearly > avoiding the downsides and provides the necessary mechanisms for proper > control and failure handling. > > It's not the end of the world if something which wants to utilize this > has do issue a syscall during detection. It does not matter whether > that's a library or just the application code itself. > > That's a one off operation and every involved entity can cache the > result in TLS. > > AVX512 has already proven that XSTATE management is fragile and error > prone, so we really have to stop this instead of creating yet another > half baked solution. We fixed the glibc ABI issue. It is available now and production release is this summer. Yes, it should have been addressed when AVX-512 was deployed. thanks Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 15:35 ` Len Brown @ 2021-05-20 20:54 ` Thomas Gleixner 2021-05-20 21:13 ` Dave Hansen 2021-05-20 21:22 ` Len Brown 0 siblings, 2 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-20 20:54 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven Len, On Thu, May 20 2021 at 11:35, Len Brown wrote: > On Mon, May 17, 2021 at 5:45 AM Thomas Gleixner <tglx@linutronix.de> wrote: > >> AMX (or whatever comes next) is nothing else than a device and it >> just should be treated as such. The fact that it is not exposed >> via a driver and a device node does not matter at all. > > TMM registers are part of the CPU architectural state. > If TMM registers exist for one logical CPU, they exist for all CPUs -- > including HT siblings. (Intel supports only homogeneous ISA) > > Ditto for the instructions that access and operate on TMM registers. > > One can reasonably predict, that like Intel has done for all other registers, > there will be future instructions added to the ISA to operate on TMM registers, > including in combination with non-TMM registers that are also part > of the architectural state. > > It is an unfortunate word choice that some documentation calls the > TMUL instruction an "accelerator". It isn't. It is part of the ISA, > like any other instruction. of course I know that it is an instruction and the register state is part of the per CPU architectural state. Though there is a fundamental difference between per logical CPU architectural state and per logical CPU resources and you know that as well as I do. IOW, that does not change the fact that AMX is a shared resource. That's true for AVX and that's also true for the architectural RNG, which is also accessed like "any other instruction". 
We've seen how well that works. That's the whole point. Because it's a shared resource which causes contention and also has side effects vs. power/thermal and state size this _is_ different from 'any other instruction'. > I agree that a device interface may make sense for real accelerators > that don't run x86 instructions, I don't see long term viability for attempting > to carve a sub-set of x86 instructions into a device, particularly when > the set of instructions will continue to evolve. Nobody asked for a device interface for AMX. All I asked for is a _mandatory_ "request usage" interface, e.g. prctl. Just for the record: Your "like any other instruction" argument is nothing else than a strawman. There exist instructions today which need OS assistance, e.g. the SGX related instructions, the upcoming TDX related ones, ENQCMD & al. Please tell me _why_ they are so different. They are part of the ISA and still are subject to fine grained (OS) control. >> Not doing so requires this awkward buffer allocation issue via #NM with >> all it's downsides; it's just wrong to force the kernel to manage >> resources of a user space task without being able to return a proper >> error code. > > The hardware #NM support for fault on first use is a feature to allow the OS > to optimize space so that pages do not have to be dedicated to back registers > unless/until they are actually used. > > There is absolutely no requirement that a particular > OS take advantage of that feature. If you think that this optimization is > awkward, we can easily delete/disable it and simply statically allocate buffers > for all threads at initialization time. Though you'll have to convince me > why the word "awkward" applies, rather than "elegant". It's not elegant. It's a hack to avoid rethinking the approach to this kind of features. But I have to admit that it's a cute hack and it even can be utilized for an access-request based solution. > Regarding error return for allocation failures. 
> > I'm not familiar with the use-case where vmalloc would be likely to fail today, > and I'd be interested if anybody can detail that use-case. It does not matter whether it's likely or not. Unlikely simply does not exist at cloud-scale. > But even if there is none today, I grate that Linux could evolve to make vmalloc > fail in the future, and so an interface to reqeust pre-allocation of buffers > is reasonable insurance. Chang has implemented this prctl in v5 > of the TMUL patch series. No, it's not a reasonable insurance, simply because it's not mandatory. >> It also prevents fine grained control over access to this >> functionality. As AMX is clearly a shared resource which is not per HT >> thread (maybe not even per core) and it has impact on power/frequency it >> is important to be able to restrict access on a per process/cgroup >> scope. > > AMX is analogous to the multiplier used by AVX-512. > The architectural state must exist on every CPU, including HT siblings. > Today, the HT siblings share the same execution unit, > and I have no reason to expect that will change. I'm well aware that HT siblings share the same execution unit for AVX. Though AMX is if I remember the discussions two years ago correctly shared by more than the HT siblings which makes things worse. > I thought we already addressed the FUD surrounding power/frequency. What's FUD here? The fact that AMX is a shared resource which has contention issues? The fact that AMX usage has an influence on power/frequency? If that's FUD by now, then your documentation needs an update. > As with every kind of instruction -- those that use > more power will leave less power for their peers, and there is a mechanism > to track that power budget. I acknowledge that the mechanism was overly > conservative and slow to recover in initial AVX-512 systems, and that issue > persists even with the latest publically available hardware today. 
> I acknowledge that you do not trust that Intel has addressed this > (for both AVX-512 and AMX) in the first hardware that supports AMX. It does not matter whether I trust Intel or not to get this right. It does neither matter whether there is a mechanism to track the budget or not. What matters is that the proposed #NM automatism simply prevents fine grained access control for a _shared_ resource which has implications on power and frequency and performance in general due to the fact that it's shared and causes contention. And because the #NM hack allows the world and its dog to use AMX any unprivileged user can utilize it. See the idea to use it for grep... You might want to talk to the people in your company who care about real-time and functional safety whether they think it's a good idea to allow unrestricted access to functionality which has an influence on the overall system behaviour with no other knob than to turn it off completely. Turn it off completely is not an option simply because there are valid use cases even in that area. >> Having a proper interface (syscall, prctl) which user space can use to >> ask for permission and allocation of the necessary buffer(s) is clearly >> avoiding the downsides and provides the necessary mechanisms for proper >> control and failure handling. >> >> It's not the end of the world if something which wants to utilize this >> has do issue a syscall during detection. It does not matter whether >> that's a library or just the application code itself. >> >> That's a one off operation and every involved entity can cache the >> result in TLS. >> >> AVX512 has already proven that XSTATE management is fragile and error >> prone, so we really have to stop this instead of creating yet another >> half baked solution. > > We fixed the glibc ABI issue. It is available now and production > release is this summer. That does not answer my questions at all. > Yes, it should have been addressed when AVX-512 was deployed. Correct. 
And in hindsight we should have insisted to have fine grained control over that back then, but that's water under the bridge. AMX and what's coming next is not. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 20:54 ` Thomas Gleixner @ 2021-05-20 21:13 ` Dave Hansen 2021-05-20 21:41 ` Len Brown 2021-05-20 21:22 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Dave Hansen @ 2021-05-20 21:13 UTC (permalink / raw) To: Thomas Gleixner, Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven On 5/20/21 1:54 PM, Thomas Gleixner wrote: >> Regarding error return for allocation failures. >> >> I'm not familiar with the use-case where vmalloc would be likely to fail today, >> and I'd be interested if anybody can detail that use-case. > It does not matter whether it's likely or not. Unlikely simply does not > exist at cloud-scale. Len, I may have led you astray in some of our discussions on this topic. Here are the cold hard facts: * vmalloc() can fail (the memory.kmem cgroup limit is probably the most likely place to be exposed to this) * vmalloc() failure in a fault (like #NM) will result in SIGSEGV * vmalloc() failure in a syscall can be handled with -ENOMEM In some of our discussions, I told you that reasonably-sized vmalloc()s don't practically fail and that we shouldn't be concerned with failure for our vmalloc()-in-#NM use-case. In other words, I'm OK with crashing apps at the point that vmalloc() is failing. However, Thomas was pretty clear that he's not OK with that. To paraphrase: if we can avoid expanding the scope of where memory allocation failures result in SIGSEGV, we should do it. While I don't *entirely* agree that it's worth it, I can respect Thomas's opinion here. It leads me in the direction of wanting to drive dynamic xstate vmalloc()s from an explicit syscall ABI. My apologies if I sent the AMX support on an unproductive tangent here. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 21:13 ` Dave Hansen @ 2021-05-20 21:41 ` Len Brown 2021-05-20 22:53 ` Dave Hansen 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-05-20 21:41 UTC (permalink / raw) To: Dave Hansen Cc: Thomas Gleixner, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven On Thu, May 20, 2021 at 5:13 PM Dave Hansen <dave.hansen@intel.com> wrote: > >> Regarding error return for allocation failures. ... > * vmalloc() can fail (the memory.kmem cgroup limit is probably the most > likely place to be exposed to this) > * vmalloc() failure in a fault (like #NM) will result in SIGSEGV > * vmalloc() failure in a syscall can be handled with -ENOMEM Thanks for clarifying this, Dave. We added the explicit-allocate to v5, which should be on the list by tomorrow. So the questions are: 1. who calls it -- a call/thread or process? the application? a library -- which library? 2. is it optional, or mandatory? 3. if it is mandatory, what is the best way to enforce it? 4. should we have a "release" system call too? 1. Every thread needs a context switch buffer. Does every thread make the system call? It seems sort of awkward for a library to always make a system call before doing a TMUL. It would be functionally harmless, but it would add latency to an otherwise low-latency operation. If some central library does it, and caches that it has done it before, then it would be ugly, but at least it would remove an unnecessary user/kernel transition. 2. If it is optional, then v5 is code complete -- because it allows you to allocate either explicitly via prctl, or transparently via #NM. 3. If it is mandatory, then we should re-purpose the XFD mechanism: app starts with XFD armed, by default if app touches AMX before prctl, it takes a signal (and dies). 
When app calls prctl, allocate buffer, disarm XFD for that app (exactly what #NM trap does today). 4. I don't see a justification for a release concept, but it is possible -- though sort of sticky with possible nested calls from combinations of apps and libraries. If that were sorted out by a central library, then the actual system call on the last release per thread would re-arm XFD to prevent access until the next explicit request. Unclear if it is important that the kernel actually do the free -- some things might run faster if we keep it around... Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 21:41 ` Len Brown @ 2021-05-20 22:53 ` Dave Hansen 2021-05-21 9:41 ` Thomas Gleixner 2021-05-21 14:44 ` Florian Weimer 0 siblings, 2 replies; 130+ messages in thread From: Dave Hansen @ 2021-05-20 22:53 UTC (permalink / raw) To: Len Brown Cc: Thomas Gleixner, Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven On 5/20/21 2:41 PM, Len Brown wrote: > So the questions are: > 1. who calls it -- a call/thread or process? the application? a > library -- which library? > 2. is it optional, or mandatory? > 3. if it is mandatory, what is the best way to enforce it? > 4. should we have a "release" system call too? > > 1. Every thread needs a context switch buffer. Does every thread make > the system call? It seems sort of awkward for a library to always > make a system call before doing a TMUL. It would be functionally > harmless, but it would add latency to an otherwise low-latency > operation. If some central library does it, and caches that it has > done it before, then it would be ugly, but at least it would remove an > unnecessary user/kernel transition. Our system calls are *REALLY* fast. We can even do a vsyscall for this if we want to get the overhead down near zero. Userspace can also cache the "I did the prctl()" state in thread-local storage if it wants to avoid the syscall. > 2. If it is optional, then v5 is code complete -- because it allows > you to allocate either explicitly via prtcl, or transparently via #NM. It needs to be mandatory. If it's not, then nobody will use it, and they'll suffer the dreaded SIGSEGV-on-vmalloc()-failure and start filing bug reports. > 3. If it is mandatory, then we should re-purpose the XFD mechanism: > app starts with XFD armed, by default > if app touches AMX before prctl, it takes a signal (and dies). 
> When app calls prctl, allocate buffer disarm XFD for that app (exactly > what #NM trap does today). Yes, that sounds like a good use of XFD. > 4. I don't see a justification for a release concept, but it is > possible -- though sort of sticky with possible nested calls from > combinations of apps and libraries. If that were sorted out by a > central library, then the actual system call on the last release per > thread would re-arm XFD to prevent access until the next explicit > request. Unclear if it is important that the kernel actually do the > free -- some things might run faster if we keep it around... I think this would be more of a get/put model rather than an allocate/free model. The "put" could effectively be a no-op for now. But, if we don't put this in the ABI up front, we can't add it later. That means that we could never add a lazy-free, even if we wanted to. ^ permalink raw reply [flat|nested] 130+ messages in thread
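The suggestion above, that userspace can cache the "I did the prctl()" state in thread-local storage, could be sketched roughly as follows. This is only an illustration: request_amx_from_kernel() is a hypothetical stand-in for whatever call the ABI ends up defining, and here it merely counts invocations so the effect of the cache is visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the real enable request. It only counts
 * calls; a real build would enter the kernel here. The counter is
 * process-wide, which is good enough for a single-threaded demo. */
static int fake_request_count;

static int request_amx_from_kernel(void)
{
    fake_request_count++;
    return 0;                       /* 0 = granted, nonzero = refused */
}

/* Per-thread cache of "I did the prctl()". */
static _Thread_local bool amx_enabled_for_thread;

/* Returns 0 if the calling thread may use AMX, nonzero otherwise. */
int ensure_amx_enabled(void)
{
    if (amx_enabled_for_thread)
        return 0;                   /* fast path: no kernel entry */

    int ret = request_amx_from_kernel();
    if (ret == 0)
        amx_enabled_for_thread = true;
    return ret;                     /* refusals are not cached */
}

/* Exposed only so the caching can be observed. */
int amx_request_count(void)
{
    return fake_request_count;
}
```

A library would call ensure_amx_enabled() before each TMUL sequence; only the first call per thread pays for the kernel transition, and a refusal is retried on the next call rather than remembered.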
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 22:53 ` Dave Hansen @ 2021-05-21 9:41 ` Thomas Gleixner 2021-05-21 14:44 ` Florian Weimer 1 sibling, 0 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-21 9:41 UTC (permalink / raw) To: Dave Hansen, Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven Dave, Len, On Thu, May 20 2021 at 15:53, Dave Hansen wrote: > On 5/20/21 2:41 PM, Len Brown wrote: >> So the questions are: >> 1. who calls it -- a call/thread or process? the application? a >> library -- which library? >> 2. is it optional, or mandatory? >> 3. if it is mandatory, what is the best way to enforce it? >> 4. should we have a "release" system call too? >> >> 1. Every thread needs a context switch buffer. Does every thread make >> the system call? It seems sort of awkward for a library to always >> make a system call before doing a TMUL. It would be functionally >> harmless, but it would add latency to an otherwise low-latency >> operation. If some central library does it, and caches that it has >> done it before, then it would be ugly, but at least it would remove an >> unnecessary user/kernel transition. > > Our system calls are *REALLY* fast. We can even do a vsyscall for this > if we want to get the overhead down near zero. Userspace can also cache > the "I did the prctl()" state in thread-local storage if it wants to > avoid the syscall. Correct. >> 2. If it is optional, then v5 is code complete -- because it allows >> you to allocate either explicitly via prtcl, or transparently via #NM. > > It needs to be mandatory. If it's not, then nobody will use it, and > they'll suffer the dreaded SIGSEGV-on-vmalloc()-failure and start filing > bug reports. Yes. Plus mandatory allows to do access control. IOW the prctl() can return EPERM. >> 3. 
If it is mandatory, then we should re-purpose the XFD mechanism: >> app starts with XFD armed, by default >> if app touches AMX before prctl, it takes a signal (and dies). Yes. >> When app calls prctl, allocate buffer disarm XFD for that app (exactly >> what #NM trap does today). > > Yes, that sounds like a good use of XFD. Agreed. >> 4. I don't see a justification for a release concept, but it is >> possible -- though sort of sticky with possible nested calls from >> combinations of apps and libraries. If that were sorted out by a >> central library, then the actual system call on the last release per >> thread would re-arm XFD to prevent access until the next explicit >> request. Unclear if it is important that the kernel actually do the >> free -- some things might run faster if we keep it around... > > I think would be more of a get/put model rather than an allocate/free model. > > The "put" could effectively be a noop for now. Yes. > But, if we don't put this in the ABI up front, we can't add it later. > That means that we could never add a lazy-free, even if we wanted to. As I said somewhere in that thread, something like: prctl(PR_QUERY_XSTATE_FEATURES,.... prctl(PR_ENABLE_XSTATE_FEATURES,.... prctl(PR_DISABLE_XSTATE_FEATURES,.... To make this work you need refcounting and the last put (DISABLE) drops the buffer and re-arms XFD. But of course an application/library can do the put late if it knows that it's going to use it over and over. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
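The refcounted get/put behaviour described above (the last put drops the buffer and re-arms XFD) can be modelled in a few lines. This is a userspace simulation under stated assumptions, not kernel code: struct xstate_sim and both functions are hypothetical names, and the fields only mirror the state the message talks about.

```c
#include <assert.h>

/* Simulated per-thread state; all names here are hypothetical. */
struct xstate_sim {
    int refcount;          /* outstanding "get" calls              */
    int buffer_allocated;  /* 1 while the large XSTATE buffer exists */
    int xfd_armed;         /* 1 while first use would trap (#NM)   */
};

/* A fresh thread starts with no buffer and XFD armed. */
#define XSTATE_SIM_INIT { 0, 0, 1 }

/* "get": models the ENABLE request. The first get allocates the
 * buffer and disarms XFD. Returns the new reference count. */
int xstate_get(struct xstate_sim *s)
{
    if (s->refcount == 0) {
        s->buffer_allocated = 1;
        s->xfd_armed = 0;
    }
    return ++s->refcount;
}

/* "put": models the DISABLE request. The last put drops the buffer
 * and re-arms XFD. Returns the new count, or -1 on an unbalanced put. */
int xstate_put(struct xstate_sim *s)
{
    if (s->refcount == 0)
        return -1;
    if (--s->refcount == 0) {
        s->buffer_allocated = 0;
        s->xfd_armed = 1;
    }
    return s->refcount;
}
```

Nested users (an application plus a library, say) each do their own get and put; the state is only torn down when the outermost user lets go, which is what makes a late put cheap for code that expects to use the feature repeatedly.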
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 22:53 ` Dave Hansen 2021-05-21 9:41 ` Thomas Gleixner @ 2021-05-21 14:44 ` Florian Weimer 2021-05-21 14:49 ` Peter Zijlstra 2021-05-21 16:14 ` Dave Hansen 1 sibling, 2 replies; 130+ messages in thread From: Florian Weimer @ 2021-05-21 14:44 UTC (permalink / raw) To: Dave Hansen via Libc-alpha Cc: Len Brown, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau * Dave Hansen via Libc-alpha: > Our system calls are *REALLY* fast. We can even do a vsyscall for this > if we want to get the overhead down near zero. Userspace can also cache > the "I did the prctl()" state in thread-local storage if it wants to > avoid the syscall. Why can't userspace look at XCR0 to make the decision? And we added an interface for querying x86 CPU features to glibc 2.33 which is completely incompatible with this because it assumes that CPU features do not change during the lifetime of a process. 8-( Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 14:44 ` Florian Weimer @ 2021-05-21 14:49 ` Peter Zijlstra 2021-06-23 15:06 ` Florian Weimer 2021-05-21 16:14 ` Dave Hansen 1 sibling, 1 reply; 130+ messages in thread From: Peter Zijlstra @ 2021-05-21 14:49 UTC (permalink / raw) To: Florian Weimer Cc: Dave Hansen via Libc-alpha, Len Brown, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21, 2021 at 04:44:58PM +0200, Florian Weimer wrote: > And we added an interface for querying x86 CPU features to glibc 2.33 > which is completely incompatible with this because it assumes that CPU > features do not change during the lifetime of a process. 8-( How many x86 kernel maintainers signed off on that patch? ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 14:49 ` Peter Zijlstra @ 2021-06-23 15:06 ` Florian Weimer 2021-06-23 23:11 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Florian Weimer @ 2021-06-23 15:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen via Libc-alpha, Len Brown, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau * Peter Zijlstra: > On Fri, May 21, 2021 at 04:44:58PM +0200, Florian Weimer wrote: > >> And we added an interface for querying x86 CPU features to glibc 2.33 >> which is completely incompatible with this because it assumes that CPU >> features do not change during the lifetime of a process. 8-( > > How many x86 kernel maintainers signed off on that patch? I've started a new thread: x86 CPU features detection for applications (and AMX) <https://lore.kernel.org/linux-api/87tulo39ms.fsf@oldenburg.str.redhat.com/> Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-23 15:06 ` Florian Weimer @ 2021-06-23 23:11 ` Len Brown 2021-06-28 10:14 ` Enrico Weigelt, metux IT consult 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-06-23 23:11 UTC (permalink / raw) To: Florian Weimer Cc: Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On Wed, Jun 23, 2021 at 11:07 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Peter Zijlstra: > > > On Fri, May 21, 2021 at 04:44:58PM +0200, Florian Weimer wrote: > > > >> And we added an interface for querying x86 CPU features to glibc 2.33 > >> which is completely incompatible with this because it assumes that CPU > >> features do not change during the lifetime of a process. 8-( > > > > How many x86 kernel maintainers signed off on that patch? > > I've started a new thread: > > x86 CPU features detection for applications (and AMX) > <https://lore.kernel.org/linux-api/87tulo39ms.fsf@oldenburg.str.redhat.com/> FWIW, I didn't receive it, because you excluded linux-kernel@vger.kernel.org (x86@kernel.org is just the actual x86 kernel committers, ISTR) Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-23 23:11 ` Len Brown @ 2021-06-28 10:14 ` Enrico Weigelt, metux IT consult 2021-06-28 12:49 ` Florian Weimer 0 siblings, 1 reply; 130+ messages in thread From: Enrico Weigelt, metux IT consult @ 2021-06-28 10:14 UTC (permalink / raw) To: Len Brown, Florian Weimer Cc: Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On 24.06.21 01:11, Len Brown wrote: >> x86 CPU features detection for applications (and AMX) >> <https://lore.kernel.org/linux-api/87tulo39ms.fsf@oldenburg.str.redhat.com/> > > FWIW, I didn't receive it, because you excluded > > linux-kernel@vger.kernel.org me neither :( Maybe just repost it to LKML ? You mention the interface *was* designed with cpu features remaining constant over a process' lifetime. Between the line I'm reading that this might not be the case anymore. How could that happen ? Process migration on a different CPU (or perhaps on a different host) ? This is gonna be tricky, because this somehow needs to be synchronized with the application (even if we check the bits before calling some cpu-specific opcode, which also has some performance cost), there's still some window where the application might not yet recognize the change. So either we need some explicit migration points (where app tells, please let me finish this func first) or transparent emulation. Damn, how could the cpu designers come up with such weird concepts in the first place ? :o --mtx -- --- Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren GPG/PGP-Schlüssel zu. 
--- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering info@metux.net -- +49-151-27565287 ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-28 10:14 ` Enrico Weigelt, metux IT consult @ 2021-06-28 12:49 ` Florian Weimer 2021-06-30 12:22 ` Enrico Weigelt, metux IT consult 0 siblings, 1 reply; 130+ messages in thread From: Florian Weimer @ 2021-06-28 12:49 UTC (permalink / raw) To: Enrico Weigelt, metux IT consult Cc: Len Brown, Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau * Enrico Weigelt: > On 24.06.21 01:11, Len Brown wrote: >>> x86 CPU features detection for applications (and AMX) >>> <https://lore.kernel.org/linux-api/87tulo39ms.fsf@oldenburg.str.redhat.com/> >> FWIW, I didn't receive it, because you excluded >> linux-kernel@vger.kernel.org > > me neither :( > > Maybe just repost it to LKML ? Isn't it sufficient to start Cc:ing the list? > You mention the interface *was* designed with cpu features remaining > constant over a process' lifetime. Between the line I'm reading that > this might not be the case anymore. > > How could that happen ? Process migration on a different CPU (or perhaps > on a different host) ? AMX will be shown as enabled in the hardware, but trap into the kernel on first use. The kernel developers prefer a model where it is checked that the process has previously enabled the feature explicitly, instead relying on lazy initialization as part of the trap (as intended by the hardware design). This means that the usual CPUID/XCR0 approach (which is reflected in the glibc feature) will not work. Now it turns out that we can still support this in glibc because of the pointer indirection, but only if the kernel provides a bit we can read in thread-specific data. > Damn, how could the cpu designers come up with such weird concepts > in the first place ? :o It's not the CPU designers. The CPU behaves according to the old model. 
(I consider the old model a success, despite all the challenges, but not everyone agrees, obviously.) Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-28 12:49 ` Florian Weimer @ 2021-06-30 12:22 ` Enrico Weigelt, metux IT consult 2021-06-30 12:41 ` Willy Tarreau 2021-06-30 13:55 ` Arjan van de Ven 0 siblings, 2 replies; 130+ messages in thread From: Enrico Weigelt, metux IT consult @ 2021-06-30 12:22 UTC (permalink / raw) To: Florian Weimer Cc: Len Brown, Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On 28.06.21 14:49, Florian Weimer wrote: > * Enrico Weigelt: > >> On 24.06.21 01:11, Len Brown wrote: >>>> x86 CPU features detection for applications (and AMX) >>>> <https://lore.kernel.org/linux-api/87tulo39ms.fsf@oldenburg.str.redhat.com/> >>> FWIW, I didn't receive it, because you excluded >>> linux-kernel@vger.kernel.org >> >> me neither :( >> >> Maybe just repost it to LKML ? > > Isn't it sufficient to start Cc:ing the list? Well, in that case people probably missed the original mail. (maybe, I'm too lazy for searching the web for archives ... :P) >> You mention the interface *was* designed with cpu features remaining >> constant over a process' lifetime. Between the line I'm reading that >> this might not be the case anymore. >> >> How could that happen ? Process migration on a different CPU (or perhaps >> on a different host) ? > > AMX will be shown as enabled in the hardware, but trap into the kernel > on first use. The kernel developers prefer a model where it is checked > that the process has previously enabled the feature explicitly, instead > relying on lazy initialization as part of the trap (as intended by the > hardware design). This means that the usual CPUID/XCR0 approach (which > is reflected in the glibc feature) will not work. 
Ah, now I'm beginning to get it: * this feature needs to be initialized first, before it can be used * on first use (when not initialized yet), it traps into the kernel * we don't want to always initialize it at boot Correct ? What I'm wondering: why shall the process explicitly ask for it and why isn't the initialization be done either on bootup or on first use ? >> Damn, how could the cpu designers come up with such weird concepts >> in the first place ? :o > > It's not the CPU designers. The CPU behaves according to the old model. > (I consider the old model a success, despite all the challenges, but not > everyone agrees, obviosly.) I'm still claiming already this old model is a horrible misdesign and (most of) the extensions made over the decades are anything but well designed - there had been many changes to do it much, much better. For example there would have been ways to introduce new opcodes in a way that they can be easily emulated in kernel or userland, w/o going through a full trap. But that's gonna be a long discussion on its own, probably getting offtopic here. --mtx -- --- Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren GPG/PGP-Schlüssel zu. --- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering info@metux.net -- +49-151-27565287 ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-30 12:22 ` Enrico Weigelt, metux IT consult @ 2021-06-30 12:41 ` Willy Tarreau 2021-06-30 13:55 ` Arjan van de Ven 1 sibling, 0 replies; 130+ messages in thread From: Willy Tarreau @ 2021-06-30 12:41 UTC (permalink / raw) To: Enrico Weigelt, metux IT consult Cc: Florian Weimer, Len Brown, Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven On Wed, Jun 30, 2021 at 02:22:19PM +0200, Enrico Weigelt, metux IT consult wrote: > Ah, now I'm beginning to get it: > > * this feature needs to be initialized first, before it can be used > * on first use (when not initialized yet), it traps into the kernel > * we don't want to always initialize it at boot > > Correct ? Not exactly. It's available but comes with a huge context-switch cost for each task using it. > What I'm wondering: why shall the process explicitly ask for it and > why isn't the initialization be done either on bootup or on first use ? The whole discussion about the pros and cons is archived here: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com/ > I'm still claiming already this old model is a horrible misdesign and > (most of) the extensions made over the decades are anything but well > designed - there had been many changes to do it much, much better. > For example there would have been ways to introduce new opcodes in a way > that they can be easily emulated in kernel or userland, w/o going > through a full trap. It's not a matter of opcodes but of context switch cost which not everyone wants to inflict to every single task that opportunistically uses these instructions without realizing what this subsequently implies for the rest of their life. All this is discussed in the thread above. 
I don't remember seeing anybody criticize the choice of instruction encoding hence it's irrelevant to this discussion. Hoping this helps, Willy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-30 12:22 ` Enrico Weigelt, metux IT consult 2021-06-30 12:41 ` Willy Tarreau @ 2021-06-30 13:55 ` Arjan van de Ven 2021-06-30 15:20 ` Len Brown 2021-06-30 15:25 ` Enrico Weigelt, metux IT consult 1 sibling, 2 replies; 130+ messages in thread From: Arjan van de Ven @ 2021-06-30 13:55 UTC (permalink / raw) To: Enrico Weigelt, metux IT consult, Florian Weimer Cc: Len Brown, Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Willy Tarreau On 6/30/2021 5:22 AM, Enrico Weigelt, metux IT consult wrote: >> >> AMX will be shown as enabled in the hardware, but trap into the kernel >> on first use. The kernel developers prefer a model where it is checked >> that the process has previously enabled the feature explicitly, instead >> relying on lazy initialization as part of the trap (as intended by the >> hardware design). This means that the usual CPUID/XCR0 approach (which >> is reflected in the glibc feature) will not work. > > Ah, now I'm beginning to get it: > > * this feature needs to be initialized first, before it can be used > * on first use (when not initialized yet), it traps into the kernel > * we don't want to always initialize it at boot > > Correct ? not really, the init is PER PROCESS and then there is a per thread 8Kb state allocation that needs to be context switched/etc once you actually use AMX. > > What I'm wondering: why shall the process explicitly ask for it and > why isn't the initialization be done either on bootup or on first use ? the kernel needs to be able to say "no" in a graceful way, there are several scenarios (from the sysadmin wanting to manage power/performance/resources to outright compatibility where the kernel wants or needs to say "no". 
The most obvious example: if a process asked for a sigaltstack, we can't let the process use AMX, since that stack will most likely be too small to hold the stack frame.) If you do this on "first use of the instruction", there is no graceful way to say "no". ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-30 13:55 ` Arjan van de Ven @ 2021-06-30 15:20 ` Len Brown 2021-06-30 15:25 ` Enrico Weigelt, metux IT consult 1 sibling, 0 replies; 130+ messages in thread From: Len Brown @ 2021-06-30 15:20 UTC (permalink / raw) To: Arjan van de Ven Cc: Enrico Weigelt, metux IT consult, Florian Weimer, Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Willy Tarreau The latest proposal for kernel AMX support (updated today) is here: https://lore.kernel.org/lkml/20210630060226.24652-1-chang.seok.bae@intel.com/ The main challenge for AMX is not context switch performance. Hardware recognizes INIT state (the common case) and skips that data transfer when it is not needed. The main challenge for AMX is compatibility. Specifically, user signal stack growth. The legacy ABI is that we put an uncompacted XSTATE image on the signal stack. In the default stack case, this isn't a problem, but when a user allocates an alternative signal stack, the 8K of XSTATE growth that AMX adds can exceed what the user allocated. The new system call tells the kernel that the application can handle it. (It can do this by not using sigaltstack, or by using the updated stack size advertised by glibc 2.34 and later, or by some other means.) ^ permalink raw reply [flat|nested] 130+ messages in thread
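One way an application might pick up "the updated stack size advertised by glibc 2.34 and later" is to ask at run time instead of trusting the compile-time constant. The sketch below assumes the sysconf(_SC_SIGSTKSZ) query that glibc 2.34 added and falls back to SIGSTKSZ where it is missing; altstack_size() and install_altstack() are illustrative names, not part of any proposal in this thread.

```c
#define _DEFAULT_SOURCE          /* expose sigaltstack() under strict -std= */
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Size to use for an alternate signal stack. Prefer the run-time value,
 * which can account for large XSTATE components; otherwise fall back to
 * the legacy compile-time constant. */
size_t altstack_size(void)
{
    long sz = -1;
#ifdef _SC_SIGSTKSZ
    sz = sysconf(_SC_SIGSTKSZ);  /* available with glibc 2.34 and later */
#endif
    if (sz <= 0)
        sz = SIGSTKSZ;           /* legacy constant */
    return (size_t)sz;
}

/* Allocate and register an alternate signal stack for the calling
 * thread. Returns 0 on success, -1 on failure. The stack is left
 * allocated for the lifetime of the thread. */
int install_altstack(void)
{
    stack_t ss;

    ss.ss_size = altstack_size();
    ss.ss_flags = 0;
    ss.ss_sp = malloc(ss.ss_size);
    if (ss.ss_sp == NULL)
        return -1;

    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

Code that hard-codes an 8 KiB buffer for sigaltstack() is exactly the case the compatibility concern is about; sizing the buffer from a run-time query is one of the ways an application can honestly tell the kernel it can handle the larger frame.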
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-06-30 13:55 ` Arjan van de Ven 2021-06-30 15:20 ` Len Brown @ 2021-06-30 15:25 ` Enrico Weigelt, metux IT consult 1 sibling, 0 replies; 130+ messages in thread From: Enrico Weigelt, metux IT consult @ 2021-06-30 15:25 UTC (permalink / raw) To: Arjan van de Ven, Florian Weimer Cc: Len Brown, Peter Zijlstra, Dave Hansen via Libc-alpha, Dave Hansen, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Willy Tarreau On 30.06.21 15:55, Arjan van de Ven wrote: >> * this feature needs to be initialized first, before it can be used >> * on first use (when not initialized yet), it traps into the kernel >> * we don't want to always initialize it at boot >> >> Correct ? > > not really, the init is PER PROCESS IIRC there had been some discussion here whether it should be done per thread. But now that I've learned that the major problem is saving the register state, I wouldn't dare thinking about how a working per-thread solution really would need to look like :o (by the way: is sighandler stack per thread or per process ?) > the kernel needs to be able to say "no" in a graceful way, there are > several scenarios > (from the sysadmin wanting to manage power/performance/resources to > outright compatibility where > the kernel wants or needs to say "no". Most obvious example: if a > process asked for an sigaltstack, > we can't let the process use AMX since that stack will be too small most > likely to hold > the stackframe) Ah okay, when I wrote that mail, didn't know yet that so much state needs to be saved. --mtx -- --- Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren GPG/PGP-Schlüssel zu. 
--- Enrico Weigelt, metux IT consult Free software and Linux embedded engineering info@metux.net -- +49-151-27565287 ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 14:44 ` Florian Weimer 2021-05-21 14:49 ` Peter Zijlstra @ 2021-05-21 16:14 ` Dave Hansen 2021-05-21 16:19 ` Florian Weimer 1 sibling, 1 reply; 130+ messages in thread From: Dave Hansen @ 2021-05-21 16:14 UTC (permalink / raw) To: Florian Weimer, Dave Hansen via Libc-alpha Cc: Len Brown, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On 5/21/21 7:44 AM, Florian Weimer wrote: > * Dave Hansen via Libc-alpha: >> Our system calls are *REALLY* fast. We can even do a vsyscall for this >> if we want to get the overhead down near zero. Userspace can also cache >> the "I did the prctl()" state in thread-local storage if it wants to >> avoid the syscall. > Why can't userspace look at XCR0 to make the decision? The thing we're trying to avoid is a #NM exception from XFD (the new first-use detection feature) that occurs on the first use of AMX. XCR0 will have XCR0[AMX]=1, even if XFD is "armed" and ready to generate the #NM. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 16:14 ` Dave Hansen @ 2021-05-21 16:19 ` Florian Weimer 2021-05-21 16:26 ` Len Brown ` (3 more replies) 0 siblings, 4 replies; 130+ messages in thread From: Florian Weimer @ 2021-05-21 16:19 UTC (permalink / raw) To: Dave Hansen Cc: Dave Hansen via Libc-alpha, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau * Dave Hansen: > On 5/21/21 7:44 AM, Florian Weimer wrote: >> * Dave Hansen via Libc-alpha: >>> Our system calls are *REALLY* fast. We can even do a vsyscall for this >>> if we want to get the overhead down near zero. Userspace can also cache >>> the "I did the prctl()" state in thread-local storage if it wants to >>> avoid the syscall. >> Why can't userspace look at XCR0 to make the decision? > > The thing we're trying to avoid is a #NM exception from XFD (the new > first-use detection feature) that occurs on the first use of AMX. > XCR0 will have XCR0[AMX]=1, even if XFD is "armed" and ready to > generate the #NM. I see. So essentially the hardware wants to offer transparent initialize-on-use, but Linux does not seem to want to implement it this way. Is there still a chance to bring the hardware and Linux into alignment? Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 16:19 ` Florian Weimer @ 2021-05-21 16:26 ` Len Brown 2021-05-21 16:28 ` Dave Hansen ` (2 subsequent siblings) 3 siblings, 0 replies; 130+ messages in thread From: Len Brown @ 2021-05-21 16:26 UTC (permalink / raw) To: Florian Weimer Cc: Dave Hansen, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21, 2021 at 12:19 PM Florian Weimer <fweimer@redhat.com> wrote: > I see. So essentially the hardware wants to offer transparent > initialize-on-use, but Linux does not seem to want to implement it this > way. That is a reasonable summary. > Is there still a chance to bring the hardware and Linux into alignment? The hardware was done some time ago, so this is a Linux decision. The current trajectory is that for user space to use TMUL it must 1. query CPUID to see if the instructions exist 2. query xgetbv(XCR0) to see if the OS supports the state 3. Tell Linux that this thread wants to use the state. The original proposal required just #1 and #2. It is clear that Linux can not support that, and so #3 is being added. Of course, if #2 is false, then Linux will return failure for #3, so technically you could skip that check and just make this new syscall. Probably user-space will still need to query CPUID for the instructions, since there will be a many-to-one mapping of instructions to state. Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
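The three checks listed above can be written as one predicate. To keep the sketch independent of the hardware it runs on, the CPUID and XGETBV results are passed in as plain values; the bit positions used (CPUID.(EAX=7,ECX=0):EDX[24] for AMX-TILE, XCR0 bits 17 and 18 for the tile configuration and tile data states) are the architectural ones, while amx_usable() and its kernel_granted parameter are hypothetical and simply stand for the outcome of step 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID7_EDX_AMX_TILE (1u << 24)    /* CPUID.(EAX=7,ECX=0):EDX[24] */
#define XCR0_XTILECFG       (1ull << 17)  /* tile configuration state    */
#define XCR0_XTILEDATA      (1ull << 18)  /* tile data state             */

/*
 * cpuid7_edx     - EDX from CPUID leaf 7, subleaf 0          (step 1)
 * xcr0           - value returned by XGETBV with ECX=0       (step 2)
 *                  (callers must confirm OSXSAVE before using XGETBV)
 * kernel_granted - whether the new request to Linux succeeded (step 3)
 */
bool amx_usable(uint32_t cpuid7_edx, uint64_t xcr0, bool kernel_granted)
{
    const uint64_t tile_state = XCR0_XTILECFG | XCR0_XTILEDATA;

    if (!(cpuid7_edx & CPUID7_EDX_AMX_TILE))
        return false;               /* instructions do not exist */
    if ((xcr0 & tile_state) != tile_state)
        return false;               /* OS does not manage the state */
    return kernel_granted;          /* XFD may still be armed otherwise */
}
```

The last line is the point of the discussion: with XFD armed, the first two checks pass and the first tile instruction still traps, so the CPUID/XCR0 pair alone is no longer sufficient.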
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 16:19 ` Florian Weimer 2021-05-21 16:26 ` Len Brown @ 2021-05-21 16:28 ` Dave Hansen 2021-05-21 16:31 ` Andy Lutomirski 2021-05-21 19:05 ` Thomas Gleixner 3 siblings, 0 replies; 130+ messages in thread From: Dave Hansen @ 2021-05-21 16:28 UTC (permalink / raw) To: Florian Weimer Cc: Dave Hansen via Libc-alpha, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On 5/21/21 9:19 AM, Florian Weimer wrote: >> On 5/21/21 7:44 AM, Florian Weimer wrote: >>> * Dave Hansen via Libc-alpha: >>>> Our system calls are *REALLY* fast. We can even do a vsyscall for this >>>> if we want to get the overhead down near zero. Userspace can also cache >>>> the "I did the prctl()" state in thread-local storage if it wants to >>>> avoid the syscall. >>> Why can't userspace look at XCR0 to make the decision? >> >> The thing we're trying to avoid is a #NM exception from XFD (the new >> first-use detection feature) that occurs on the first use of AMX. >> XCR0 will have XCR0[AMX]=1, even if XFD is "armed" and ready to >> generate the #NM. > > I see. So essentially the hardware wants to offer transparent > initialize-on-use, but Linux does not seem to want to implement it this > way. I don't quite see it that way. The hardware wants to offer the OS a guarantee that it will know *BEFORE* an application tried to establish specific register state. An OS could implement relatively transparent XSAVE backing resizing with it, like the earlier AMX patches did. Or, the OS could use it to implement a nice, immediate thwack if the app misbehaves and violates the ABI, like we're moving toward now. > Is there still a chance to bring the hardware and Linux into alignment? I think they're aligned just fine. 
XFD might be a bit overblown as a feature for how Linux will use it, but other OSes might get some mileage out of it. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 16:19 ` Florian Weimer 2021-05-21 16:26 ` Len Brown 2021-05-21 16:28 ` Dave Hansen @ 2021-05-21 16:31 ` Andy Lutomirski 2021-05-21 19:10 ` Thomas Gleixner 2021-05-21 19:05 ` Thomas Gleixner 3 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-05-21 16:31 UTC (permalink / raw) To: Florian Weimer, Dave Hansen Cc: Dave Hansen via Libc-alpha, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Thomas Gleixner, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21, 2021, at 9:19 AM, Florian Weimer wrote: > * Dave Hansen: > > > On 5/21/21 7:44 AM, Florian Weimer wrote: > >> * Dave Hansen via Libc-alpha: > >>> Our system calls are *REALLY* fast. We can even do a vsyscall for this > >>> if we want to get the overhead down near zero. Userspace can also cache > >>> the "I did the prctl()" state in thread-local storage if it wants to > >>> avoid the syscall. > >> Why can't userspace look at XCR0 to make the decision? > > > > The thing we're trying to avoid is a #NM exception from XFD (the new > > first-use detection feature) that occurs on the first use of AMX. > > XCR0 will have XCR0[AMX]=1, even if XFD is "armed" and ready to > > generate the #NM. > > I see. So essentially the hardware wants to offer transparent > initialize-on-use, but Linux does not seem to want to implement it this > way. > > Is there still a chance to bring the hardware and Linux into alignment? arch_prctl(SET_XSTATE_INIT_ON_FIRST_USE, TILE_STUFF);? As long as this is allowed to fail, I don’t have a huge problem with it. I think several things here are regrettable: 1. Legacy XSTATE code might assume that XCR0 is a constant. 2. Intel virt really doesn’t like us context switching XCR0, although we might say that this is Intel’s fault and therefore Intel’s problem. AMD hardware doesn’t appear to have this issue. 
3. AMX being tangled up in XSTATE is unfortunate. The whole XSTATE mechanism is less than amazing. IMO the best we can make of this whole situation is to make XCR0 dynamic, but the legacy compatibility issues are potentially problematic. > > Thanks, > Florian > > ^ permalink raw reply [flat|nested] 130+ messages in thread
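A minimal sketch of what "allowed to fail" could look like from user space. Everything here is an assumption for illustration: the request code and mask values are invented (the message only names them informally as SET_XSTATE_INIT_ON_FIRST_USE and TILE_STUFF), and fake_arch_prctl() is a stand-in for the proposed call so that the sketch runs on any machine.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical values: the thread does not define numbers for these. */
#define SET_XSTATE_INIT_ON_FIRST_USE 0x7001      /* assumed request code */
#define TILE_STUFF                   (1UL << 18) /* assumed feature mask */

/* Stand-in for kernel/admin policy, so the refusal path can be exercised. */
static bool admin_allows_amx = false;

/* Stand-in for the proposed arch_prctl(): returns 0 on success, or -1 with
 * errno set, which is the usual user-space view of a failing syscall. */
int fake_arch_prctl(int code, unsigned long features)
{
    if (code != SET_XSTATE_INIT_ON_FIRST_USE) {
        errno = EINVAL;
        return -1;
    }
    if ((features & TILE_STUFF) && !admin_allows_amx) {
        errno = EPERM; /* the OS stays in control and may simply say no */
        return -1;
    }
    return 0;
}

/* Ask once. On refusal the caller must stay on a non-AMX code path. */
bool request_amx(void)
{
    if (fake_arch_prctl(SET_XSTATE_INIT_ON_FIRST_USE, TILE_STUFF) == 0)
        return true;

    int saved = errno; /* capture before stdio can disturb it */
    fprintf(stderr, "AMX not granted (errno=%d), using fallback\n", saved);
    return false;
}
```

The point of the shape is that refusal is an ordinary, checkable return value rather than a fault at first use.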
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 16:31 ` Andy Lutomirski @ 2021-05-21 19:10 ` Thomas Gleixner 2021-05-21 20:07 ` Andy Lutomirski 2021-05-21 22:07 ` Len Brown 0 siblings, 2 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-21 19:10 UTC (permalink / raw) To: Andy Lutomirski, Florian Weimer, Dave Hansen Cc: Dave Hansen via Libc-alpha, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21 2021 at 09:31, Andy Lutomirski wrote: > arch_prctl(SET_XSTATE_INIT_ON_FIRST_USE, TILE_STUFF);? > > As long as this is allowed to fail, I don’t have a huge problem with > it. I'm fine with that. It's still controlled by the OS and can return -EPERM. If allowed then the application would also accept to be insta killed if that #NM allocation fails. Any bug report vs. that will be ignored. > I think several things here are regrettable: > > 1. Legacy XSTATE code might assume that XCR0 is a constant. > > 2. Intel virt really doesn’t like us context switching XCR0, although > we might say that this is Intel’s fault and therefore Intel’s > problem. AMD hardware doesn’t appear to have this issue. > > 3. AMX bring tangled up in XSTATE is unfortunate. The whole XSTATE > mechanism is less than amazing. > > IMO the best we can make of this whole situation is to make XCR0 > dynamic, but the legacy compatibility issues are potentially > problematic. Why? The bit can be enabled and #NM catches the violation of the ABI contract if the application did not request usage. No XCR0 fiddling on context switch required. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 19:10 ` Thomas Gleixner @ 2021-05-21 20:07 ` Andy Lutomirski 2021-05-21 21:43 ` Thomas Gleixner 2021-05-21 22:07 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-05-21 20:07 UTC (permalink / raw) To: Thomas Gleixner, Florian Weimer, Dave Hansen Cc: Dave Hansen via Libc-alpha, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21, 2021, at 12:10 PM, Thomas Gleixner wrote: > On Fri, May 21 2021 at 09:31, Andy Lutomirski wrote: > > arch_prctl(SET_XSTATE_INIT_ON_FIRST_USE, TILE_STUFF);? > > > > As long as this is allowed to fail, I don’t have a huge problem with > > it. > > I'm fine with that. It's still controlled by the OS and can return > -EPERM. > > If allowed then the application would also accept to be insta killed if > that #NM allocation fails. Any bug report vs. that will be ignored. > > > I think several things here are regrettable: > > > > 1. Legacy XSTATE code might assume that XCR0 is a constant. > > > > 2. Intel virt really doesn’t like us context switching XCR0, although > > we might say that this is Intel’s fault and therefore Intel’s > > problem. AMD hardware doesn’t appear to have this issue. > > > > 3. AMX bring tangled up in XSTATE is unfortunate. The whole XSTATE > > mechanism is less than amazing. > > > > IMO the best we can make of this whole situation is to make XCR0 > > dynamic, but the legacy compatibility issues are potentially > > problematic. > > Why? The bit can be enabled and #NM catches the violation of the ABI > contract if the application did not request usage. No XCR0 fiddling on > context switch required. > > Thanks, > > tglx > > > XFD does nothing about signals. It also doesn’t help give applications a non-Linux-specific way to ask if AMX is available. 
The SDM says that one can read XCR0. Sure, we can use it, but cross platform libraries seem likely to get it wrong. ^ permalink raw reply [flat|nested] 130+ messages in thread
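To make the XCR0 concern concrete, here is a small model of the two checks. The bit positions for the AMX tile states (XCR0 bits 17 and 18) are architectural; the function names are made up for this sketch, and the xfd argument is purely illustrative, since user space cannot read the XFD MSR. That is exactly why a library that trusts XCR0 alone can be surprised by #NM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* XSAVE state-component bits for AMX. */
#define XFEATURE_MASK_XTILECFG  (UINT64_C(1) << 17)
#define XFEATURE_MASK_XTILEDATA (UINT64_C(1) << 18)
#define AMX_MASK (XFEATURE_MASK_XTILECFG | XFEATURE_MASK_XTILEDATA)

/* What a cross-platform library following the SDM might do:
 * treat "both tile bits set in XCR0" as "AMX is available". */
bool amx_enabled_in_xcr0(uint64_t xcr0)
{
    return (xcr0 & AMX_MASK) == AMX_MASK;
}

/* What would actually be needed to avoid a fault: XCR0 enabled AND XFD not
 * armed for tile data. Only the kernel can see xfd, so this is a model of
 * the condition, not something user space can evaluate directly. */
bool amx_usable_without_fault(uint64_t xcr0, uint64_t xfd)
{
    return amx_enabled_in_xcr0(xcr0) &&
           (xfd & XFEATURE_MASK_XTILEDATA) == 0;
}
```

With XCR0 showing both tile bits and XFD armed, the first function says yes while the second says no, which is the mismatch described in the message above.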
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 20:07 ` Andy Lutomirski @ 2021-05-21 21:43 ` Thomas Gleixner 0 siblings, 0 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-21 21:43 UTC (permalink / raw) To: Andy Lutomirski, Florian Weimer, Dave Hansen Cc: Dave Hansen via Libc-alpha, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21 2021 at 13:07, Andy Lutomirski wrote: > On Fri, May 21, 2021, at 12:10 PM, Thomas Gleixner wrote: >> Why? The bit can be enabled and #NM catches the violation of the ABI >> contract if the application did not request usage. No XCR0 fiddling on >> context switch required. > > XFD does nothing about signals. It's a matter of what's implemented in #NM. XFD just arms #NM > It also doesn’t help give applications a non-Linux-specific way to ask > if AMX is available. The SDM says that one can read XCR0. Sure, we > can use it, but cross platform libraries seem likely to get it wrong. Well, that's the inevitable consequence of Intel declaring that everything needs to be exposed unconditionally for the very wrong reasons. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 19:10 ` Thomas Gleixner 2021-05-21 20:07 ` Andy Lutomirski @ 2021-05-21 22:07 ` Len Brown 2021-05-21 22:46 ` Thomas Gleixner 2021-05-21 23:06 ` Dave Hansen 1 sibling, 2 replies; 130+ messages in thread From: Len Brown @ 2021-05-21 22:07 UTC (permalink / raw) To: Thomas Gleixner Cc: Andy Lutomirski, Florian Weimer, Dave Hansen, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21, 2021 at 3:10 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Fri, May 21 2021 at 09:31, Andy Lutomirski wrote: > > arch_prctl(SET_XSTATE_INIT_ON_FIRST_USE, TILE_STUFF);? > > > > As long as this is allowed to fail, I don’t have a huge problem with > > it. > > I'm fine with that. It's still controlled by the OS and can return > -EPERM. > > If allowed then the application would also accept to be insta killed if > that #NM allocation fails. Any bug report vs. that will be ignored. Regarding pre-allocation vs on-demand allocation, consider two scenarios: 1. Synchronous. At process or thread start up time, prctl() synchronously allocates 8K context switch buffers. Return code is 0 -- good go go! 10 seconds later the program decides to create additional threads. Woops. vmalloc failed, and the process synchronously dies. bug filed. 2. On demand. Same scenario, except vmalloc failure upon creation of those additional threads sends a SIGSEGV at the instruction where AMX is touched. bug filed. Why ignore the 2nd bug and not ignore the 1st bug? My concern about synchronous allocation is that it will be very easy to abuse. programs and threads can ask for buffers they will never use. With on-demand allocation, we allocate buffers only if they are actually needed. 
Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 22:07 ` Len Brown @ 2021-05-21 22:46 ` Thomas Gleixner 2021-05-21 23:31 ` Len Brown 2021-05-21 23:06 ` Dave Hansen 1 sibling, 1 reply; 130+ messages in thread From: Thomas Gleixner @ 2021-05-21 22:46 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Florian Weimer, Dave Hansen, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21 2021 at 18:07, Len Brown wrote: > On Fri, May 21, 2021 at 3:10 PM Thomas Gleixner <tglx@linutronix.de> wrote: > Regarding pre-allocation vs on-demand allocation, consider two scenarios: > > 1. Synchronous. At process or thread start up time, prctl() > synchronously allocates 8K context switch buffers. Return code is 0 > -- good go go! 10 seconds later the program decides to create > additional threads. Woops. vmalloc failed, and the process > synchronously dies. bug filed. No. pthread_create() will fail with -ENOMEM. A return value of -ENOMEM is not a bug. If the application fails to check the error code then it's not the kernel's problem and not a kernel bug either. > 2. On demand. Same scenario, except vmalloc failure upon creation of > those additional threads sends a SIGSEGV at the instruction where AMX > is touched. bug filed. > > Why ignore the 2nd bug and not ignore the 1st bug? See above. > My concern about synchronous allocation is that it will be very easy > to abuse. programs and threads can ask for buffers they will never > use. With on-demand allocation, we allocate buffers only if they are > actually needed. Programs ask for memory in various ways. The buffer is not any different than any other memory allocation of the application/thread. It's accounted for and when the limits are reached the allocation fails.
But it fails in a way which can be acted upon at the application level and not in a way where the kernel has no other choice than killing the whole process. So where is the problem? Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 22:46 ` Thomas Gleixner @ 2021-05-21 23:31 ` Len Brown 2021-05-22 7:16 ` Florian Weimer 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-05-21 23:31 UTC (permalink / raw) To: Thomas Gleixner Cc: Andy Lutomirski, Florian Weimer, Dave Hansen, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau With this proposed API, we seem to be combining two requirements, and I wonder if we should be treating them independently. Requirement 1: "Fine grained control". We want the kernel to be able to prohibit a program from using AMX. The foundation for this is a system call to which the kernel can say "No". It may deny access for whatever reason it wants, including inability to allocate a buffer, or some TBD administrator-invoked hook in the system call, say membership or lack of membership of the process in an empowered cgroup. Requirement 2: Ability to synchronously fail upon buffer allocation. I agree that pthread_create() returning an error code is a friendlier way to kill a program than a SIGSEGV when touching AMX state for the first time. But the reality is, that program is almost certainly going to exit either way. So the 1st question is if the system call requesting permission should be on a per-process basis, or a per-task basis. A. per-task. If we do it this way, then we will likely wind up mandating a GET at the start of every routine in every library that touches AMX, and potentially also a PUT. This is because the library has no idea what thread called it. The plus is that this will address the "used once and sits on a buffer for the rest of the process lifetime" scenario. The minus is that high performance users will be executing thousands of unnecessary system calls that have zero value. B. per-process.
If we do it this way, then the run time linker can do a single system call on behalf of the entire process, and there is no need to sprinkle system calls throughout the library. Presumably the startup code would query CPUID, query XCR0, query this system call, and set a global variable to be accessed by all threads going forward. The plus is that permission makes more sense on a process basis than on a task basis. Why would the kernel give one thread in a process permission, and not another thread -- and if that happened, would a process actually be able to figure out what to do? If we do per-process, I don't see that the PUT call would be useful, and I would skip it. Neither A nor B has an advantage in the situation where a thread is created long after initialization and faces memory allocation failure. A synchronously fails in the new system call, and B synchronously fails in pthread_create. The 2nd question is if "successful permission" implies synchronous allocation, or perhaps it allows "please enable on-demand dynamic allocation" X. Synchronous Allocation results in allocation failures returning a synchronous error code, explaining why the program needs to exit. The downside is that it is likely that in both case A and B, every thread in the program will allocate a buffer, whether they ever use it or not. Indeed, it is possible that the API we have invented to manage AMX buffer use will actually *increase* AMX buffer use... Y. Enable on-demand allocation. Here the system call enables XFD to not kill the process, but on first use to allocate a buffer for a thread that is actually touching AMX. The benefit is if you have a program with many threads, only the ones that actually use AMX will allocate buffers. Of course the down side is that this program is exposed to a SIGSEGV if vmalloc fails in that run-time allocation, rather than a friendly pthread_create -1 return code killing the program.
And, of course, we can have our cake and eat it too, by having the syscall tell the kernel whether it wants (X) or (Y). The question is whether it is worth the complexity of having two options. Thoughts? -Len ^ permalink raw reply [flat|nested] 130+ messages in thread
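The buffer-footprint trade-off between (X) and (Y) is plain arithmetic, sketched here under the thread's own figure of 8K per context switch buffer. The function names and the 64-thread example are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* "8K context switch buffers", as stated earlier in the thread. */
#define AMX_BUF_BYTES ((size_t)8192)

/* Option X (synchronous): every thread of a permitted process gets a
 * buffer up front, whether or not it ever touches AMX. */
size_t bytes_synchronous(size_t nthreads)
{
    return nthreads * AMX_BUF_BYTES;
}

/* Option Y (on demand): only threads that actually touch AMX get one. */
size_t bytes_on_demand(size_t nthreads_using_amx)
{
    return nthreads_using_amx * AMX_BUF_BYTES;
}
```

For a process with 64 threads of which 4 use AMX, option X reserves 64 * 8192 = 524288 bytes (512 KiB), while option Y reserves 4 * 8192 = 32768 bytes (32 KiB). The price of Y is that an allocation failure surfaces as a signal at first use instead of as an error return.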
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 23:31 ` Len Brown @ 2021-05-22 7:16 ` Florian Weimer 2021-05-22 23:55 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Florian Weimer @ 2021-05-22 7:16 UTC (permalink / raw) To: Len Brown Cc: Thomas Gleixner, Andy Lutomirski, Dave Hansen, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau * Len Brown: > A. per-task. If we do it this way, then we will likely wind up > mandating a GET at the start of every routine in every library that > touches AMX, and potentially also a PUT. This is because the library > has no idea what thread called it. The plus is that this will address > the "used once and sits on a buffer for the rest of the process > lifetime' scenario. The minus is that high performance users will be > executing thousands of unnecessary system calls that have zero value. We could revive the KTLS proposal (userspace donates memory for use by the kernel & vDSO), and the thread could reserve (on-stack) buffer space for kernel use for the duration of the AMX computation. There would be a pointer to that space in the KTLS area, set upon entry of the AMX region, and cleared upon exit. It's not extremely cheap (unbounded alloca has a stack probing loop nowadays). But no system call is required. Thanks, Florian ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-22 7:16 ` Florian Weimer @ 2021-05-22 23:55 ` Andy Lutomirski 0 siblings, 0 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-05-22 23:55 UTC (permalink / raw) To: Florian Weimer Cc: Len Brown, Thomas Gleixner, Andy Lutomirski, Dave Hansen, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau > On May 22, 2021, at 12:17 AM, Florian Weimer <fweimer@redhat.com> wrote: > > * Len Brown: > >> A. per-task. If we do it this way, then we will likely wind up >> mandating a GET at the start of every routine in every library that >> touches AMX, and potentially also a PUT. This is because the library >> has no idea what thread called it. The plus is that this will address >> the "used once and sits on a buffer for the rest of the process >> lifetime' scenario. The minus is that high performance users will be >> executing thousands of unnecessary system calls that have zero value. > > We could revive the KTLS proposal (userspace donates memory for use by > the kernel & vDSO), and the thread could reserve (on-stack) buffer space > for kernel use for the duration of the AMX computation. There would be > a pointer to that space in the KTLS area, set upon entry of the AMX > region, and cleared upon exit. It's not extremely cheap (unbounded > alloca has a stack probing loop nowadays). But no system call is > required. > Making this work well would be very nasty. The memory *must* be available at context switch out time, which means it would need to be pinned at context switch in time, which is not great. But also Intel, in its infinite wisdom, decided to mix “supervisor” states in with the state that user space is permitted to directly access. Putting the supervisor state on the stack would be problematic.
^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 22:07 ` Len Brown 2021-05-21 22:46 ` Thomas Gleixner @ 2021-05-21 23:06 ` Dave Hansen 2021-05-21 23:08 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Dave Hansen @ 2021-05-21 23:06 UTC (permalink / raw) To: Len Brown, Thomas Gleixner Cc: Andy Lutomirski, Florian Weimer, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau On 5/21/21 3:07 PM, Len Brown wrote: > My concern about synchronous allocation is that it will be very easy > to abuse. programs and threads can ask for buffers they will never > use. With on-demand allocation, we allocate buffers only if they are > actually needed. If someone wants to abuse the on-demand allocation, they will simply write a single bit to an AMX register. That does *NOT* mean they will actually execute an instruction that actually uses AMX to do something meaningful. In the face of abuse, I think the two approaches are very similar. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 23:06 ` Dave Hansen @ 2021-05-21 23:08 ` Len Brown 0 siblings, 0 replies; 130+ messages in thread From: Len Brown @ 2021-05-21 23:08 UTC (permalink / raw) To: Dave Hansen Cc: Thomas Gleixner, Andy Lutomirski, Florian Weimer, Dave Hansen via Libc-alpha, Rich Felker, Linux API, Bae, Chang Seok, the arch/x86 maintainers, Linux Kernel Mailing List, Kyle Huey, Borislav Petkov, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21, 2021 at 7:06 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 5/21/21 3:07 PM, Len Brown wrote: > > My concern about synchronous allocation is that it will be very easy > > to abuse. programs and threads can ask for buffers they will never > > use. With on-demand allocation, we allocate buffers only if they are > > actually needed. > > If someone wants to abuse the on-demand allocation, they will simply > write a single bit to an AMX register. That does *NOT* mean they will > actually execute an instruction that actually uses AMX to do something > meaningful. > > In the face of abuse, I think the two approaches are very similar. I didn't mean "abuse" in terms of malicious resource hogging. I meant "abuse" in terms of unnecessarily using resources out of laziness. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-21 16:19 ` Florian Weimer ` (2 preceding siblings ...) 2021-05-21 16:31 ` Andy Lutomirski @ 2021-05-21 19:05 ` Thomas Gleixner 3 siblings, 0 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-21 19:05 UTC (permalink / raw) To: Florian Weimer, Dave Hansen Cc: Dave Hansen via Libc-alpha, Len Brown, Rich Felker, Linux API, Bae, Chang Seok, X86 ML, LKML, Kyle Huey, Borislav Petkov, Andy Lutomirski, Keno Fischer, Arjan van de Ven, Willy Tarreau On Fri, May 21 2021 at 18:19, Florian Weimer wrote: > * Dave Hansen: >> On 5/21/21 7:44 AM, Florian Weimer wrote: >>> Why can't userspace look at XCR0 to make the decision? >> >> The thing we're trying to avoid is a #NM exception from XFD (the new >> first-use detection feature) that occurs on the first use of AMX. >> XCR0 will have XCR0[AMX]=1, even if XFD is "armed" and ready to >> generate the #NM. > > I see. So essentially the hardware wants to offer transparent > initialize-on-use, but Linux does not seem to want to implement it this > way. The hardware offers an exception which can be used to implement that, but the hardware does not dictate that usage. If we went that way we would lose any control over that resource and I can demonstrate with AVX512 today what kind of consequences that has with mixed criticality realtime workloads. The only solution we have today is to disable AVX512 completely, which sucks because restricted usage can be beneficial for some of the computations. The problem is that the approach of user space in general seems to be blindly_select_max(AVX). I've seen that in quite some places. With AMX (and the stuff coming next) we have the chance to do proper resource control and it would be outright stupid not to take that opportunity. Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 20:54 ` Thomas Gleixner 2021-05-20 21:13 ` Dave Hansen @ 2021-05-20 21:22 ` Len Brown 2021-05-20 21:41 ` Thomas Gleixner 1 sibling, 1 reply; 130+ messages in thread From: Len Brown @ 2021-05-20 21:22 UTC (permalink / raw) To: Thomas Gleixner Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven On Thu, May 20, 2021 at 4:54 PM Thomas Gleixner <tglx@linutronix.de> wrote: Thomas, > > AMX is analogous to the multiplier used by AVX-512. > > The architectural state must exist on every CPU, including HT siblings. > > Today, the HT siblings share the same execution unit, > > and I have no reason to expect that will change. > > I'm well aware that HT siblings share the same execution unit for > AVX. > > Though AMX is if I remember the discussions two years ago correctly > shared by more than the HT siblings which makes things worse. I regret that we were unable to get together in the last year to have an updated discussion. I think if we had, then we would have saved a lot of mis-understanding and a lot of email! So let me emphasize here: There is one TMUL execution unit per core. It is shared by the HT siblings within that core. So the comparison to the AVX-512 multiplier is a good one. Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 21:22 ` Len Brown @ 2021-05-20 21:41 ` Thomas Gleixner 2021-05-20 21:49 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Thomas Gleixner @ 2021-05-20 21:41 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven Len, On Thu, May 20 2021 at 17:22, Len Brown wrote: > On Thu, May 20, 2021 at 4:54 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> > AMX is analogous to the multiplier used by AVX-512. >> > The architectural state must exist on every CPU, including HT siblings. >> > Today, the HT siblings share the same execution unit, >> > and I have no reason to expect that will change. >> >> I'm well aware that HT siblings share the same execution unit for >> AVX. >> >> Though AMX is if I remember the discussions two years ago correctly >> shared by more than the HT siblings which makes things worse. > > I regret that we were unable to get together in the last year to have > an updated discussion. I think if we had, then we would have saved > a lot of mis-understanding and a lot of email! > > So let me emphasize here: > > There is one TMUL execution unit per core. > It is shared by the HT siblings within that core. > > So the comparison to the AVX-512 multiplier is a good one. Fine, but that does not at all change the facts that: 1) It's shared between logical CPUs 2) It has effects on power/thermal and therefore effects which reach outside of the core scope 3) Your approach of making it unconditionally available via the proposed #NM prevents the OS and subsequently the system admin / system designer from implementing fine grained control over that resource. And no, an opt-in approach by providing a non-mandatory preallocation prctl does not solve that problem.
Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 21:41 ` Thomas Gleixner @ 2021-05-20 21:49 ` Len Brown 2021-05-21 9:26 ` Thomas Gleixner 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-05-20 21:49 UTC (permalink / raw) To: Thomas Gleixner Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven On Thu, May 20, 2021 at 5:41 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > Len, > > On Thu, May 20 2021 at 17:22, Len Brown wrote: > > On Thu, May 20, 2021 at 4:54 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> > AMX is analogous to the multiplier used by AVX-512. > >> > The architectural state must exist on every CPU, including HT siblings. > >> > Today, the HT siblings share the same execution unit, > >> > and I have no reason to expect that will change. > >> > >> I'm well aware that HT siblings share the same execution unit for > >> AVX. > >> > >> Though AMX is if I remember the discussions two years ago correctly > >> shared by more than the HT siblings which makes things worse. > > > > I regret that we were unable to get together in the last year to have > > an updated discussion. I think if we had, then we would have saved > > a lot of mis-understanding and a lot of email! > > > > So let me emphasize here: > > > > There is one TMUL execution unit per core. > > It is shared by the HT siblings within that core. > > > > So the comparison to the AVX-512 multiplier is a good one. > > Fine, but that does not at all change the facts that: > > 1) It's shared between logical CPUs > > 2) It has effects on power/thermal and therefore effects which reach > outside of the core scope FWIW, this is true of *every* instruction in the CPU. Indeed, even when the CPU is executing *no* instructions at all, the C-state chosen by that CPU has power/thermal impacts on its peers. 
Granted, high performance instructions such as AVX-512 and TMUL are the most extreme case. > 3) Your approach of making it unconditionally available via the > proposed #NM prevents the OS and subsequently the system admin / > system designer to implement fine grained control over that > resource. > > And no, an opt-in approach by providing a non-mandatory > preallocation prctl does not solve that problem. I'm perfectly fine with making the explicit allocation (aka opt-in) mandatory, and enforcing it. Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-05-20 21:49 ` Len Brown @ 2021-05-21 9:26 ` Thomas Gleixner 0 siblings, 0 replies; 130+ messages in thread From: Thomas Gleixner @ 2021-05-21 9:26 UTC (permalink / raw) To: Len Brown Cc: Borislav Petkov, Willy Tarreau, Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, Linux API, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer, Arjan van de Ven Len, On Thu, May 20 2021 at 17:49, Len Brown wrote: > On Thu, May 20, 2021 at 5:41 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> 2) It has effects on power/thermal and therefore effects which reach >> outside of the core scope > > FWIW, this is true of *every* instruction in the CPU. > Indeed, even when the CPU is executing *no* instructions at all, > the C-state chosen by that CPU has power/thermal impacts on its peers. > > Granted, high performance instructions such as AVX-512 and TMUL > are the most extreme case. Right and we have to draw the line somewhere. >> 3) Your approach of making it unconditionally available via the >> proposed #NM prevents the OS and subsequently the system admin / >> system designer to implement fine grained control over that >> resource. >> >> And no, an opt-in approach by providing a non-mandatory >> preallocation prctl does not solve that problem. > > I'm perfectly fine with making the explicit allocation (aka opt-in) mandatory, > and enforcing it. Great! Thanks, tglx ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-19 21:33 ` Len Brown 2021-04-19 21:58 ` Borislav Petkov @ 2021-04-19 23:52 ` Paul Eggert 1 sibling, 0 replies; 130+ messages in thread From: Paul Eggert @ 2021-04-19 23:52 UTC (permalink / raw) To: Len Brown Cc: Florian Weimer, linux-abi, libc-alpha, Bae, Chang Seok, X86 ML, LKML, Dave Hansen, Kyle Huey, Rich Felker, Andy Lutomirski, Keno Fischer, Willy Tarreau, Borislav Petkov On 4/19/21 2:33 PM, Len Brown via Libc-alpha wrote: > the AI guys are super excited about matrix multiplication, > but I have a hard time imagining why grep(1) would find a use for it. I don't. Matrix multiplication is used in modern string-searching algorithms that could be useful in running 'grep' on CPUs that have relevant hardware support. See, for example: Susanina Y, Yaveyn A, Grigorev S. Modification of Valiant’s Parsing Algorithm for the String-Searching Problem. CIBB 2019. https://doi.org/10.1007/978-3-030-63061-4_17 Although nowadays this technology is typically proposed for bioinformatics (DNA pattern matching, etc.), it's not that much of a stretch to imagine a future 'grep' or 'diff' that does matrix multiplication. After all, GNU 'diff' currently uses an algorithm designed by a DNA expert. (We now return you to the regular AMX debates. :-) ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-12 23:46 ` Len Brown 2021-04-13 0:17 ` Thomas Gleixner 2021-04-13 3:43 ` Willy Tarreau @ 2021-04-13 20:16 ` Andy Lutomirski 2021-04-13 22:47 ` Len Brown 2 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-04-13 20:16 UTC (permalink / raw) To: Len Brown, Willy Tarreau Cc: Andy Lutomirski, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Mon, Apr 12, 2021 at 4:46 PM Len Brown <lenb@kernel.org> wrote: > > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote: > > > AMX: Multiplying a 4x4 matrix probably looks *great* in a > > microbenchmark. Do it once and you permanently allocate 8kB (is that > > even a constant? can it grow in newer parts?), potentially hurts all > > future context switches, and does who-knows-what to Turbo licenses and > > such. > > Intel expects that AMX will be extremely valuable to key workloads. > It is true that you may never run that kind of workload on the machine > in front of you, > and so you have every right to be doubtful about the value of AMX. I fully believe that AMX will be amazing when used for the right workload. The problem is that a library may have no way to tell whether a workload is the type of computationally intensive workload for which it makes sense. Imagine you have a little function: int matrix_times_vector(int dim, float *out, const float *matrix, const float *vector); A clever library might use AMX for this. If dim == 4 and the caller is planning to call it in a long, tight loop, maybe this even makes sense. If dim == 4 and it's being called once, AMX is probably a losing proposition. With previous technologies, at least the impact was limited to the function itself and maybe once per call to the caller. But now, with AMX, the program that invoked this takes a performance and memory hit *forever* if it uses AMX once. 
Beyond that, we have the signal handling issue. One solution, going off of what Willy mentioned, is: bool amx_begin(void *signal_save_buffer); void amx_end(); In the amx_begin() region, if you get a signal, the AMX state is saved in the buffer. Outside the region, if you get a signal and AMX is in use, the kernel will either unceremoniously kill the task or will deliver SIGYOUBLEWIT. [0] I'm really hoping some userspace people can chime in. [0] We really ought to have a SIGSIGNALFAILURE or something for the case where normal signal delivery fails. This is the userspace equivalent of #DF. SIGYOUBLEWIT could be folded in. There would be a flag in the signal frame saying "don't even try to sigreturn". ^ permalink raw reply [flat|nested] 130+ messages in thread
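To make the "someone has to decide" point concrete, here is a minimal sketch of a size-threshold dispatch behind the matrix_times_vector() prototype from the message above. Everything except that prototype is an assumption made for illustration: ACCEL_MIN_DIM is an invented cutoff, mtv_accel() is a placeholder that only counts calls and falls back to the scalar loop (no AMX instructions are issued), and the 0 / -1 return convention is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: a real library would have to measure this per platform. */
#define ACCEL_MIN_DIM 64

/* Counts how often the (placeholder) accelerated path was chosen. */
static unsigned accel_path_calls;

/* Scalar path: out = matrix * vector, with matrix stored dim x dim, row-major. */
static void mtv_scalar(int dim, float *out, const float *matrix,
                       const float *vector)
{
    for (int r = 0; r < dim; r++) {
        float acc = 0.0f;
        for (int c = 0; c < dim; c++)
            acc += matrix[(size_t)r * (size_t)dim + (size_t)c] * vector[c];
        out[r] = acc;
    }
}

/*
 * Placeholder for an accelerated implementation. It issues no AMX
 * instructions; it records that it was selected and reuses the scalar loop.
 */
static void mtv_accel(int dim, float *out, const float *matrix,
                      const float *vector)
{
    accel_path_calls++;
    mtv_scalar(dim, out, matrix, vector);
}

/* Returns 0 on success, -1 if dim is not positive (invented convention). */
int matrix_times_vector(int dim, float *out, const float *matrix,
                        const float *vector)
{
    assert(out != NULL && matrix != NULL && vector != NULL);
    if (dim <= 0)
        return -1;
    if (dim < ACCEL_MIN_DIM)
        mtv_scalar(dim, out, matrix, vector);   /* small input: stay cheap */
    else
        mtv_accel(dim, out, matrix, vector);    /* large input: worth the setup */
    return 0;
}
```

With this sketch a dim == 4 call never reaches the accelerated path. A threshold on dim still cannot see whether the caller sits in a long, tight loop, which is the part of the decision that the library has no way to make on its own.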
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-13 20:16 ` Andy Lutomirski @ 2021-04-13 22:47 ` Len Brown 2021-04-13 22:58 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-13 22:47 UTC (permalink / raw) To: Andy Lutomirski Cc: Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Tue, Apr 13, 2021 at 4:16 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Mon, Apr 12, 2021 at 4:46 PM Len Brown <lenb@kernel.org> wrote: > > > > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > > AMX: Multiplying a 4x4 matrix probably looks *great* in a > > > microbenchmark. Do it once and you permanently allocate 8kB (is that > > > even a constant? can it grow in newer parts?), potentially hurts all > > > future context switches, and does who-knows-what to Turbo licenses and > > > such. > > > > Intel expects that AMX will be extremely valuable to key workloads. > > It is true that you may never run that kind of workload on the machine > > in front of you, > > and so you have every right to be doubtful about the value of AMX. > > I fully believe that AMX will be amazing when used for the right > workload. The problem is that a library may have no way to tell > whether a workload is the type of computationally intensive workload > for which it makes sense. Imagine you have a little function: > > int matrix_times_vector(int dim, float *out, const float *matrix, > const float *vector); > > A clever library might use AMX for this. If dim == 4 and the caller > is planning to call it in a long, tight loop, maybe this even makes > sense. If dim == 4 and it's being called once, AMX is probably a > losing proposition. With previous technologies, at least the impact > was limited to the function itself and maybe once per call to the > caller. 
But now, with AMX, the program that invoked this takes a > performance and memory hit *forever* if it uses AMX once. Again... As this is a "clever" library, built with a clever toolchain, and the result is that TILERELEASE was properly issued at the end of computation. Thus the hardware knows that the (volatile) AMX registers are no longer live. The XSAVE hardware recognizes this INIT=1 condition and transfers NO DATA "*forever*". This is true both on context switch (compacted) where it is automatic, and on (uncompacted) signal delivery, where we check for this case. Was that the "performance hit" of concern, or did I miss something? Yes, it is true that the kernel allocated a context switch buffer for the lifetime of that task, and it will not be freed until that task exits. If this proves to be an issue, there is nothing preventing us from implementing a re-claim scheme for a rarely used buffer. After recognizing this situation, we'd simply arm XFD, free the buffer, and from then onwards, the task behaves exactly as if it had never touched AMX. However, nobody has yet suggested that would be a common situation worth an optimization to reclaim that task's 8KB. > Beyond that, we have the signal handling issue. I'm unaware of any unresolved feedback on the signal handling series other than a wistful "wouldn't a new SIGFAIL be more clear (for future apps) than the existing SIGSEGV?" I agree with this sentiment, but I don't think we should hold up a patch to prevent corrupting user data because a new signal number to describe the scenario doesn't exist. Particularly since the new code that knows about the new SIGFAIL will also be new code that has been compiled with the new glibc that for most cases will prevent this scenario in the first place... > One solution, going > off of what Willy mentioned, is: > > bool amx_begin(void *signal_save_buffer); > void amx_end(); > > In the amx_begin() region, if you get a signal, the AMX state is saved > in the buffer.
Outside the region, if you get a signal and AMX is in > use, the kernel will either unceremoniously kill the task or will > deliver SIGYOUBLEWIT. [0] I think it is clear that if a new signal ABI is going to be invented, that it should be opt-in on state, so that it can run fast on machines far into the future by not choosing to opt-in on anything. It isn't clear that changing the signal save state around critical regions (in multiple threads) so that a single (per process definition) of a signal handler gets a different result at different times is going to make that (new) signal handler author especially happy. More likely they either always want the state, or they do not. > I'm really hoping some userspace people can chime in. Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-13 22:47 ` Len Brown @ 2021-04-13 22:58 ` Andy Lutomirski 2021-04-14 21:48 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-04-13 22:58 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Tue, Apr 13, 2021 at 3:47 PM Len Brown <lenb@kernel.org> wrote: > > On Tue, Apr 13, 2021 at 4:16 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Mon, Apr 12, 2021 at 4:46 PM Len Brown <lenb@kernel.org> wrote: > > > > > > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > > AMX: Multiplying a 4x4 matrix probably looks *great* in a > > > > microbenchmark. Do it once and you permanently allocate 8kB (is that > > > > even a constant? can it grow in newer parts?), potentially hurts all > > > > future context switches, and does who-knows-what to Turbo licenses and > > > > such. > > > > > > Intel expects that AMX will be extremely valuable to key workloads. > > > It is true that you may never run that kind of workload on the machine > > > in front of you, > > > and so you have every right to be doubtful about the value of AMX. > > > > I fully believe that AMX will be amazing when used for the right > > workload. The problem is that a library may have no way to tell > > whether a workload is the type of computationally intensive workload > > for which it makes sense. Imagine you have a little function: > > > > int matrix_times_vector(int dim, float *out, const float *matrix, > > const float *vector); > > > > A clever library might use AMX for this. If dim == 4 and the caller > > is planning to call it in a long, tight loop, maybe this even makes > > sense. If dim == 4 and it's being called once, AMX is probably a > > losing proposition. 
With previous technologies, at least the impact > > was limited to the function itself and maybe once per call to the > > caller. But now, with AMX, the program that invoked this takes a > > performance and memory hit *forever* if it uses AMX once. > > Again... > > As this is a "clever" library, built with a clever toolchain, and the > result is that > TILERELEASE was properly issued at the end of computation. > Thus the hardware knows that the (volatile) AMX registers are no longer live. My argument has *nothing* to do with TILERELEASE. Let me try again. Suppose I write some user code and call into a library that uses AMX because the library authors benchmarked it and determined that using AMX is faster when called in a loop. But I don't call it in a loop. Then I take the transition penalty into and out of AMX code (I'll believe there is no penalty when I see it -- we've had a penalty with VEX and with AVX-512) and my program runs *slower*. And, to top it off, I've just permanently allocated 8kB of extra FPU state buffer, *and* I'm taking either an XCR0 or an XFD write penalty on every future context switch. Someone or something needs to make a decision as to whether AMX should actually be used for a given algorithm. The user library community has swept this under the rug by declaring that libraries should use the best-in-a-tight-loop code for the entire existence of extensions beyond XMM, and the cost keeps getting higher. > > Beyond that, we have the signal handling issue. > > I'm unaware of any unresolved feedback on the signal handling series > other than a wistful "wouldn't a new SIGFAIL be more clear (for future apps) > than the existing SIGSEGV?" I agree with this sentiment, but I don't > think we should hold up a patch to prevent corrupting user data > because a new signal number to describe the scenario doesn't exist.
> Particularly since the new code that knows about the new SIGFAIL > will also be new code that has been compiled with the new glibc > that for most cases will prevent this scenario in the first place... > > > One solution, going > > off of what Willy mentioned, is: > > > > bool amx_begin(void *signal_save_buffer); > > void amx_end(); > > > > In the amx_begin() region, if you get a signal, the AMX state is saved > > in the buffer. Outside the region, if you get a signal and AMX is in > > use, the kernel will either unceremoniously kill the task or will > > deliver SIGYOUBLEWIT. [0] > > I think it is clear that if a new signal ABI is going to be invented, > that it should be opt-in on state, so that it can run fast on machines > far into the future by not choosing to opt-in on anything. > > It isn't clear that changing the signal save state around critical regions > (in multiple threads) so that a single (per process definition) of a signal > handler gets a different result at different times is going to make that > (new) signal handler author especially happy. More likely they > either always want the state, or they do not. Perhaps some form of decision should be reached before AMX lands? Landing AMX in its current form is a decision, and we should make a credible effort to decide if it's the right one. --Andy ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-13 22:58 ` Andy Lutomirski @ 2021-04-14 21:48 ` Len Brown 2021-04-15 16:24 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-14 21:48 UTC (permalink / raw) To: Andy Lutomirski Cc: Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Tue, Apr 13, 2021 at 6:59 PM Andy Lutomirski <luto@kernel.org> wrote: > Suppose I write some user code and call into a library that uses AMX > because the library authors benchmarked it and determined that using > AMX is faster when called in a loop. But I don't call it in a loop. Again... AMX registers are volatile. That means if a routine uses them and expects them to persist across a call, the caller must save them. It also means that they can't be used as globals, they can't be used for passing parameters, and they can't be used for static information in a called routine. And so a routine that uses AMX starts from scratch, and finishes with no state in registers. Everything gets loaded from memory into AMX, joyous number crunching proceeds, and the answer is stored in memory when the routine is complete. Could somebody write a routine that uses AMX to perform a single multiply, and then call that routine from a loop? Yes, that is possible, but I would challenge them to demonstrate it is a good idea under *any* conditions. More realistic, perhaps would be a routine that does a matrix multiply and is passed matrices of variable size. It would be easy to demonstrate that is a performance win on a big matrix, but then what happens when somebody calls that routine with a matrix size of 1? Surely that would be a net loss. This is pretty much exactly what Willy described. It is faster to copy a 5-byte structure by hand, than to call bcopy(). 
Indeed, dedicated co-processors have been built to do what AMX does, but the work has to be big enough to out-weigh the overhead of invoking them. This trade-off is as old as the hills, and yes, it is possible to screw up. > Then I take the transition penalty into and out of AMX code (I'll > believe there is no penalty when I see it -- we've had a penalty with > VEX and with AVX-512) and my program runs *slower*. If you have a clear definition of what "transition penalty" is, please share it. Lacking one, I'll assume you are referring to the impact on turbo frequency of using AMX hardware? Again... On the hardware that supports AMX, there is zero impact on frequency due to the presence of AMX state, whether modified or unmodified. We resolved on another thread that Linux will never allow entry into idle with modified AMX state, and so AMX will have zero impact on the ability of the process to enter deep power-saving C-states. It is true that AMX activity is considered when determining max turbo. (as it must be) However, the *release* of the turbo credits consumed by AMX is "several orders of magnitude" faster on this generation than it was for AVX-512 on pre-AMX hardware. I respect your right to not believe me about performance until you have this hardware. But proposing a new ABI based on concern of a problem that hasn't been shown to exist would be folly. > And, to top it > off, I've just permanently allocated 8kB of extra FPU state buffer, > *and* I'm taking either an XCR0 or an XFD write penalty on every > future context switch. Again... We allocate an 8kB FPU state buffer for tasks that use AMX. We do not allocate that buffer for tasks that do not use AMX. If it turns out to be common that a long running task touches AMX once and never again, it would not be difficult to optimize for that case and free the buffer. Again... Yes, the proposal, and the working patch set on the list, context switches XFD -- which is exactly what that hardware was designed to do. 
If the old and new tasks have the same value of XFD, the MSR write is skipped. I'm not aware of any serious proposal to context-switch XCR0, as it would break the current programming model, where XCR0 advertises what the OS supports. It would also impact performance, as every write to XCR0 necessarily provokes a VMEXIT. > Someone or something needs to make a decision as to whether AMX should > actually be used for a given algorithm. The user library community > has swept this under the rug by declaring that libraries should use > the best-in-a-tight-loop code for the entire existence of extensions > beyond XMM, and the cost keeps getting higher. Is this a plea for library writers to run the simplest microbenchmarks to determine if their code makes any sense at all before releasing it? If so, I agree. > Perhaps some form of decision should be reached before AMX lands? > Landing AMX in its current form is a decision, and we should make a > credible effort to decide if it's the right one. Three questions come to mind: 1. Do we have a power or performance issue? 2. Do we have ABI breakage? 3. Can we do better in the long term? 1. Power or Performance issue? Per above, and multiple other threads, I'm not aware of any unresolved power or performance issues on AMX-capable hardware. 2. ABI Breakage? We all recognize that the signal.h hard-coded alt-sig-stack size ABI was ill conceived, and overlooked -- for decades. We all recognize that there are a non-zero number of applications that fail on AVX-512 hardware because of this issue. We all recognize that if not addressed, AVX would increase the likelihood of an application failing due to the too-small-alternative-signal-stack issue. I thank the ARM community for taking action on this issue and setting the example of the ALT-VEC solution to compute and expose required stack size at run-time. 
I further thank HJ Lu and the libc team for picking up that ball and shipping the solution to this problem with an updated ABI in glibc 2.34. I acknowledge that it is not impossible to fail after that fix -- you can ignore the ABI, or you could hardcode sizes, or you could be statically linked to the old libc. But it gets increasingly harder to fail, the kernel signal series has a new run-time check to prevent data corruption that could have happened in the past, and the remedy is clear -- re-build with the new glibc. Yes, it would have been good if this were done before AVX-512 deployed. 3. Can we do better in the long term? Assuming the ABI update in #2 addresses the issue with applications that declare their own alt-sig-stack, we have a backward compatible solution today. Somewhere in the past, the decision was made that all architectural state should be exposed to signal handlers. And a decision was made on x86 that it should be in uncompacted XSTATE format. There are programs today that count on both of these things being true, and if we change either of those, we break applications. But the irony is that there are a vanishingly small number of signal handlers that actually care at all about that state, and it seems to be wasteful to give it to them. So the question is whether to continue giving them all that information, or to give them a way to decline -- to give the option for signals to be more lightweight. We can certainly do this. The question is if it is important enough to bother. What applications would notice if signal handlers were faster? Would those applications be willing to update to opt-in to a new incompatible signal handling ABI, where the kernel took the time to supply only the state that they request? thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
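As an illustration of the run-time sizing described under "2. ABI Breakage?", the following is a minimal sketch of an application that asks for the required signal stack size instead of relying on a compile-time constant. It assumes a Linux system; _SC_MINSIGSTKSZ is available with glibc 2.34 and later, and AT_MINSIGSTKSZ is the auxiliary-vector entry that getauxval() can query where the kernel supplies it. The helper names min_sigstack_size() and install_altstack() and the handler_headroom parameter are invented for this example.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>
#include <unistd.h>

/*
 * Best-effort query of the minimum signal stack size at run time.
 * Falls back to the MINSIGSTKSZ macro when neither run-time source exists.
 */
static size_t min_sigstack_size(void)
{
    long sz = -1;

#ifdef _SC_MINSIGSTKSZ
    sz = sysconf(_SC_MINSIGSTKSZ);              /* glibc 2.34 and later */
#endif
#ifdef AT_MINSIGSTKSZ
    if (sz <= 0) {
        /* getauxval() returns 0 when the entry is absent. */
        unsigned long v = getauxval(AT_MINSIGSTKSZ);
        if (v != 0)
            sz = (long)v;
    }
#endif
    if (sz <= 0)
        sz = (long)MINSIGSTKSZ;                 /* legacy fallback */
    return (size_t)sz;
}

/*
 * Allocates and installs an alternate signal stack for the calling thread.
 * handler_headroom is extra space for the handler's own frames, on top of
 * what the kernel needs for the signal frame. Returns 0 on success, -1 on
 * failure. The stack is intentionally never freed in this sketch.
 */
static int install_altstack(size_t handler_headroom)
{
    stack_t ss;
    size_t size = min_sigstack_size() + handler_headroom;

    ss.ss_sp = malloc(size);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_size = size;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

A caller would run install_altstack() once per thread before registering handlers with SA_ONSTACK. As the message notes, this only helps programs that are rebuilt; a binary with a hard-coded size is unaffected by the sketch.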
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-14 21:48 ` Len Brown @ 2021-04-15 16:24 ` Andy Lutomirski 2021-04-15 17:00 ` Dave Hansen 2021-04-16 21:54 ` Len Brown 0 siblings, 2 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-04-15 16:24 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Wed, Apr 14, 2021 at 2:48 PM Len Brown <lenb@kernel.org> wrote: > > > > Then I take the transition penalty into and out of AMX code (I'll > > believe there is no penalty when I see it -- we've had a penalty with > > VEX and with AVX-512) and my program runs *slower*. > > If you have a clear definition of what "transition penalty" is, please share it. Given the generally awful state of Intel's documentation about these issues, it's quite hard to tell for real. But here are some examples. VEX: Figures 11-1 ("AVX-SSE Transitions in the Broadwell, and Prior Generation Microarchitectures") and 11-2 ("AVX-SSE Transitions in the Skylake Microarchitecture"). We *still* have a performance regression in the upstream kernel because, despite all common sense, the CPUs consider LDMXCSR to be an SSE instruction and VLDMXCSR to be an AVX instruction despite the fact that neither one of them touches the XMM or YMM state at all. AVX-512: https://lore.kernel.org/linux-crypto/CALCETrU06cuvUF5NDSm8--dy3dOkxYQ88cGWaakOQUE4Vkz88w@mail.gmail.com/ https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html > > Lacking one, I'll assume you are referring to the > impact on turbo frequency of using AMX hardware? > > Again... > > On the hardware that supports AMX, there is zero impact on frequency > due to the presence of AMX state, whether modified or unmodified.
> > We resolved on another thread that Linux will never allow entry > into idle with modified AMX state, and so AMX will have zero impact > on the ability of the process to enter deep power-saving C-states. > > It is true that AMX activity is considered when determining max turbo. > (as it must be) > However, the *release* of the turbo credits consumed by AMX is > "several orders of magnitude" faster on this generation > than it was for AVX-512 on pre-AMX hardware. What is the actual impact of a trivial function that initializes the tile config, does one tiny math op, and then does TILERELEASE? > Yes, the proposal, and the working patch set on the list, context > switches XFD -- which is exactly what that hardware was designed to do. > If the old and new tasks have the same value of XFD, the MSR write is skipped. > > I'm not aware of any serious proposal to context-switch XCR0, > as it would break the current programming model, where XCR0 > advertises what the OS supports. It would also impact performance, > as every write to XCR0 necessarily provokes a VMEXIT. You're arguing against a nonsensical straw man. In the patches, *as submitted*, if you trip the XFD #NM *once* and you are the only thread on the system to do so, you will eat the cost of a WRMSR on every subsequent context switch. This is not free. If we use XCR0 (I'm not saying we will -- I'm just mentioning it as a possibility), then the penalty is presumably worse due to the VMX issue. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-15 16:24 ` Andy Lutomirski @ 2021-04-15 17:00 ` Dave Hansen 2021-04-15 17:38 ` Andy Lutomirski 2021-04-16 21:54 ` Len Brown 1 sibling, 1 reply; 130+ messages in thread From: Dave Hansen @ 2021-04-15 17:00 UTC (permalink / raw) To: Andy Lutomirski, Len Brown Cc: Willy Tarreau, Florian Weimer, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On 4/15/21 9:24 AM, Andy Lutomirski wrote: > In the patches, *as submitted*, if you trip the XFD #NM *once* and you > are the only thread on the system to do so, you will eat the cost of a > WRMSR on every subsequent context switch. I think you're saying: If a thread trips XFD #NM *once*, every switch to and from *that* thread will incur the WRMSR cost. The first time I read this, I thought you were saying that all threads would incur a WRMSR cost on every context switch. If that's the case, I grossly misread the patches. :) ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-15 17:00 ` Dave Hansen @ 2021-04-15 17:38 ` Andy Lutomirski 0 siblings, 0 replies; 130+ messages in thread From: Andy Lutomirski @ 2021-04-15 17:38 UTC (permalink / raw) To: Dave Hansen Cc: Andy Lutomirski, Len Brown, Willy Tarreau, Florian Weimer, Bae, Chang Seok, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer > On Apr 15, 2021, at 10:00 AM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 4/15/21 9:24 AM, Andy Lutomirski wrote: >> In the patches, *as submitted*, if you trip the XFD #NM *once* and you >> are the only thread on the system to do so, you will eat the cost of a >> WRMSR on every subsequent context switch. > > I think you're saying: If a thread trips XFD #NM *once*, every switch to > and from *that* thread will incur the WRMSR cost. Indeed. My sentence was missing a few words at the end. > > The first time I read this, I thought you were saying that all threads > would incur a WRMSR cost on every context switch. If that's the case, I > grossly misread the patches. :) ^ permalink raw reply [flat|nested] 130+ messages in thread
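The cost being clarified in this exchange can be pictured with a toy model of the "skip the MSR write when the value is unchanged" rule. This is not kernel code: struct task, switch_to(), cpu_xfd and msr_writes are invented names, and the only hardware detail borrowed is that the tile data state is XSAVE state component 18, so an armed XFD has that bit set.

```c
#include <stdint.h>

/* XFD bit for AMX tile data (state component 18). Set means "armed". */
#define XFD_TILEDATA (UINT64_C(1) << 18)

/* Toy task: just a name and the XFD value it wants while running. */
struct task {
    const char *name;
    uint64_t xfd;
};

static uint64_t cpu_xfd = XFD_TILEDATA;   /* value currently "in the MSR" */
static unsigned msr_writes;                /* how many modeled WRMSRs happened */

/* Models the context-switch rule: write the MSR only if the value changes. */
static void switch_to(const struct task *next)
{
    if (next->xfd != cpu_xfd) {
        cpu_xfd = next->xfd;
        msr_writes++;
    }
}
```

In this model, switching among tasks that never touched AMX costs no writes, and a write happens only on a switch to or from the one task whose XFD bit has been cleared. That matches the reading confirmed above: the WRMSR is paid on switches involving *that* thread, not on every context switch in the system.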
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-15 16:24 ` Andy Lutomirski 2021-04-15 17:00 ` Dave Hansen @ 2021-04-16 21:54 ` Len Brown 2021-04-16 22:03 ` Andy Lutomirski 1 sibling, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-16 21:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Thu, Apr 15, 2021 at 12:24 PM Andy Lutomirski <luto@kernel.org> wrote: > On Wed, Apr 14, 2021 at 2:48 PM Len Brown <lenb@kernel.org> wrote: > > > ... the transition penalty into and out of AMX code The concept of 'transition' exists between AVX and SSE instructions because it is possible to mix both instruction sets and touch different parts of the same registers. The "unused" parts of those registers need to be tracked to ensure that data is not lost when mixing. This concept is moot with AMX, which has its own dedicated registers. > What is the actual impact of a trivial function that initializes the > tile config, does one tiny math op, and then does TILERELEASE?
1. Task takes #NM on first touch of TILE registers
2. Kernel allocates 8KB for that task and dis-arms XFD
3. Kernel context switches XFD with task state
If the task takes a signal *before* TILERELEASE
4. XSAVE transfers AMX state to signal stack, XRSTOR the reverse.
If the task context switches *before* TILERELEASE
5. kernel context switch XSAVES the AMX state to 8KB context switch buffer, XRSTORS the reverse.
If the task takes a signal *after* TILERELEASE
4. XSAVE does NOT transfer AMX state (or zeros) to signal stack, 8KB is consumed on signal stack but not touched. XRSTOR, the reverse.
If the task context switches *after* TILERELEASE
5. kernel context switch ignores INIT=1 AMX state. 8KB buffer is quiescent.
As we discussed previously, there is no impact to frequency from either INIT=0 or INIT=1 AMX state.
Frequency is impacted by *compute*, and since there isn't any compute in this scenario, there is no frequency impact. As we discussed previously, for INIT=1 (which the kernel guarantees), there is also no impact on power. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-16 21:54 ` Len Brown @ 2021-04-16 22:03 ` Andy Lutomirski 2021-04-16 22:10 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-04-16 22:03 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, Apr 16, 2021 at 2:54 PM Len Brown <lenb@kernel.org> wrote: > > On Thu, Apr 15, 2021 at 12:24 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Apr 14, 2021 at 2:48 PM Len Brown <lenb@kernel.org> wrote: > > > > > ... the transition penalty into and out of AMX code > > The concept of 'transition' exists between AVX and SSE instructions > because it is possible to mix both instruction sets and touch different > parts of the same registers. The "unused" parts of those registers > need to be tracked to assure that data is not lost when mixing. I get it. That does not explain why LDMXCSR and VLDMXCSR cause pipeline stalls. > > This concept is moot with AMX, which has its own dedicated registers. > > > What is the actual impact of a trivial function that initializes the > > tile config, does one tiny math op, and then does TILERELEASE? ^^^^ "does one tiny math op" AVX-512 *also* has sort-of-dedicated registers: ZMM16 and up. I still can't find any conclusive evidence as to whether that avoids the performance hit. Intel's track record at actually explaining what operations cause what particular performance disasters is poor, and your explanation is not helping the situation. Sorry. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-16 22:03 ` Andy Lutomirski @ 2021-04-16 22:10 ` Len Brown 2021-04-16 22:14 ` Andy Lutomirski 0 siblings, 1 reply; 130+ messages in thread From: Len Brown @ 2021-04-16 22:10 UTC (permalink / raw) To: Andy Lutomirski Cc: Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer > I get it. That does not explain why LDMXCSR and VLDMXCSR cause > pipelines stalls. Sorry, I thought this thread was about AMX. I don't know the answer to your LDMXCSR and VLDMXCSR question. ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-16 22:10 ` Len Brown @ 2021-04-16 22:14 ` Andy Lutomirski 2021-04-17 1:57 ` Len Brown 0 siblings, 1 reply; 130+ messages in thread From: Andy Lutomirski @ 2021-04-16 22:14 UTC (permalink / raw) To: Len Brown Cc: Andy Lutomirski, Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, Apr 16, 2021 at 3:11 PM Len Brown <lenb@kernel.org> wrote: > > > I get it. That does not explain why LDMXCSR and VLDMXCSR cause > > pipelines stalls. > > Sorry, I thought this thread was about AMX. > I don't know the answer to your LDMXCSR and VLDMXCSR question. My point is that every single major math extension since the original XMM extensions (SSE, etc) has come with performance gotchas. Given Intel's general unwillingness to document the gotchas in hardware that is actually shipping, I'm sceptical that AMX is as delightfully gotcha-free as you are making it out to be. Is there any authoritative guidance at all on what actually happens, performance-wise, when someone does AMX math? ^ permalink raw reply [flat|nested] 130+ messages in thread
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features 2021-04-16 22:14 ` Andy Lutomirski @ 2021-04-17 1:57 ` Len Brown 0 siblings, 0 replies; 130+ messages in thread From: Len Brown @ 2021-04-17 1:57 UTC (permalink / raw) To: Andy Lutomirski Cc: Willy Tarreau, Florian Weimer, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, linux-abi, libc-alpha, Rich Felker, Kyle Huey, Keno Fischer On Fri, Apr 16, 2021 at 6:14 PM Andy Lutomirski <luto@kernel.org> wrote: > My point is that every ... I encourage you to continue to question everything and trust nobody. While it may cost you a lot in counseling, it is certainly valuable, at least to me! :-) I do request, however, that feedback stay specific, stay technical, and stay on-topic. We all have plenty of real challenges we can be tackling with our limited time. > Is there any authoritative guidance at all on what actually happens, > performance-wise, when someone does AMX math? Obviously, I can't speak to the performance of AMX itself pre-production, and somebody who does that for a living will release stuff on or before release day. What I've told you about the performance side-effects on the system (and lack thereof) from running AMX code is an authoritative answer, and is as much as I can tell you today. If I failed to answer a question about AMX, my apologies, please re-ask it. And if we learn something new between now and release day that is relevant to this discussion, I will certainly request to share it. Our team (Intel Open Source Technology Center) advocated getting the existing public AMX documentation published as early as possible. However, if you are really into the details of how AMX works, you may also be interested to know that the AMX hardware patent filings are fully public ;-) cheers, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 130+ messages in thread