Linux-api Archive on lore.kernel.org
* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
       [not found] <CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com>
@ 2021-03-26 23:18 ` Andy Lutomirski
  2021-03-27  3:39   ` Len Brown
  2021-03-28  0:53   ` Thomas Gleixner
  0 siblings, 2 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-26 23:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

Sigh, cc linux-api, not linux-abi.

On Fri, Mar 26, 2021 at 4:12 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Hi all-
>
> After some discussion on IRC, I have a proposal for a Linux ABI for
> using Intel AMX and other similar features.  It works like this:
>
> First, we make XCR0 dynamic.  This looks a lot like Keno's patch but
> with a different API, outlined below.  Different tasks can have
> different XCR0 values.  The default XCR0 for new tasks does not
> include big features like AMX.  XMM and YMM are still there.  The AVX2
> states are debatable -- see below.
>
> To detect features and control XCR0, we add some new arch_prctls:
>
> arch_prctl(ARCH_GET_XCR0_SUPPORT, 0, ...);
>
> returns the set of XCR0 bits supported on the current kernel.
>
> arch_prctl(ARCH_GET_XCR0_LAZY_SUPPORT, 0, ...);
>
> returns 0.  See below.
>
> arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states,
> sigclear_states, 0);
>
> Sets xcr0.  All states are preallocated except that states in
> lazy_states may be unallocated in the kernel until used.  (Not
> supported at all in v1.  lazy_states & ~xcr0 != 0 is illegal.)  States
> in sigsave_states are saved in the signal frame.  States in
> sigclear_states are reset to the init state on signal delivery.
> States in sigsave_states are restored by sigreturn, and states not in
> sigsave_states are left alone by sigreturn.
>
> Optionally we do not support PKRU at all in XCR0 -- it doesn't make
> that much sense as an XSAVE feature, and I'm not convinced that trying
> to correctly context switch XINUSE[PKRU] is worthwhile.  I doubt we
> get it right today.
>
> Optionally we come up with a new format for new features in the signal
> frame, since the current format is showing its age.  Taking 8kB for a
> signal with AMX is one thing.  Taking another 8kB for a nested signal
> if AMX is not in use is worse.
>
> Optionally we make AVX-512 also default off, which fixes what is
> arguably a serious ABI break with AVX-512: lots of programs, following
> POSIX (!), seem to think that they know how much space to allocate
> for sigaltstack().   AVX-512 is too big.
>
> Thoughts?
>
> --Andy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-26 23:18 ` Candidate Linux ABI for Intel AMX and hypothetical new related features Andy Lutomirski
@ 2021-03-27  3:39   ` Len Brown
  2021-03-27  9:14     ` Borislav Petkov
  2021-03-27  9:58     ` Greg KH
  2021-03-28  0:53   ` Thomas Gleixner
  1 sibling, 2 replies; 36+ messages in thread
From: Len Brown @ 2021-03-27  3:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

Hi Andy,

Say a mainline links with a math library that uses AMX without the
knowledge of the mainline.
Say the mainline is also linked with a userspace threading library
that thinks it has a concept of XSAVE area size.

Wouldn't the change in XCR0, resulting in XSAVE size change, risk
confusing the threading library?

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-27  3:39   ` Len Brown
@ 2021-03-27  9:14     ` Borislav Petkov
  2021-03-27  9:58     ` Greg KH
  1 sibling, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2021-03-27  9:14 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML,
	libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer,
	Linux API

On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
> Say a mainline links with a math library that uses AMX without the
> knowledge of the mainline.

What is a "mainline"?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-27  3:39   ` Len Brown
  2021-03-27  9:14     ` Borislav Petkov
@ 2021-03-27  9:58     ` Greg KH
  2021-03-29 15:47       ` Len Brown
  1 sibling, 1 reply; 36+ messages in thread
From: Greg KH @ 2021-03-27  9:58 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML,
	libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer,
	Linux API

On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
> Hi Andy,
> 
> Say a mainline links with a math library that uses AMX without the
> knowledge of the mainline.

What does this mean?  What happened to the context here?

> Say the mainline is also linked with a userspace threading library
> that thinks it has a concept of XSAVE area size.

How can the kernel (what I think you mean by "mainline" here) be linked
with a userspace library at all?

> Wouldn't the change in XCR0, resulting in XSAVE size change, risk
> confusing the threading library?

Shouldn't that be the job of the kernel and not userspace?

totally confused,

greg k-h

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-26 23:18 ` Candidate Linux ABI for Intel AMX and hypothetical new related features Andy Lutomirski
  2021-03-27  3:39   ` Len Brown
@ 2021-03-28  0:53   ` Thomas Gleixner
  2021-03-29  7:27     ` Peter Zijlstra
  2021-03-29 15:06     ` Dave Hansen
  1 sibling, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2021-03-28  0:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

Andy,

On Fri, Mar 26 2021 at 16:18, Andy Lutomirski wrote:
> arch_prctl(ARCH_SET_XCR0, xcr0, lazy_states, sigsave_states,
> sigclear_states, 0);
>
> Sets xcr0.  All states are preallocated except that states in
> lazy_states may be unallocated in the kernel until used.  (Not
> supported at all in v1.  lazy_states & ~xcr0 != 0 is illegal.)  States
> in sigsave_states are saved in the signal frame.  States in
> sigclear_states are reset to the init state on signal delivery.
> States in sigsave_states are restored by sigreturn, and states not in
> sigsave_states are left alone by sigreturn.

I like the idea in principle.
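
For concreteness, the argument checks implied by the quoted
ARCH_SET_XCR0 description could look roughly like the sketch below.
Only the lazy_states rule is stated in the proposal (in C it needs
parentheses, since != binds tighter than &). The other checks, the
struct, the function name and the choice of -EINVAL are assumptions
made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four bitmask arguments of ARCH_SET_XCR0. */
struct xcr0_request {
    uint64_t xcr0;            /* states the task wants enabled */
    uint64_t lazy_states;     /* may stay unallocated until first use */
    uint64_t sigsave_states;  /* saved in the signal frame, restored by sigreturn */
    uint64_t sigclear_states; /* reset to init state on signal delivery */
};

/*
 * supported:      what ARCH_GET_XCR0_SUPPORT would return
 * lazy_supported: what ARCH_GET_XCR0_LAZY_SUPPORT would return (0 in v1)
 * Returns 0 if the request is acceptable, -EINVAL otherwise.
 */
int xcr0_request_check(const struct xcr0_request *r,
                       uint64_t supported, uint64_t lazy_supported)
{
    assert(r != NULL);

    /* ASSUMPTION: asking for a state the kernel does not support fails. */
    if (r->xcr0 & ~supported)
        return -EINVAL;

    /* From the proposal: (lazy_states & ~xcr0) != 0 is illegal. */
    if ((r->lazy_states & ~r->xcr0) != 0)
        return -EINVAL;

    /* From the proposal: lazy allocation is not supported at all in v1. */
    if (r->lazy_states & ~lazy_supported)
        return -EINVAL;

    /* ASSUMPTION: signal masks may only name states that are enabled. */
    if ((r->sigsave_states | r->sigclear_states) & ~r->xcr0)
        return -EINVAL;

    return 0;
}
```

A task that wants AVX-512 saved across signals but AMX merely cleared
would pass the AVX-512 bits in sigsave_states and the AMX bits in
sigclear_states.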

> Optionally we come up with a new format for new features in the signal
> frame, since the current format is showing its age.  Taking 8kB for a
> signal with AMX is one thing.  Taking another 8kB for a nested signal
> if AMX is not in use is worse.

I don't think that we should make that optional to begin with. Sizing
sigaltstack is a lottery as of today and making it more so does not help
at all.

> Optionally we make AVX-512 also default off, which fixes what is
> arguably a serious ABI break with AVX-512: lots of programs, following
> POSIX (!), seem to think that they know how much space to allocate
> for sigaltstack().   AVX-512 is too big.

I really wish we could do that. That AVX512 disaster is not trivial to
sort.

Let's focus on AMX first. That ship at least has not sailed yet, but if
it does without a proper resolution then it's going to sail deep south.
Maybe we end up with some ideas about the AVX512 issue as well that way.

The main problem I see is simply historical. Every other part of the
user stack space from libraries to applications tries to be "smart"
about utilizing the assumed best instruction set, feature extensions
which are detected when something is initialized. I can sing a song of
that because I was casually involved porting debian to an unsupported
architecture. Magic all over the place. Now add the whole pile of
proprietary software stacks, libraries on top of that picture and things
get completely out of control.

Why? Simply because user space has absolutely no concept about
orchestrating these things at all. That worked for a while by some
definition of works and this model is still proliferated today even by
players who should know better.

Even if you expected that some not so distant events and the experience
with fleet consistency would have stopped the 'performance first,
features first' chorus in some way, that's not what reality is.

Linux is not necessarily innocent. For years we just crammed features
into the kernel without thinking too hard about the big picture. But,
yes we realized the hard way that there is a problem and just adding yet
another magic 'make it work' hack for AMX is definitely the wrong
approach.

What are the possible problems when we make it a hard requirement for
AMX to be requested by an application/task in order to use it?

For the kernel itself. Not really any consequence I can think of
aside of unhappy campers in user space.

For user space this is disruptive and we have at least to come up with
some reasonable model how all involved components with different ideas
of how to best utilize a given CPU can be handled.

That starts at the very simple problem of feature enumeration. Up to now
CPUID is non-privileged and a large amount of user space just takes
that as the ultimate reference. We can change that when CPUID faulting
in CPL3 is supported by the CPU which we can't depend on because it is
not architectural.

Though the little devil in my head tells me, that making AMX support
depend on the CPUID faulting capability might be not the worst thing.

Then we actually enforce CPUID faulting (finally) on CPUs which support
it, which would be a first step into the right direction simply because
then random library X has to go to the kernel and ask for it explicitly
or just shrug and use whatever the kernel is willing to hand out in
CPUID.

Now take that one step further. When the first part of some user space
application asks for it, then you can register that with the process and
make sane decisions for all other requesters which come after it, which
is an important step into the direction of having a common orchestration
for this.

Sure you can do that via XCR0 as well to some extent, but that CPUID
fault would solve a whole class of other problems which people who care
about feature consistency face today at least to some extent.

And contrary to XCR0, which is orthogonal and obviously still required
for the AMX (and hint AVX512) problem, CPUID faulting would just hand
out the feature bits which the kernel wants to hand out.

If the app, library or whatever still tries to use them, then they get
the #UD, #GP or whatever penalty is associated to that particular XCR0
disabled piece. It's not there, you tried, keep the pieces.

Making it solely depend on XCR0 and fault if not requested upfront is
bringing you into the situation that you broke 'legacy code' which
relied on the CPUID bit and that worked until now which gets you
in the no-regression trap.

I haven't thought this through obviously, but depending solely on XCR0
faults did not really sum up, so I thought I share that evil idea for
broader discussion.
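
For reference, Linux already exposes per-task CPUID faulting through
arch_prctl(ARCH_GET_CPUID) and arch_prctl(ARCH_SET_CPUID) on hardware
that supports it. A minimal user-space sketch follows. The wrapper
names are made up for illustration, and the fallback #defines are
assumed to mirror the values in <asm/prctl.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_GET_CPUID
#define ARCH_GET_CPUID 0x1011
#endif
#ifndef ARCH_SET_CPUID
#define ARCH_SET_CPUID 0x1012
#endif

/* 1: CPUID executes normally, 0: CPUID faults, -1: cannot tell. */
int task_cpuid_enabled(void)
{
#if defined(__linux__) && defined(__x86_64__) && defined(SYS_arch_prctl)
    long ret = syscall(SYS_arch_prctl, ARCH_GET_CPUID, 0UL);
    return ret < 0 ? -1 : (ret != 0);
#else
    return -1;
#endif
}

/*
 * fault != 0: make CPUID fault in this task (delivered as SIGSEGV).
 * fault == 0: let CPUID execute normally again.
 * Returns 0 on success or a negative errno; -ENODEV means the hardware
 * has no CPUID faulting.
 */
int task_set_cpuid_faulting(int fault)
{
#if defined(__linux__) && defined(__x86_64__) && defined(SYS_arch_prctl)
    /* The syscall argument is "CPUID enabled", hence the inversion. */
    if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, fault ? 0UL : 1UL) != 0)
        return -errno;
    return 0;
#else
    return -ENOSYS;
#endif
}
```

What does not exist today is the second half of the idea: a kernel
that answers the trapped CPUID with a filtered feature set. A tracer
or SIGSEGV handler has to emulate the instruction itself.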

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-28  0:53   ` Thomas Gleixner
@ 2021-03-29  7:27     ` Peter Zijlstra
  2021-03-29 15:06     ` Dave Hansen
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2021-03-29  7:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML,
	libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer,
	Linux API

On Sun, Mar 28, 2021 at 01:53:15AM +0100, Thomas Gleixner wrote:
> Though the little devil in my head tells me, that making AMX support
> depend on the CPUID faulting capability might be not the worst thing.
> 
> Then we actually enforce CPUID faulting (finally) on CPUs which support
> it, which would be a first step into the right direction simply because
> then random library X has to go to the kernel and ask for it explicitly
> or just shrug and use whatever the kernel is willing to hand out in
> CPUID.
> 
> Now take that one step further. When the first part of some user space
> application asks for it, then you can register that with the process and
> make sane decisions for all other requesters which come after it, which
> is an important step into the direction of having a common orchestration
> for this.

I wrote something like that at least once...

  https://lore.kernel.org/lkml/20190212164833.GK32494@hirez.programming.kicks-ass.net/

we just need to make sure AMD implements that before it ships a chip
with AVX512 on.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-28  0:53   ` Thomas Gleixner
  2021-03-29  7:27     ` Peter Zijlstra
@ 2021-03-29 15:06     ` Dave Hansen
  1 sibling, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2021-03-29 15:06 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

On 3/27/21 5:53 PM, Thomas Gleixner wrote:
> Making it solely depend on XCR0 and fault if not requested upfront is
> bringing you into the situation that you broke 'legacy code' which
> relied on the CPUID bit and that worked until now which gets you
> in the no-regression trap.

Trying to find the right place to jump into this thread... :)

I don't know what apps do in practice.  But, the enumeration of the
features in the SDM describes three steps:
1. Check for XGETBV support
2. Use XGETBV[0] to check that the OS is aware of the feature and is
   context-switching it
3. Detect the feature itself

So, apps *are* supposed to be checking XCR0 via XGETBV.  If they don't,
they run the risk of a feature being supported by the CPU and the
registers "working" but not being context-switched.

Zeroing out bits in XCR0 will have the effect of telling the app that
the OS isn't context-switching the state.  I think this means that apps
will see the same thing in both situations:
1. If they run an old (say pre-AVX-512) kernel on new AVX-512-enabled
   hardware, or
2. They run a new kernel with this fancy proposed XCR0-switching
   mechanism

I _think_ that gets us off the hook for an ABI break, at least for AVX-512.
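
A minimal sketch of those three steps in C, for AVX-512F specifically.
The bit positions are the documented ones (CPUID.1:ECX.OSXSAVE is bit
27, CPUID.(EAX=7,ECX=0):EBX.AVX512F is bit 16, and XCR0 bits 1, 2, 5,
6 and 7 cover SSE, AVX and the three AVX-512 components). The function
names are made up for illustration and do not come from any library.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define CPUID1_ECX_OSXSAVE (1u << 27) /* OS enabled XSAVE, XGETBV is usable */
#define CPUID7_EBX_AVX512F (1u << 16)
#define XCR0_AVX512_STATES 0xe6ULL    /* SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM */

/* Pure decision logic, so it can be exercised without the hardware. */
int avx512f_usable(uint32_t cpuid1_ecx, uint64_t xcr0, uint32_t cpuid7_ebx)
{
    /* Step 1: is XGETBV available at all? */
    if (!(cpuid1_ecx & CPUID1_ECX_OSXSAVE))
        return 0;
    /* Step 2: does the OS context-switch every state AVX-512 needs? */
    if ((xcr0 & XCR0_AVX512_STATES) != XCR0_AVX512_STATES)
        return 0;
    /* Step 3: does the CPU implement the feature itself? */
    return (cpuid7_ebx & CPUID7_EBX_AVX512F) != 0;
}

/* Same check against the running CPU; always 0 on non-x86 builds. */
int avx512f_usable_here(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx1, edx, ecx7, ebx7 = 0;
    uint32_t lo, hi;
    uint64_t xcr0;

    if (!__get_cpuid(1, &eax, &ebx, &ecx1, &edx) ||
        !(ecx1 & CPUID1_ECX_OSXSAVE))
        return 0;

    /* XGETBV with ECX = 0 reads XCR0 into EDX:EAX. */
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    xcr0 = ((uint64_t)hi << 32) | lo;
    assert(xcr0 & 1); /* the x87 bit is architecturally always set */

    if (!__get_cpuid_count(7, 0, &eax, &ebx7, &ecx7, &edx))
        ebx7 = 0;
    return avx512f_usable(ecx1, xcr0, ebx7);
#else
    return 0;
#endif
}
```

An app written this way falls back at step 2 whenever the AVX-512 bits
are clear in XCR0, whatever the reason they are clear.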

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-27  9:58     ` Greg KH
@ 2021-03-29 15:47       ` Len Brown
  2021-03-29 16:38         ` Len Brown
  2021-03-29 18:16         ` Andy Lutomirski
  0 siblings, 2 replies; 36+ messages in thread
From: Len Brown @ 2021-03-29 15:47 UTC (permalink / raw)
  To: Greg KH
  Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML,
	libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer,
	Linux API

On Sat, Mar 27, 2021 at 5:58 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
> > Hi Andy,
> >
> > Say a mainline links with a math library that uses AMX without the
> > knowledge of the mainline.

sorry for the confusion.

mainline = main().

ie. the part of the program written by you, and not the library you linked with.

In particular, the library may use instructions that main() doesn't know exist.

-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-29 15:47       ` Len Brown
@ 2021-03-29 16:38         ` Len Brown
  2021-03-29 16:48           ` Florian Weimer
  2021-03-29 18:14           ` Andy Lutomirski
  2021-03-29 18:16         ` Andy Lutomirski
  1 sibling, 2 replies; 36+ messages in thread
From: Len Brown @ 2021-03-29 16:38 UTC (permalink / raw)
  To: Greg KH
  Cc: Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML, LKML,
	libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer,
	Linux API

> In particular, the library may use instructions that main() doesn't know exist.

And so I'll ask my question another way.

How is it okay to change the value of XCR0 during the run time of a program?

I submit that it is not, and that is a deal-killer for a request/release API.

eg.  main() doesn't know that the math library wants to use AMX,
and neither does the threading library.  So main() doesn't know to
call the API before either library is invoked.  The threading library starts up
and creates user-space threads based on the initial value from XCR0.
Then the math library calls the API, which adds bits to XCR0,
and then the user-space context switch in the threading library corrupts data
because the new XCR0 size doesn't match the initial size.
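
The failure mode can be made concrete with a short sketch.
CPUID.(EAX=0DH,ECX=0):EBX reports the XSAVE area size for the features
currently enabled in XCR0, so a library that reads it once at startup
holds a stale value as soon as XCR0 grows. The fiber_ctx type and the
helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/*
 * Bytes XSAVE needs for the features enabled in XCR0 right now,
 * or 0 if that cannot be determined (non-x86 build, no leaf 0xD).
 */
uint32_t xsave_size_now(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_count(0x0d, 0, &eax, &ebx, &ecx, &edx))
        return ebx;
#endif
    return 0;
}

/* Hypothetical per-fiber save area of a user-space threading library. */
struct fiber_ctx {
    uint32_t xsave_capacity; /* sized once, from XCR0 at library init */
};

/* 1 if an XSAVE of 'needed' bytes fits into ctx, 0 if it would overrun. */
int fiber_ctx_fits(const struct fiber_ctx *ctx, uint32_t needed)
{
    assert(ctx != NULL);
    return needed <= ctx->xsave_capacity;
}
```

A library that calls fiber_ctx_fits(ctx, xsave_size_now()) before each
save at least fails loudly instead of silently overrunning its buffer.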

-Len

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-29 16:38         ` Len Brown
@ 2021-03-29 16:48           ` Florian Weimer
  2021-03-29 18:14           ` Andy Lutomirski
  1 sibling, 0 replies; 36+ messages in thread
From: Florian Weimer @ 2021-03-29 16:48 UTC (permalink / raw)
  To: Len Brown via Libc-alpha
  Cc: Greg KH, Len Brown, Rich Felker, Linux API, Bae, Chang Seok,
	X86 ML, LKML, Dave Hansen, Kyle Huey, Andy Lutomirski,
	Keno Fischer

* Len Brown via Libc-alpha:

>> In particular, the library may use instructions that main() doesn't know exist.
>
> And so I'll ask my question another way.
>
> How is it okay to change the value of XCR0 during the run time of a
> program?
>
> I submit that it is not, and that is a deal-killer for a
> request/release API.
>
> eg.  main() doesn't know that the math library wants to use AMX, and
> neither does the threading library.  So main() doesn't know to call
> the API before either library is invoked.  The threading library
> starts up and creates user-space threads based on the initial value
> from XCR0.  Then the math library calls the API, which adds bits to
> XCR0, and then the user-space context switch in the threading
> library corrupts data because the new XCR0 size doesn't match the
> initial size.

I agree that this doesn't quite work.  (Today, it's not the thread
library, but the glibc dynamic loader trampoline.)

I disagree that CPU feature enablement has been a failure.  I think we
are pretty good at enabling new CPU features on older operating
systems, not just bleeding edge mainline kernels.  Part of that is
that anything but the kernel stays out of the way, and most features
are available directly via inline assembly (you can even use .byte
hacks if you want).  There is no need to switch to new userspace
libraries, compile out-of-tree kernel drivers that have specific
firmware requirements, and so on.

If the operations that need a huge context can be made idempotent,
with periodic checkpoints, it might be possible to avoid saving the
context completely by some rseq-like construct.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-29 16:38         ` Len Brown
  2021-03-29 16:48           ` Florian Weimer
@ 2021-03-29 18:14           ` Andy Lutomirski
  1 sibling, 0 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-29 18:14 UTC (permalink / raw)
  To: Len Brown
  Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API



> On Mar 29, 2021, at 9:39 AM, Len Brown <lenb@kernel.org> wrote:
> 
> 
>> 
>> In particular, the library may use instructions that main() doesn't know exist.
> 
> And so I'll ask my question another way.
> 
> How is it okay to change the value of XCR0 during the run time of a program?
> 
> I submit that it is not, and that is a deal-killer for a request/release API.
> 
> eg.  main() doesn't know that the math library wants to use AMX,
> and neither does the threading library.  So main() doesn't know to
> call the API before either library is invoked.  The threading library starts up
> and creates user-space threads based on the initial value from XCR0.
> Then the math library calls the API, which adds bits to XCR0,
> and then the user-space context switch in the threading library corrupts data
> because the new XCR0 size doesn't match the initial size.
> 

In the most extreme case, userspace could require that every loaded DSO be tagged with a new ELF note indicating support for dynamic XCR0 before changing XCR0.

I would like to remind everyone that kernel enablement of AVX512 *already* broke old userspace. AMX will further break something. At least with dynamic XCR0 we can make the breakage opt-in.

The ISA could have helped here by allowing the non-compacted XSTATE format to be frozen even in the face of changing XCR0.  But it didn’t.  At the end of the day, we are faced with the fact that XSTATE is a poor design, and we have to make the best of it.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-29 15:47       ` Len Brown
  2021-03-29 16:38         ` Len Brown
@ 2021-03-29 18:16         ` Andy Lutomirski
  2021-03-29 22:38           ` Len Brown
  1 sibling, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-29 18:16 UTC (permalink / raw)
  To: Len Brown
  Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API


> On Mar 29, 2021, at 8:47 AM, Len Brown <lenb@kernel.org> wrote:
> 
> On Sat, Mar 27, 2021 at 5:58 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
>>> Hi Andy,
>>> Say a mainline links with a math library that uses AMX without the
>>> knowledge of the mainline.
> 
> sorry for the confusion.
> 
> mainline = main().
> 
> ie. the part of the program written by you, and not the library you linked with.
> 
> In particular, the library may use instructions that main() doesn't know exist.

If we pretend for a bit that AMX were a separate device instead of a part of the CPU, this would be a no brainer: something would be responsible for opening a device node or otherwise requesting access to the device. 

Real AMX isn’t so different. Programs acquire access either by syscall or by a fault, they use it, and (hopefully) they release it again using TILERELEASE. The only thing special about it is that, supposedly, acquiring and releasing access (at least after the first time) is quite fast.  But holding access is *not* free — despite all your assertions to the contrary, the kernel *will* correctly context switch it to avoid blowing up power consumption, and this will have overhead.

We’ve seen the pattern of programs thinking that, just because something is a CPU insn, it’s free and no thought is needed before using it. This happened with AVX and AVX512, and it will happen again with AMX. We *still* have a significant performance regression in the kernel due to screwing up the AVX state machine, and the only way I know about any of the details is that I wrote silly test programs to try to reverse engineer the nonsensical behavior of the CPUs.

I might believe that Intel has figured out how to make a well behaved XSTATE feature after Intel demonstrates at least once that it’s possible.  That means full documentation of all the weird issues, no new special cases, and the feature actually making sense in the context of XSTATE.  This has not happened.  Let’s list all of them:

- SSE.  Look for all the MXCSR special cases in the pseudocode and tell me with a straight face that this one works sensibly.

- AVX.  Also has special cases in the pseudocode. And has transition issues that are still problems and still not fully documented.

- AVX2.  Horrible undocumented performance issues.  Otherwise maybe okay?

- MPX: maybe the best example, but the compat mode part got flubbed and it’s MPX.

- PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was a major mistake, now unfixable, that seriously limits the usefulness of the whole feature.  I suppose Intel could release PKRU2 with a better ISA and deprecate the original PKRU, but I’m not holding my breath.)

- AVX512: Yet more uarch-dependent horrible performance issues, and Intel has still not responded about documentation.  The web is full of people speculating differently about when, exactly, using AVX512 breaks performance. This is NAKked in kernel until docs arrive. Also, it broke old user programs.  If we had noticed a few years ago, AVX512 enablement would have been reverted.

- AMX: This mess.

The current system of automatic user enablement does not work. We need something better.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-29 18:16         ` Andy Lutomirski
@ 2021-03-29 22:38           ` Len Brown
  2021-03-30  5:08             ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Len Brown @ 2021-03-29 22:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API

On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
> > On Mar 29, 2021, at 8:47 AM, Len Brown <lenb@kernel.org> wrote:
> >
> > On Sat, Mar 27, 2021 at 5:58 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
> >>> Hi Andy,
> >>> Say a mainline links with a math library that uses AMX without the
> >>> knowledge of the mainline.
> >
> > sorry for the confusion.
> >
> > mainline = main().
> >
> > ie. the part of the program written by you, and not the library you linked with.
> >
> > In particular, the library may use instructions that main() doesn't know exist.
>
> If we pretend for a bit that AMX were a separate device instead of a part of the CPU, this would be a no brainer: something would be responsible for opening a device node or otherwise requesting access to the device.
>
> Real AMX isn’t so different. Programs acquire access either by syscall or by a fault, they use it, and (hopefully) they release it again using TILERELEASE. The only thing special about it is that, supposedly, acquiring and releasing access (at least after the first time) is quite fast.  But holding access is *not* free — despite all your assertions to the contrary, the kernel *will* correctly context switch it to avoid blowing up power consumption, and this will have overhead.
>
> We’ve seen the pattern of programs thinking that, just because something is a CPU insn, it’s free and no thought is needed before using it. This happened with AVX and AVX512, and it will happen again with AMX. We *still* have a significant performance regression in the kernel due to screwing up the AVX state machine, and the only way I know about any of the details is that I wrote silly test programs to try to reverse engineer the nonsensical behavior of the CPUs.
>
> I might believe that Intel has figured out how to make a well behaved XSTATE feature after Intel demonstrates at least once that it’s possible.  That means full documentation of all the weird issues, no new special cases, and the feature actually making sense in the context of XSTATE.  This has not happened.  Let’s list all of them:
>
> - SSE.  Look for all the MXCSR special cases in the pseudocode and tell me with a straight face that this one works sensibly.
>
> - AVX.  Also has special cases in the pseudocode. And has transition issues that are still problems and still not fully documented.
>
> - AVX2.  Horrible undocumented performance issues.  Otherwise maybe okay?
>
> - MPX: maybe the best example, but the compat mode part got flubbed and it’s MPX.
>
> - PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was a major mistake, now unfixable, that seriously limits the usefulness of the whole feature.  I suppose Intel could release PKRU2 with a better ISA and deprecate the original PKRU, but I’m not holding my breath.)
>
> - AVX512: Yet more uarch-dependent horrible performance issues, and Intel has still not responded about documentation.  The web is full of people speculating differently about when, exactly, using AVX512 breaks performance. This is NAKked in kernel until docs arrive. Also, it broke old user programs.  If we had noticed a few years ago, AVX512 enablement would have been reverted.
>
> - AMX: This mess.
>
> The current system of automatic user enablement does not work. We need something better.

Hi Andy,

Can you provide a concise definition of the exact problem(s) this thread
is attempting to address?

Thanks ahead of time for excluding "blow up power consumption",
since that paranoia is not grounded in fact.

thanks,
-Len

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-29 22:38           ` Len Brown
@ 2021-03-30  5:08             ` Andy Lutomirski
  2021-03-30  5:50               ` Noah Goldstein
  2021-03-30 17:01               ` Len Brown
  0 siblings, 2 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-30  5:08 UTC (permalink / raw)
  To: Len Brown
  Cc: Greg KH, Andy Lutomirski, Bae, Chang Seok, Dave Hansen, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API

On Mon, Mar 29, 2021 at 3:38 PM Len Brown <lenb@kernel.org> wrote:
>
> On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >

> Hi Andy,
>
> Can you provide a concise definition of the exact problem(s) this thread
> is attempting to address?

The AVX-512 state, all by itself, is more than 2048 bytes.  Quoting
the POSIX sigaltstack page (man 3p sigaltstack):

       The  value  SIGSTKSZ is a system default specifying the number of bytes
       that would be used to cover the usual case when manually allocating  an
       alternate  stack area. The value MINSIGSTKSZ is defined to be the mini‐
       mum stack size for a signal handler. In computing  an  alternate  stack
       size, a program should add that amount to its stack requirements to al‐
       low for the system implementation overhead. The  constants  SS_ONSTACK,
       SS_DISABLE, SIGSTKSZ, and MINSIGSTKSZ are defined in <signal.h>.

arch/x86/include/uapi/asm/signal.h:#define MINSIGSTKSZ    2048
arch/x86/include/uapi/asm/signal.h:#define SIGSTKSZ    8192

Regrettably, the Linux signal frame format is the uncompacted format
and, also regrettably, the uncompacted format has the nasty property
that its format depends on XCR0 but not on the set of registers that
are actually used or wanted, so, with the current ABI, the signal
frame is stuck being quite large for all programs on a machine that
supports avx512 and has it enabled by the kernel.  And it's even
larger for AMX and violates SIGSTKSZ as well as MINSIGSTKSZ.

There are apparently real programs that break as a result.  We need to
find a way to handle new, large extended states without breaking user
ABI.  We should also find a way to handle them without consuming silly
amounts of stack space for programs that don't use them.
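
To make the size problem concrete, here is a minimal C sketch of how a
program might size an alternate signal stack at run time instead of
trusting the compile-time constants.  It assumes a kernel and libc that
export the AT_MINSIGSTKSZ auxiliary-vector entry (getauxval() returns 0
when the entry is absent, in which case the sketch falls back to
MINSIGSTKSZ).  The helper names altstack_size() and install_altstack()
are made up for illustration and are not part of any proposal here.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51	/* auxv key; older headers may lack it */
#endif

/* Hypothetical helper: frame size reported by the kernel (if any),
 * never less than MINSIGSTKSZ, plus what the handler itself needs. */
static size_t altstack_size(size_t handler_needs)
{
	size_t frame = (size_t)getauxval(AT_MINSIGSTKSZ);

	if (frame < (size_t)MINSIGSTKSZ)
		frame = (size_t)MINSIGSTKSZ;
	return frame + handler_needs;
}

/* Hypothetical helper: allocate and register the alternate stack.
 * Returns 0 on success, -1 on failure. */
static int install_altstack(size_t handler_needs)
{
	stack_t ss = {0};

	ss.ss_size = altstack_size(handler_needs);
	ss.ss_sp = malloc(ss.ss_size);
	if (!ss.ss_sp)
		return -1;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) != 0) {
		free(ss.ss_sp);
		return -1;
	}
	return 0;
}
```

A program that hard-codes a MINSIGSTKSZ-sized buffer gets none of this
protection, which is the case the paragraphs above are worried about.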

Sadly, if the solution we settle on involves context switching XCR0,
performance on first-generation hardware will suffer because VMX does
not have any way to allow guests to write XCR0 without exiting.  I
don't consider this to be a showstopper -- if we end up having this
problem, fixing it in subsequent CPUs is straightforward.

>
> Thanks ahead of time for excluding "blow up power consumption",
> since that paranoia is not grounded in fact.
>

I will gladly exclude power consumption from this discussion, since
that's a separate issue that has nothing to do with the user<->kernel
ABI.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30  5:08             ` Andy Lutomirski
@ 2021-03-30  5:50               ` Noah Goldstein
  2021-03-30 17:01               ` Len Brown
  1 sibling, 0 replies; 36+ messages in thread
From: Noah Goldstein @ 2021-03-30  5:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Len Brown, Florian Weimer, Rich Felker, libc-alpha, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, Dave Hansen, Kyle Huey, Linux API,
	Keno Fischer

Forgive me if this is silly, but would it be possible to do something
similar to rseq, where the user can register a set of features for a
program counter region and then, on interrupt, the kernel checks that
to determine what needs to be saved?

For example, if a user doesn't use any AMX but loads a library that
does, AMX state won't be saved for any ip in the user's code, but an
interrupt in the ip range of the library will save AMX state.

One advantage of this is that it would be pretty easy to do silently
and correctly with compiler support, and to preserve old code the "ip
not found in table" case could default to the worst case the CPU supports.
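
A rough sketch of the lookup this would imply, in C.  Everything here is
hypothetical: the struct layout, the FEAT_* bits and the function name
are invented for illustration, and a real interface would need to be
designed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; not real XCR0 bit positions. */
#define FEAT_AVX512	(1u << 0)
#define FEAT_AMX	(1u << 1)
#define FEAT_ALL	(FEAT_AVX512 | FEAT_AMX)	/* worst case */

/* One registered program-counter region, [start, end). */
struct ip_region {
	uintptr_t start;
	uintptr_t end;
	uint32_t features;	/* state to save if interrupted here */
};

/* Return the features to save for an interrupt at ip.  An ip that is
 * not in the table falls back to the worst case, so old code that
 * registers nothing keeps today's behavior. */
static uint32_t features_for_ip(const struct ip_region *tab, size_t n,
				uintptr_t ip)
{
	for (size_t i = 0; i < n; i++) {
		if (ip >= tab[i].start && ip < tab[i].end)
			return tab[i].features;
	}
	return FEAT_ALL;
}
```

With the user's own code registered with no features and the library's
range registered with FEAT_AMX, only interrupts inside the library would
pay for saving AMX state.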

On Tue, Mar 30, 2021 at 1:09 AM Andy Lutomirski via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Mon, Mar 29, 2021 at 3:38 PM Len Brown <lenb@kernel.org> wrote:
> >
> > On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > >
>
> > Hi Andy,
> >
> > Can you provide a concise definition of the exact problem(s) this thread
> > is attempting to address?
>
> The AVX-512 state, all by itself, is more than 2048 bytes.  Quoting
> the POSIX sigaltstack page (man 3p sigaltstack):
>
>        The  value  SIGSTKSZ is a system default specifying the number of bytes
>        that would be used to cover the usual case when manually allocating  an
>        alternate  stack area. The value MINSIGSTKSZ is defined to be the mini‐
>        mum stack size for a signal handler. In computing  an  alternate  stack
>        size, a program should add that amount to its stack requirements to al‐
>        low for the system implementation overhead. The  constants  SS_ONSTACK,
>        SS_DISABLE, SIGSTKSZ, and MINSIGSTKSZ are defined in <signal.h>.
>
> arch/x86/include/uapi/asm/signal.h:#define MINSIGSTKSZ    2048
> arch/x86/include/uapi/asm/signal.h:#define SIGSTKSZ    8192
>
> Regrettably, the Linux signal frame format is the uncompacted format
> and, also regrettably, the uncompacted format has the nasty property
> that its format depends on XCR0 but not on the set of registers that
> are actually used or wanted, so, with the current ABI, the signal
> frame is stuck being quite large for all programs on a machine that
> supports avx512 and has it enabled by the kernel.  And it's even
> larger for AMX and violates SIGSTKSZ as well as MINSIGSTKSZ.
>
> There are apparently real programs that break as a result.  We need to
> find a way to handle new, large extended states without breaking user
> ABI.  We should also find a way to handle them without consuming silly
> amounts of stack space for programs that don't use them.
>
> Sadly, if the solution we settle on involves context switching XCR0,
> performance on first-generation hardware will suffer because VMX does
> not have any way to allow guests to write XCR0 without exiting.  I
> don't consider this to be a showstopper -- if we end up having this
> problem, fixing it in subsequent CPUs is straightforward.
>
> >
> > Thank ahead-of-time for excluding "blow up power consumption",
> > since that paranoia is not grounded in fact.
> >
>
> I will gladly exclude power consumption from this discussion, since
> that's a separate issue that has nothing to do with the user<->kernel
> ABI.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30  5:08             ` Andy Lutomirski
  2021-03-30  5:50               ` Noah Goldstein
@ 2021-03-30 17:01               ` Len Brown
  2021-03-30 17:05                 ` Andy Lutomirski
  1 sibling, 1 reply; 36+ messages in thread
From: Len Brown @ 2021-03-30 17:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg KH, Bae, Chang Seok, Dave Hansen, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

Andy,

I agree, completely, with your description of the challenge,
thank you for focusing the discussion on that problem statement.

Question:

Is it required (by the "ABI") that a user program has everything
on the stack for user-space XSAVE/XRSTOR to get back
to the state of the program just before receiving the signal?

My understanding is that there are programs that do this.
However, if it is not guaranteed to work, that could greatly simplify
what we are required to put on the signal stack.

thanks,
-Len

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30 17:01               ` Len Brown
@ 2021-03-30 17:05                 ` Andy Lutomirski
  2021-03-30 17:56                   ` Len Brown
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-30 17:05 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, Greg KH, Bae, Chang Seok, Dave Hansen, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API



> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote:
> 
> Andy,
> 
> I agree, completely, with your description of the challenge,
> thank you for focusing the discussion on that problem statement.
> 
> Question:
> 
> Is it required (by the "ABI") that a user program has everything
> on the stack for user-space XSAVE/XRESTOR to get back
> to the state of the program just before receiving the signal?

The current Linux signal frame format has XSTATE in uncompacted format, so everything has to be there. Maybe we could have an opt in new signal frame format, but the details would need to be worked out.

It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code, and return, without corrupting register contents.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30 17:05                 ` Andy Lutomirski
@ 2021-03-30 17:56                   ` Len Brown
  2021-03-30 19:12                     ` Dave Hansen
  0 siblings, 1 reply; 36+ messages in thread
From: Len Brown @ 2021-03-30 17:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Greg KH, Bae, Chang Seok, Dave Hansen, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API

On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote:

> > On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote:

> > Is it required (by the "ABI") that a user program has everything
> > on the stack for user-space XSAVE/XRESTOR to get back
> > to the state of the program just before receiving the signal?
>
> The current Linux signal frame format has XSTATE in uncompacted format,
> so everything has to be there.
> Maybe we could have an opt in new signal frame format, but the details would need to be worked out.
>
> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code,
> and return, without corrupting register contents.

And so, an acknowledgement:

We can't change the legacy signal stack format without breaking
existing programs.  The legacy is uncompacted XSTATE.  It is a
complete set of architectural state -- everything necessary to
XRSTOR.  Further, the sigreturn flow allows the signal handler to
*change* any of that state, so that it becomes active upon return from
signal.

And a proposal:

Future programs, which know that they don't need the full-blown legacy
signal stack format, can opt in to a new format.  That new format can
be minimal (fast) by default.  Perhaps, as Noah suggests, it could
have some sort of mechanism where the program can explicitly select
which state components it wants included on its signal stack,
and restored by sigreturn.

If the new fast-signal format is successful, then in a number of years
it will have taken over the world.

thoughts?

Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30 17:56                   ` Len Brown
@ 2021-03-30 19:12                     ` Dave Hansen
  2021-03-30 20:20                       ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Hansen @ 2021-03-30 19:12 UTC (permalink / raw)
  To: Len Brown, Andy Lutomirski
  Cc: Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML, LKML,
	libc-alpha, Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer,
	Linux API

On 3/30/21 10:56 AM, Len Brown wrote:
> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote:
>>> Is it required (by the "ABI") that a user program has everything
>>> on the stack for user-space XSAVE/XRESTOR to get back
>>> to the state of the program just before receiving the signal?
>> The current Linux signal frame format has XSTATE in uncompacted format,
>> so everything has to be there.
>> Maybe we could have an opt in new signal frame format, but the details would need to be worked out.
>>
>> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code,
>> and return, without corrupting register contents.
> And so an an acknowledgement:
> 
> We can't change the legacy signal stack format without breaking
> existing programs.  The legacy is uncompressed XSTATE.  It is a
> complete set of architectural state -- everything necessary to
> XRESTOR.  Further, the sigreturn flow allows the signal handler to
> *change* any of that state, so that it becomes active upon return from
> signal.

One nit with this: XRSTOR itself can work with the compacted format or
uncompacted format.  Unlike the XSAVE/XSAVEC side where compaction is
explicit from the instruction itself, XRSTOR changes its behavior by
reading XCOMP_BV.  There's no XRSTORC.

The issue with using the compacted format is when legacy software in the
signal handler needs to go access the state.  *That* is what can't
handle a change in the XSAVE buffer format (either optimized/XSAVEOPT,
or compacted/XSAVEC).
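
As a small illustration of the XCOMP_BV point, here is a C sketch of how
software handed an XSAVE image could tell which layout it is looking at.
It relies on the architectural layout of the XSAVE header (64 bytes at
offset 512, XSTATE_BV followed by XCOMP_BV, with bit 63 of XCOMP_BV
marking the compacted format).  The struct and function names are made
up for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define XSAVE_HDR_OFFSET	512		/* after the legacy FXSAVE area */
#define XCOMP_BV_COMPACTED	(1ULL << 63)	/* set => compacted format */

/* Illustrative view of the 64-byte XSAVE header. */
struct xsave_hdr {
	uint64_t xstate_bv;	/* components present in the image */
	uint64_t xcomp_bv;	/* format bit plus component bitmap */
	uint64_t reserved[6];
};

/* Returns true if the image at xsave_buf uses the compacted format.
 * The buffer must be at least 576 bytes (legacy area plus header). */
static bool xsave_is_compacted(const void *xsave_buf)
{
	struct xsave_hdr hdr;

	memcpy(&hdr, (const unsigned char *)xsave_buf + XSAVE_HDR_OFFSET,
	       sizeof(hdr));
	return (hdr.xcomp_bv & XCOMP_BV_COMPACTED) != 0;
}
```

Legacy handlers never make this check; they assume fixed uncompacted
offsets, which is why changing the frame format under them is the
problem.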

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30 19:12                     ` Dave Hansen
@ 2021-03-30 20:20                       ` Andy Lutomirski
  2021-03-30 20:42                         ` Len Brown
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-30 20:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Len Brown, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API


> On Mar 30, 2021, at 12:12 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/30/21 10:56 AM, Len Brown wrote:
>> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote:
>>>> Is it required (by the "ABI") that a user program has everything
>>>> on the stack for user-space XSAVE/XRESTOR to get back
>>>> to the state of the program just before receiving the signal?
>>> The current Linux signal frame format has XSTATE in uncompacted format,
>>> so everything has to be there.
>>> Maybe we could have an opt in new signal frame format, but the details would need to be worked out.
>>> 
>>> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code,
>>> and return, without corrupting register contents.
>> And so an an acknowledgement:
>> 
>> We can't change the legacy signal stack format without breaking
>> existing programs.  The legacy is uncompressed XSTATE.  It is a
>> complete set of architectural state -- everything necessary to
>> XRESTOR.  Further, the sigreturn flow allows the signal handler to
>> *change* any of that state, so that it becomes active upon return from
>> signal.
> 
> One nit with this: XRSTOR itself can work with the compacted format or
> uncompacted format.  Unlike the XSAVE/XSAVEC side where compaction is
> explicit from the instruction itself, XRSTOR changes its behavior by
> reading XCOMP_BV.  There's no XRSTORC.
> 
> The issue with using the compacted format is when legacy software in the
> signal handler needs to go access the state.  *That* is what can't
> handle a change in the XSAVE buffer format (either optimized/XSAVEOPT,
> or compacted/XSAVEC).

The compacted format isn’t compact enough anyway. If we want to keep AMX and AVX512 enabled in XCR0 then we need to further muck with the format to omit the not-in-use features. I *think* we can pull this off in a way that still does the right thing wrt XRSTOR.

If we go this route, I think we want a way for sigreturn to understand a pointer to the state instead of inline state to allow programs to change the state.  Or maybe just to have a way to ask sigreturn to skip the restore entirely.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30 20:20                       ` Andy Lutomirski
@ 2021-03-30 20:42                         ` Len Brown
  2021-03-30 22:01                           ` David Laight
  0 siblings, 1 reply; 36+ messages in thread
From: Len Brown @ 2021-03-30 20:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API

On Tue, Mar 30, 2021 at 4:20 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
>
> > On Mar 30, 2021, at 12:12 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 3/30/21 10:56 AM, Len Brown wrote:
> >> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >>>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote:
> >>>> Is it required (by the "ABI") that a user program has everything
> >>>> on the stack for user-space XSAVE/XRESTOR to get back
> >>>> to the state of the program just before receiving the signal?
> >>> The current Linux signal frame format has XSTATE in uncompacted format,
> >>> so everything has to be there.
> >>> Maybe we could have an opt in new signal frame format, but the details would need to be worked out.
> >>>
> >>> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code,
> >>> and return, without corrupting register contents.
> >> And so an an acknowledgement:
> >>
> >> We can't change the legacy signal stack format without breaking
> >> existing programs.  The legacy is uncompressed XSTATE.  It is a
> >> complete set of architectural state -- everything necessary to
> >> XRESTOR.  Further, the sigreturn flow allows the signal handler to
> >> *change* any of that state, so that it becomes active upon return from
> >> signal.
> >
> > One nit with this: XRSTOR itself can work with the compacted format or
> > uncompacted format.  Unlike the XSAVE/XSAVEC side where compaction is
> > explicit from the instruction itself, XRSTOR changes its behavior by
> > reading XCOMP_BV.  There's no XRSTORC.
> >
> > The issue with using the compacted format is when legacy software in the
> > signal handler needs to go access the state.  *That* is what can't
> > handle a change in the XSAVE buffer format (either optimized/XSAVEOPT,
> > or compacted/XSAVEC).
>
> The compacted format isn’t compact enough anyway. If we want to keep AMX and AVX512 enabled in XCR0 then we need to further muck with the format to omit the not-in-use features. I *think* we can pull this off in a way that still does the right thing wrt XRSTOR.

Agreed.  Compacted format doesn't save any space when INIT=0, so it is
only a half-step forward.
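
A back-of-the-envelope sketch in C of why that is.  The component sizes
below are the commonly enumerated ones (CPUID leaf 0xD is the real
authority, so treat the table as an assumption); small components such
as PKRU are left out, and the names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define XSAVE_LEGACY_AND_HEADER	576	/* 512-byte legacy area + 64-byte header */

struct xcomp {
	unsigned int bit;	/* XCR0 / XCOMP_BV bit number */
	size_t size;		/* bytes of state (assumed, see lead-in) */
};

static const struct xcomp xcomps[] = {
	{  2,  256 },	/* AVX: upper halves of YMM0-15 */
	{  5,   64 },	/* AVX-512 opmask k0-k7 */
	{  6,  512 },	/* AVX-512 ZMM_Hi256 */
	{  7, 1024 },	/* AVX-512 Hi16_ZMM */
	{ 17,   64 },	/* AMX TILECFG */
	{ 18, 8192 },	/* AMX TILEDATA */
};

/* Size of a compacted image holding the components in mask, packed
 * back to back after the legacy area and header. */
static size_t compacted_size(uint64_t mask)
{
	size_t total = XSAVE_LEGACY_AND_HEADER;

	for (size_t i = 0; i < sizeof(xcomps) / sizeof(xcomps[0]); i++) {
		if (mask & (1ULL << xcomps[i].bit))
			total += xcomps[i].size;
	}
	return total;
}
```

With only AVX present the image is 832 bytes, but once the AVX-512 and
AMX components all hold live data it is 10688 bytes under these assumed
sizes, still far beyond SIGSTKSZ.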

> If we go this route, I think we want a way for sigreturn to understand a pointer to the state instead of inline state to allow programs to change the state.  Or maybe just to have a way to ask sigreturn to skip the restore entirely.

The legacy approach puts all architectural state on the signal stack
in XSTATE format.

If we make the signal stack smaller with a new fast-signal scheme, we
need to find another place for that state to live.

It can't live in the task context switch buffer.  If we put it there
and then take an interrupt while running the signal handler, then we'd
overwrite the signaled thread's state with the signal handler's state.

Can we leave it in live registers?  That would be the speed-of-light
signal handler approach.  But we'd need to teach the signal handler to
not clobber it.  Perhaps that could be part of the contract that a
fast signal handler signs?  INIT=0 AMX state could simply sit
patiently in the AMX registers for the duration of the signal handler.
You can't get any faster than doing nothing :-)

Of course part of the contract for the fast signal handler is that it
knows that it can't necessarily use XRSTOR on the stuff on the stack to
get back to the state of the signaled thread.  (Assuming we even used
XSTATE format on the fast signal handler stack, it would forget the
contents of the AMX registers, in this example.)

Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30 20:42                         ` Len Brown
@ 2021-03-30 22:01                           ` David Laight
  2021-03-31 16:31                             ` Len Brown
  0 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2021-03-30 22:01 UTC (permalink / raw)
  To: 'Len Brown', Andy Lutomirski
  Cc: Dave Hansen, Andy Lutomirski, Greg KH, Bae, Chang Seok, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API

From: Len Brown
> Sent: 30 March 2021 21:42
> 
> On Tue, Mar 30, 2021 at 4:20 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >
> > > On Mar 30, 2021, at 12:12 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 3/30/21 10:56 AM, Len Brown wrote:
> > >> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > >>>> On Mar 30, 2021, at 10:01 AM, Len Brown <lenb@kernel.org> wrote:
> > >>>> Is it required (by the "ABI") that a user program has everything
> > >>>> on the stack for user-space XSAVE/XRESTOR to get back
> > >>>> to the state of the program just before receiving the signal?
> > >>> The current Linux signal frame format has XSTATE in uncompacted format,
> > >>> so everything has to be there.
> > >>> Maybe we could have an opt in new signal frame format, but the details would need to be worked out.
> > >>>
> > >>> It is certainly the case that a signal should be able to be delivered, run “async-signal-safe” code,
> > >>> and return, without corrupting register contents.
> > >> And so an an acknowledgement:
> > >>
> > >> We can't change the legacy signal stack format without breaking
> > >> existing programs.  The legacy is uncompressed XSTATE.  It is a
> > >> complete set of architectural state -- everything necessary to
> > >> XRESTOR.  Further, the sigreturn flow allows the signal handler to
> > >> *change* any of that state, so that it becomes active upon return from
> > >> signal.
> > >
> > > One nit with this: XRSTOR itself can work with the compacted format or
> > > uncompacted format.  Unlike the XSAVE/XSAVEC side where compaction is
> > > explicit from the instruction itself, XRSTOR changes its behavior by
> > > reading XCOMP_BV.  There's no XRSTORC.
> > >
> > > The issue with using the compacted format is when legacy software in the
> > > signal handler needs to go access the state.  *That* is what can't
> > > handle a change in the XSAVE buffer format (either optimized/XSAVEOPT,
> > > or compacted/XSAVEC).
> >
> > The compacted format isn’t compact enough anyway. If we want to keep AMX and AVX512 enabled in XCR0 then we need to further muck with the format to omit the not-in-use features. I *think* we can pull this off in a way that still does the right thing wrt XRSTOR.
> 
> Agreed.  Compacted format doesn't save any space when INIT=0, so it is
> only a half-step forward.
> 
> > If we go this route, I think we want a way for sigreturn to understand a pointer to the state instead of inline state to allow programs to change the state.  Or maybe just to have a way to ask sigreturn to skip the restore entirely.
> 
> The legacy approach puts all architectural state on the signal stack
> in XSTATE format.
> 
> If we make the signal stack smaller with a new fast-signal scheme, we
> need to find another place for that state to live.
> 
> It can't live in the task context switch buffer.  If we put it there
> and then take an interrupt while running the signal handler, then we'd
> overwrite the signaled thread's state with the signal handler's state.
> 
> Can we leave it in live registers?  That would be the speed-of-light
> signal handler approach.  But we'd need to teach the signal handler to
> not clobber it.  Perhaps that could be part of the contract that a
> fast signal handler signs?  INIT=0 AMX state could simply sit
> patiently in the AMX registers for the duration of the signal handler.
> You can't get any faster than doing nothing :-)
> 
> Of course part of the contract for the fast signal handler is that it
> knows that it can't possibly use XRESTOR of the stuff on the stack to
> necessarily get back to the state of the signaled thread (assuming we
> even used XSTATE format on the fast signal handler stack, it would
> forget the contents of the AMX registers, in this example)

gcc will just use the AVX registers for 'normal' code within
the signal handler.
So it has to have its own copy of all the registers.
(Well, maybe you could make the AMX instructions fault,
but that would need a nested signal delivered.)

There is also the register save buffer that you need in order
to long-jump out of a signal handler.
Unfortunately that is required to work.
I'm pretty sure the original setjmp/longjmp just saved the stack
pointer - but that really doesn't work any more.
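
For reference, the pattern in question looks roughly like this in C.
This is a minimal sketch using the standard sigsetjmp()/siglongjmp()
pair; the handler and function names are made up, and SIGUSR1 is just a
convenient signal to raise.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recover_point;

/* Handler that never returns normally: it jumps back to the
 * sigsetjmp() call site, so sigreturn is never executed. */
static void on_sigusr1(int signo)
{
	(void)signo;
	siglongjmp(recover_point, 1);
}

/* Returns 1 if control came back via the handler's siglongjmp(),
 * 0 if the signal was somehow not delivered, -1 on setup failure. */
static int jump_out_of_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigusr1;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	/* Second argument 1: also save and restore the signal mask. */
	if (sigsetjmp(recover_point, 1) == 0) {
		raise(SIGUSR1);
		return 0;
	}
	return 1;
}
```

Because the handler exits without sigreturn, whatever the kernel put in
the signal frame is simply abandoned; the registers that survive are the
ones recorded in the jump buffer.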

OTOH most signal handlers don't care - but there isn't a flag
to sigset() (etc.) to ask for a specific register layout.

I did have 'fun' changing the x86 segment registers so that
the 'return to user' faulted in kernel during the last bit
of the 'return to user' path - and then fixing the fallout.

	David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-30 22:01                           ` David Laight
@ 2021-03-31 16:31                             ` Len Brown
  2021-03-31 16:53                               ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Len Brown @ 2021-03-31 16:31 UTC (permalink / raw)
  To: David Laight
  Cc: Andy Lutomirski, Dave Hansen, Andy Lutomirski, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Tue, Mar 30, 2021 at 6:01 PM David Laight <David.Laight@aculab.com> wrote:

> > Can we leave it in live registers?  That would be the speed-of-light
> > signal handler approach.  But we'd need to teach the signal handler to
> > not clobber it.  Perhaps that could be part of the contract that a
> > fast signal handler signs?  INIT=0 AMX state could simply sit
> > patiently in the AMX registers for the duration of the signal handler.
> > You can't get any faster than doing nothing :-)
> >
> > Of course part of the contract for the fast signal handler is that it
> > knows that it can't possibly use XRESTOR of the stuff on the stack to
> > necessarily get back to the state of the signaled thread (assuming we
> > even used XSTATE format on the fast signal handler stack, it would
> > forget the contents of the AMX registers, in this example)
>
> gcc will just use the AVX registers for 'normal' code within
> the signal handler.
> So it has to have its own copy of all the registers.
> (Well, maybe you could make the TMX instructions fault,
> but that would need a nested signal delivered.)

This is true, by default, but it doesn't have to be true.

Today, gcc has an annotation for user-level interrupts
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes

An analogous annotation could be created for fast signals.
gcc can be told exactly what registers and instructions it can use for
that routine.

Of course, this raises the question of what routines that handler calls;
those would need to be constrained too.

Today signal-safety(7) advises programmers to limit what legacy signal handlers
can call.  There is no reason that a fast-signal-safety(7) could not be created
for the fast path.

> There is also the register save buffer that you need in order
> to long-jump out of a signal handler.
> Unfortunately that is required to work.
> I'm pretty sure the original setjmp/longjmp just saved the stack
> pointer - but that really doesn't work any more.
>
> OTOH most signal handlers don't care - but there isn't a flag
> to sigset() (etc) so ask for a specific register layout.

Right, the idea is to optimize for *most* signal handlers,
since making any changes to *all* signal handlers is intractable.

So the idea is that opting in to a fast signal handler would opt out
of some legacy signal capabilities.  Complete state is one of them,
and thus long-jump is not supported, because the complete state
may not automatically be available.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 16:31                             ` Len Brown
@ 2021-03-31 16:53                               ` Andy Lutomirski
  2021-03-31 21:42                                 ` Robert O'Callahan
  2021-03-31 22:28                                 ` Len Brown
  0 siblings, 2 replies; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-31 16:53 UTC (permalink / raw)
  To: Len Brown
  Cc: David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API


> On Mar 31, 2021, at 9:31 AM, Len Brown <lenb@kernel.org> wrote:
> 
> On Tue, Mar 30, 2021 at 6:01 PM David Laight <David.Laight@aculab.com> wrote:
> 
>>> Can we leave it in live registers?  That would be the speed-of-light
>>> signal handler approach.  But we'd need to teach the signal handler to
>>> not clobber it.  Perhaps that could be part of the contract that a
>>> fast signal handler signs?  INIT=0 AMX state could simply sit
>>> patiently in the AMX registers for the duration of the signal handler.
>>> You can't get any faster than doing nothing :-)
>>> 
>>> Of course part of the contract for the fast signal handler is that it
>>> knows that it can't possibly use XRESTOR of the stuff on the stack to
>>> necessarily get back to the state of the signaled thread (assuming we
>>> even used XSTATE format on the fast signal handler stack, it would
>>> forget the contents of the AMX registers, in this example)
>> 
>> gcc will just use the AVX registers for 'normal' code within
>> the signal handler.
>> So it has to have its own copy of all the registers.
>> (Well, maybe you could make the TMX instructions fault,
>> but that would need a nested signal delivered.)
> 
> This is true, by default, but it doesn't have to be true.
> 
> Today, gcc has an annotation for user-level interrupts
> https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
> 
> An analogous annotation could be created for fast signals.
> gcc can be told exactly what registers and instructions it can use for
> that routine.
> 
> Of course, this begs the question about what routines that handler calls,
> and that would need to be constrained too.
> 
> Today signal-safety(7) advises programmers to limit what legacy signal handlers
> can call.  There is no reason that a fast-signal-safety(7) could not be created
> for the fast path.
> 
>> There is also the register save buffer that you need in order
>> to long-jump out of a signal handler.
>> Unfortunately that is required to work.
>> I'm pretty sure the original setjmp/longjmp just saved the stack
>> pointer - but that really doesn't work any more.
>> 
>> OTOH most signal handlers don't care - but there isn't a flag
>> to sigset() (etc) so ask for a specific register layout.
> 
> Right, the idea is to optimize for *most* signal handlers,
> since making any changes to *all* signal handlers is intractable.
> 
> So the idea is that opting-in to a fast signal handler would opt-out
> of some legacy signal capibilities.  Complete state is one of them,
> and thus long-jump is not supported, because the complete state
> may not automatically be available.

Long jump is probably the easiest problem of all: sigsetjmp() is a *function*, following ABI, so sigsetjmp() is expected to clobber most or all of the extended state.

But this whole annotation thing will require serious compiler support. We already have problems with compilers inlining functions and getting confused about attributes.

An API like:

if (get_amx()) {
 use AMX;
} else {
 don’t;
}

Avoids this problem. And making XCR0 dynamic, for all its faults, at least helps force a degree of discipline on user code.
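
Spelled out a little further as compilable C, purely as a sketch:
get_amx() is a hypothetical call that does not exist in any kernel or
libc, and it is stubbed here to always refuse, so only the fallback path
actually runs.  The dot-product routine is just a stand-in for whatever
work would use AMX.

```c
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL: stands in for a request to enable AMX for this task.
 * Stubbed to always fail; no such interface exists today. */
static bool get_amx(void)
{
	return false;
}

/* Plain scalar fallback that needs no extended state. */
static long dot_scalar(const int *a, const int *b, size_t n)
{
	long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += (long)a[i] * b[i];
	return sum;
}

/* The caller asks for AMX explicitly and branches on the answer, so the
 * decision is made at run time rather than by a compiler annotation. */
static long dot(const int *a, const int *b, size_t n)
{
	if (get_amx()) {
		/* An AMX kernel would go here; the sketch reuses the
		 * scalar code so that it stays runnable everywhere. */
		return dot_scalar(a, b, n);
	}
	return dot_scalar(a, b, n);
}
```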


> 
> thanks,
> Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 16:53                               ` Andy Lutomirski
@ 2021-03-31 21:42                                 ` Robert O'Callahan
  2021-03-31 22:11                                   ` Len Brown
  2021-03-31 22:28                                 ` Len Brown
  1 sibling, 1 reply; 36+ messages in thread
From: Robert O'Callahan @ 2021-03-31 21:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Len Brown, David Laight, Dave Hansen, Andy Lutomirski, Greg KH,
	Bae, Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

For the record, the benefits of dynamic XCR0 for rr recording
portability still apply. I guess it'd be useful for CRIU too. We would
also benefit from anything that incentivizes increased support for
CPUID faulting.

Rob
--
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 21:42                                 ` Robert O'Callahan
@ 2021-03-31 22:11                                   ` Len Brown
  0 siblings, 0 replies; 36+ messages in thread
From: Len Brown @ 2021-03-31 22:11 UTC (permalink / raw)
  To: robert
  Cc: Andy Lutomirski, David Laight, Dave Hansen, Andy Lutomirski,
	Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Wed, Mar 31, 2021 at 5:42 PM Robert O'Callahan <robert@ocallahan.org> wrote:
>
> For the record, the benefits of dynamic XCR0 for rr recording
> portability still apply. I guess it'd be useful for CRIU too. We would
> also benefit from anything that incentivizes increased support for
> CPUID faulting.

As previously mentioned, today we don't have an architectural way to
trap a user into the kernel on CPUID,
even though we can do this for a VMM.

But spoofing CPUID isn't a solution to all problems.
The feature really needs to be OFF to prevent users from using it,
even if the supported mechanisms of discovering that feature say "NOT PRESENT".

Today there are plenty of users who will opportunistically try everything
in the cloud and choose the machine that allows them to do something
that other machines will not -- even if it is not officially supported.

If something is not enumerated, it really needs to also be turned off.

cheers,
--Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 16:53                               ` Andy Lutomirski
  2021-03-31 21:42                                 ` Robert O'Callahan
@ 2021-03-31 22:28                                 ` Len Brown
  2021-03-31 22:45                                   ` Andy Lutomirski
  2021-03-31 22:52                                   ` Borislav Petkov
  1 sibling, 2 replies; 36+ messages in thread
From: Len Brown @ 2021-03-31 22:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Wed, Mar 31, 2021 at 12:53 PM Andy Lutomirski <luto@amacapital.net> wrote:

> But this whole annotation thing will require serious compiler support.
> We already have problems with compilers inlining functions and getting confused about attributes.

We added compiler annotation for user-level interrupt handlers.
I'm not aware of it failing, or otherwise being confused.

Why would compiler support for fast-signals be any more "serious"?

> An API like:
>
> if (get_amx()) {
>  use AMX;
> } else {
>  don’t;
> }
>
> Avoids this problem. And making XCR0 dynamic, for all its faults, at least helps force a degree of discipline on user code.

dynamic XCR0 breaks the installed base, I thought we had established that.

We've also established that when running in a VMM, every update to
XCR0 causes a VMEXIT.

I thought the goal was to allow new programs to have fast signal handlers.
By default, those fast signal handlers would have a stable state image,
and would not inherit large architectural state on their stacks, and
could thus have minimal overhead on all hardware.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 22:28                                 ` Len Brown
@ 2021-03-31 22:45                                   ` Andy Lutomirski
  2021-04-09 20:52                                     ` Len Brown
  2021-03-31 22:52                                   ` Borislav Petkov
  1 sibling, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-03-31 22:45 UTC (permalink / raw)
  To: Len Brown
  Cc: David Laight, Dave Hansen, Andy Lutomirski, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote:
>
> On Wed, Mar 31, 2021 at 12:53 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> > But this whole annotation thing will require serious compiler support.
> > We already have problems with compilers inlining functions and getting confused about attributes.
>
> We added compiler annotation for user-level interrupt handlers.
> I'm not aware of it failing, or otherwise being confused.

I followed your link and found nothing. Can you elaborate?  In the
kernel, we have noinstr, and gcc gives approximately no help toward
catching problems.

>
> Why would compiler support for fast-signals be any more "serious"?
>
> > An API like:
> >
> > if (get_amx()) {
> >  use AMX;
> > } else {
> >  don’t;
> > }
> >
> > Avoids this problem. And making XCR0 dynamic, for all its faults, at least helps force a degree of discipline on user code.
>
> dynamic XCR0 breaks the installed base, I thought we had established that.

I don't think this is at all established.  If some code thinks it
knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
is not necessarily a showstopper.
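
To make the size dependence concrete, here is a rough sketch of how the uncompacted XSAVE area grows with XCR0. The offset/size table is an assumption: it lists the values CPUID leaf 0xD typically reports on current Intel parts, and real code has to query CPUID rather than hard-code them. The function and struct names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct xcomp {
    unsigned bit;     /* XCR0 bit number */
    uint32_t offset;  /* offset in the uncompacted XSAVE area */
    uint32_t size;    /* size of the component in bytes */
};

/* ASSUMED layout (typical CPUID.0xD values); not guaranteed on every CPU. */
static const struct xcomp xcomps[] = {
    {  2,  576,  256 },  /* AVX: upper halves of YMM0-15 */
    {  3,  960,   64 },  /* MPX bound registers */
    {  4, 1024,   64 },  /* MPX BNDCFGU/BNDSTATUS */
    {  5, 1088,   64 },  /* AVX-512 opmask k0-k7 */
    {  6, 1152,  512 },  /* AVX-512 ZMM_Hi256 */
    {  7, 1664, 1024 },  /* AVX-512 Hi16_ZMM */
    {  9, 2688,    8 },  /* PKRU */
    { 17, 2752,   64 },  /* AMX TILECFG */
    { 18, 2816, 8192 },  /* AMX TILEDATA */
};

/* Uncompacted size = end of the highest enabled component, and never less
 * than the 512-byte legacy area plus the 64-byte XSAVE header. */
size_t xstate_size_uncompacted(uint64_t xcr0)
{
    size_t size = 512 + 64;

    assert(xcr0 & 1);  /* the x87 bit is always set in a valid XCR0 */
    for (size_t i = 0; i < sizeof(xcomps) / sizeof(xcomps[0]); i++) {
        if (xcr0 & (1ULL << xcomps[i].bit)) {
            size_t end = (size_t)xcomps[i].offset + xcomps[i].size;

            if (end > size)
                size = end;
        }
    }
    return size;
}
```

Under that assumed table, XCR0 = 0x7 gives 832 bytes, 0xe7 (AVX-512) gives 2688 and 0x600e7 (AMX added) gives 11008, so a size computed once and cached is stale as soon as XCR0 changes.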

>
> We've also established that when running in a VMM, every update to
> XCR0 causes a VMEXIT.

This is true, it sucks, and Intel could fix it going forward.

>
> I thought the goal was to allow new programs to have fast signal handlers.
> By default, those fast signal handlers would have a stable state
> image, and would
> not inherit large architectural state on their stacks, and could thus
> have minimal overhead on all hardware.

That is *a* goal, but not necessarily the only goal.

--Andy

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 22:28                                 ` Len Brown
  2021-03-31 22:45                                   ` Andy Lutomirski
@ 2021-03-31 22:52                                   ` Borislav Petkov
  2021-04-09 20:55                                     ` Len Brown
  1 sibling, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2021-03-31 22:52 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, David Laight, Dave Hansen, Andy Lutomirski,
	Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Wed, Mar 31, 2021 at 06:28:27PM -0400, Len Brown wrote:
> dynamic XCR0 breaks the installed base, I thought we had established
> that.

We should do a clear cut and have legacy stuff which has its legacy
expectations on the XSTATE layout and not touch those at all.

And then all new apps which will use these new APIs can go and request
whatever fancy new state constellations we support. Including how they
want their signals handled, etc.

Fat states like avx512, amx etc will be off by default and apps
explicitly requesting those, can get them.

That's it.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 22:45                                   ` Andy Lutomirski
@ 2021-04-09 20:52                                     ` Len Brown
  2021-04-09 21:44                                       ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Len Brown @ 2021-04-09 20:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Laight, Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API

On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote:
>
> > We added compiler annotation for user-level interrupt handlers.
> > I'm not aware of it failing, or otherwise being confused.
>
> I followed your link and found nothing. Can you elaborate?  In the
> kernel, we have noinstr, and gcc gives approximately no help toward
> catching problems.

A search for the word "interrupt" on this page
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
comes to the description of this attribute:

__attribute__ ((interrupt))

> > dynamic XCR0 breaks the installed base, I thought we had established that.
>
> I don't think this is at all established.  If some code thinks it
> knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
> is not necessarily a showstopper.

My working assumption is that crashing applications actually *is* a showstopper.
Please clarify.

> > We've also established that when running in a VMM, every update to
> > XCR0 causes a VMEXIT.
>
> This is true, it sucks, and Intel could fix it going forward.

What hardware fix do you suggest?
If a guest is permitted to set XCR0 bits without notifying the VMM,
what happens when it sets bits that the VMM doesn't know about?

> > I thought the goal was to allow new programs to have fast signal handlers.
> > By default, those fast signal handlers would have a stable state
> > image, and would
> > not inherit large architectural state on their stacks, and could thus
> > have minimal overhead on all hardware.
>
> That is *a* goal, but not necessarily the only goal.

I fully support coming up with a scheme for fast future-proof signal handlers,
and I'm willing to back that up by putting work into it.

I don't see any other goals articulated in this thread.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-03-31 22:52                                   ` Borislav Petkov
@ 2021-04-09 20:55                                     ` Len Brown
  0 siblings, 0 replies; 36+ messages in thread
From: Len Brown @ 2021-04-09 20:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, David Laight, Dave Hansen, Andy Lutomirski,
	Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Wed, Mar 31, 2021 at 6:54 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Mar 31, 2021 at 06:28:27PM -0400, Len Brown wrote:
> > dynamic XCR0 breaks the installed base, I thought we had established
> > that.
>
> We should do a clear cut and have legacy stuff which has its legacy
> expectations on the XSTATE layout and not touch those at all.
>
> And then all new apps which will use these new APIs can go and request
> whatever fancy new state constellations we support. Including how they
> want their signals handled, etc.
>
> Fat states like avx512, amx etc will be off by default and apps
> explicitly requesting those, can get them.
>
> That's it.

100% agreement from me!  (does anybody disagree?)

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-04-09 20:52                                     ` Len Brown
@ 2021-04-09 21:44                                       ` Andy Lutomirski
  2021-04-11 19:07                                         ` Len Brown
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2021-04-09 21:44 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, David Laight, Dave Hansen, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Fri, Apr 9, 2021 at 1:53 PM Len Brown <lenb@kernel.org> wrote:
>
> On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote:
> >
> > > We added compiler annotation for user-level interrupt handlers.
> > > I'm not aware of it failing, or otherwise being confused.
> >
> > I followed your link and found nothing. Can you elaborate?  In the
> > kernel, we have noinstr, and gcc gives approximately no help toward
> > catching problems.
>
> A search for the word "interrupt" on this page
> https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
> comes to the description of this attribute:
>
> __attribute__ ((interrupt))
>

I read that and I see no mention of anything saying "this will
generate code that does not touch extended state".  Instead I see,
paraphrasing, "this will generate code with an ABI that is completely
inappropriate for use in a user space signal handler".  Am I missing
something?

> > > dynamic XCR0 breaks the installed base, I thought we had established that.
> >
> > I don't think this is at all established.  If some code thinks it
> > knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
> > is not necessarily a showstopper.
>
> My working assumption is that crashing applications actually *is* a showstopper.
> Please clarify.

I think you're presuming that some program actually does this.  If no
program does this, it's not an ABI break.

More relevantly, this can only happen in a process that uses XSAVE and
thinks it knows the size that *also* does the prctl to change XCR0.
By construction, existing programs can't break unless they load new
dynamic libraries that break them.

>
> > > We've also established that when running in a VMM, every update to
> > > XCR0 causes a VMEXIT.
> >
> > This is true, it sucks, and Intel could fix it going forward.
>
> What hardware fix do you suggest?
> If a guest is permitted to set XCR0 bits without notifying the VMM,
> what happens when it sets bits that the VMM doesn't know about?

The VM could have a mask of allowed XCR0 bits that don't exit.

TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on
TDX.  Surely plain VMX could fix it too.

>
> > > I thought the goal was to allow new programs to have fast signal handlers.
> > > By default, those fast signal handlers would have a stable state
> > > image, and would
> > > not inherit large architectural state on their stacks, and could thus
> > > have minimal overhead on all hardware.
> >
> > That is *a* goal, but not necessarily the only goal.
>
> I fully support coming up with a scheme for fast future-proof signal handlers,
> and I'm willing to back that up by putting work into it.
>
> I don't see any other goals articulated in this thread.

Before we get too carried away with *fast* signal handlers, something
that works with existing programs is also a pretty strong goal.  Right
now AVX-512 breaks existing programs, even if they don't use AVX-512.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-04-09 21:44                                       ` Andy Lutomirski
@ 2021-04-11 19:07                                         ` Len Brown
  2021-04-12  7:59                                           ` David Laight
                                                             ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Len Brown @ 2021-04-11 19:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Laight, Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML,
	LKML, libc-alpha, Florian Weimer, Rich Felker, Kyle Huey,
	Keno Fischer, Linux API

On Fri, Apr 9, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Apr 9, 2021 at 1:53 PM Len Brown <lenb@kernel.org> wrote:
> >
> > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote:
> > >
> > > > We added compiler annotation for user-level interrupt handlers.
> > > > I'm not aware of it failing, or otherwise being confused.
> > >
> > > I followed your link and found nothing. Can you elaborate?  In the
> > > kernel, we have noinstr, and gcc gives approximately no help toward
> > > catching problems.
> >
> > A search for the word "interrupt" on this page
> > https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
> > comes to the description of this attribute:
> >
> > __attribute__ ((interrupt))
> >
>
> I read that and I see no mention of anything saying "this will
> generate code that does not touch extended state".  Instead I see,
> paraphrasing, "this will generate code with an ABI that is completely
> inappropriate for use in a user space signal handler".  Am I missing
> something?

Again...

An analogous annotation could be created for fast signals.
gcc can be told exactly what registers and instructions it can use for
that routine.

If somebody can suggest a way to make fast signal handlers faster
than saving only the state that they-themselves actually use, I'm all ears.

> > > > dynamic XCR0 breaks the installed base, I thought we had established that.
> > >
> > > I don't think this is at all established.  If some code thinks it
> > > knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
> > > is not necessarily a showstopper.
> >
> > My working assumption is that crashing applications actually *is* a showstopper.
> > Please clarify.
>
> I think you're presuming that some program actually does this.  If no
> program does this, it's not an ABI break.

So you agree that a program that uses xgetbv to read XCR0 and compute
the XSTATE size for user-space use of XSAVE can break if XCR0 changes
during its lifetime.
But you don't believe such software exists?

> More relevantly, this can only happen in a process that uses XSAVE and
> thinks it knows the size that *also* does the prctl to change XCR0.
> By construction, existing programs can't break unless they load new
> dynamic libraries that break them.

Let's say that a program does math.
It calls a library to do that math.
It doesn't know or care what instructions the library uses to do math.
eg. the library uses SSE on an Atom, and uses AVX512 on a Xeon.

Who calls the new prctl, the program, or the library?

If it is the program, how does it know which instructions the library
wants to use?

If it is the library, then you have just changed XCR0 at run-time and
you expose breakage of the thread library that has computed XSAVE size.

> > > > We've also established that when running in a VMM, every update to
> > > > XCR0 causes a VMEXIT.
> > >
> > > This is true, it sucks, and Intel could fix it going forward.
> >
> > What hardware fix do you suggest?
> > If a guest is permitted to set XCR0 bits without notifying the VMM,
> > what happens when it sets bits that the VMM doesn't know about?
>
> The VM could have a mask of allowed XCR0 bits that don't exit.
>
> TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on
> TDX.  Surely plain VMX could fix it too.

There are two cases.

1. Hardware that exists today and in the foreseeable future.

VM modification of XCR0 results in VMEXIT to VMM.
The VMM sees bits set by the guest, and so it can accept what
it supports, or send the VM a fault for non-support.

Here it is not possible for the VM to change XCR0 without the VMM knowing.

2. Future Hardware that allows guests to write XCR0 w/o VMEXIT.

Not sure I follow your proposal.

Yes, the VM effectively has a mask of what is supported,
because it can issue CPUID.

The VMM virtualizes CPUID, and needs to know it must not
expose to the VM any state features it doesn't support.
Also, the VMM needs to audit XCR0 before it uses XSAVE,
else the guest could attack or crash the VMM through
buffer overrun.  Is this what you suggest?

If yes, what do you suggest in the years between now and when
that future hardware and VMM exist?

> > > > I thought the goal was to allow new programs to have fast signal handlers.
> > > > By default, those fast signal handlers would have a stable state
> > > > image, and would
> > > > not inherit large architectural state on their stacks, and could thus
> > > > have minimal overhead on all hardware.
> > >
> > > That is *a* goal, but not necessarily the only goal.
> >
> > I fully support coming up with a scheme for fast future-proof signal handlers,
> > and I'm willing to back that up by putting work into it.
> >
> > I don't see any other goals articulated in this thread.
>
> Before we get too carried away with *fast* signal handlers, something
> that works with existing programs is also a pretty strong goal.  Right
> now AVX-512 breaks existing programs, even if they don't use AVX-512.

Re: "AVX-512 breaks existing programs, even if they don't use AVX-512"

Perhaps it would be useful to review how that breakage can happen,
recognize when it is a problem,  when it is not a problem, and what we
are doing to address it today.

The "ABI" here is the signal.h definition of MINSIGSTKSZ and SIGSTKSZ as
2KB and 8KB (on all architectures).  These hard-coded constants may be used
by programs that choose to manually allocate and register alternative
signal stacks.

The signal delivery ABI we use today, where all x86 architecture state is
XSAVED onto the signal stack, will exceed 2KB when running on hardware that
supports AVX-512.

This issue is real.  There do exist programs that use alternative stacks,
and of those, there do exist programs that use these constants, and if they
do take a signal on that size stack on that hardware, they do fail.

However, as evidenced by the fact that AVX-512 shipped several years ago and
the world didn't stop, there are not a lot of programs with this exposure.
That said, adding 8KB to the architecture state on systems that support AMX/TMUL
makes this existing issue significantly more acute.

Glibc 2.34, to be released in July, re-defines these constants into
run-time values.
It uses CPUID to compute values that work, and so a program that uses this ABI
and is compiled with glibc 2.34 or later will not fail.

Further, Chang's kernel patch series does two important things.
First, it inspects the destination stack and computes the stack frame size
and it refuses to write onto a stack that will overflow.  We should have always
been making that check.

Second, it exports the kernel's notion of how big the signal stack needs to be
via the auxiliary vector (auxv), and glibc 2.34 picks this up and uses it in
preference over its own CPUID calculation, above.

So in a perfect world, you have AMX hardware, and the OS that supports your
AMX hardware has a kernel and glibc that support it.  Everything that comes
with that OS, or is built on that OS, uses that new library.
This mechanism similarly addresses the AVX-512 stack issue.

Granted, if you have an application that is statically linked and run on
new hardware and new OS, it can still fail.

Granted, if you have an application that creates small signal stacks
without using the ABI, even a re-compile with the new library will not
help it.

Granted, signal stacks -- whether they be normal or these alternative
signal stacks -- are bigger on hardware that has more hardware
architecture state.

But applications that use the ABI do not need to be modified.

I believe that this plan is sane.

I acknowledge that it doesn't address the desire for signal handlers that
are minimal and fast on all hardware. I think we can address that with a
NEW ABI, but not the old one.

thanks,
Len Brown, Intel Open Source Technology Center






^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-04-11 19:07                                         ` Len Brown
@ 2021-04-12  7:59                                           ` David Laight
  2021-04-12 12:19                                           ` Borislav Petkov
  2021-04-12 17:14                                           ` Sean Christopherson
  2 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2021-04-12  7:59 UTC (permalink / raw)
  To: 'Len Brown', Andy Lutomirski
  Cc: Dave Hansen, Greg KH, Bae, Chang Seok, X86 ML, LKML, libc-alpha,
	Florian Weimer, Rich Felker, Kyle Huey, Keno Fischer, Linux API

From: Len Brown
> Sent: 11 April 2021 20:07
...
> Granted, if you have an application that is statically linked and run
> on new hardware
> and new OS, it can still fail.

That also includes anything compiled and released as a program
binary that must run on older Linux installations.

Such programs have to be compiled with old copies of the system
headers (and probably with an old version of gcc).

While such programs themselves won't use AVX without checking
for OS support, the glibc code on the installed system might.

Such programs can be modified to run-time detect the required
signal stack size - but cannot rely on glibc to convert
SIGSTKSZ into a function call.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-04-11 19:07                                         ` Len Brown
  2021-04-12  7:59                                           ` David Laight
@ 2021-04-12 12:19                                           ` Borislav Petkov
  2021-04-12 17:14                                           ` Sean Christopherson
  2 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2021-04-12 12:19 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, David Laight, Dave Hansen, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Sun, Apr 11, 2021 at 03:07:29PM -0400, Len Brown wrote:
> If it is the program, how does it know that the library wants to use
> what instructions?
> 
> If it is the library, then you have just changed XCR0 at run-time and
> you expose breakage of the thread library that has computed XSAVE size.

So, when old programs which cannot possibly know about the arch_prctl()
extension we're proposing here, link against that library, then that
library should not be allowed to go use "fat" states.

Unless the library can "tell" the process which links to it, that it
has dynamically enlarged the save state. If it can and the process can
handle that, then all is fine, save state gets enlarged dynamically and
it all continues merrily.

Also, in order for the library to use fat states, it needs to ask the
kernel for such support - not CPUID - because the kernel is doing the
state handling for everybody and doing all the CR4.OSXSAVE setup etc.

Which also means that the kernel can help here by telling the library:

- No, you cannot use fat states with this process because it hasn't
called arch_prctl() so it cannot handle them properly.

- Yes, this process allows me to handle fat states for it so you can
use those states and thus those instructions when doing operations for
it.

So the kernel becomes the arbiter in all this - as it should be - and
then all should work fine.
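
A toy model of that arbitration, purely to pin down the two answers above. Everything here is hypothetical: the struct, the function names and the idea that the opt-in is a simple bitmask are assumptions for illustration, not an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XFEATURE_MASK_AVX512  (7ULL << 5)   /* opmask, ZMM_Hi256, Hi16_ZMM */
#define XFEATURE_MASK_AMX     (3ULL << 17)  /* TILECFG, TILEDATA */

/* Hypothetical per-process record kept by the kernel. */
struct task_xstate {
    uint64_t hw_supported;  /* fat states the CPU and kernel can manage */
    uint64_t opted_in;      /* fat states the process has asked for */
};

/* Models the process's arch_prctl() opt-in; unsupported states are refused
 * and leave the record unchanged. */
bool task_opt_in(struct task_xstate *t, uint64_t features)
{
    assert(t != NULL);
    if (features & ~t->hw_supported)
        return false;
    t->opted_in |= features;
    return true;
}

/* Models the library's question to the kernel: may I use these states on
 * behalf of this process?  Only if the process opted in to all of them. */
bool library_may_use(const struct task_xstate *t, uint64_t features)
{
    assert(t != NULL);
    return (features & ~t->opted_in) == 0;
}
```

An old program never calls task_opt_in(), so library_may_use() stays false for every fat state and the library has to fall back to the legacy instruction set.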

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
  2021-04-11 19:07                                         ` Len Brown
  2021-04-12  7:59                                           ` David Laight
  2021-04-12 12:19                                           ` Borislav Petkov
@ 2021-04-12 17:14                                           ` Sean Christopherson
  2 siblings, 0 replies; 36+ messages in thread
From: Sean Christopherson @ 2021-04-12 17:14 UTC (permalink / raw)
  To: Len Brown
  Cc: Andy Lutomirski, David Laight, Dave Hansen, Greg KH, Bae,
	Chang Seok, X86 ML, LKML, libc-alpha, Florian Weimer,
	Rich Felker, Kyle Huey, Keno Fischer, Linux API

On Sun, Apr 11, 2021, Len Brown wrote:
> On Fri, Apr 9, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Apr 9, 2021 at 1:53 PM Len Brown <lenb@kernel.org> wrote:
> > >
> > > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown <lenb@kernel.org> wrote:
> > > > > We've also established that when running in a VMM, every update to
> > > > > XCR0 causes a VMEXIT.
> > > >
> > > > This is true, it sucks, and Intel could fix it going forward.
> > >
> > > What hardware fix do you suggest?
> > > If a guest is permitted to set XCR0 bits without notifying the VMM,
> > > what happens when it sets bits that the VMM doesn't know about?
> >
> > The VM could have a mask of allowed XCR0 bits that don't exit.
> >
> > TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on
> > TDX.  Surely plain VMX could fix it too.
> 
> There are two cases.
> 
> 1. Hardware that exists today and in the foreseeable future.
> 
> VM modification of XCR0 results in VMEXIT to VMM.
> The VMM sees bits set by the guest, and so it can accept what
> it supports, or send the VM a fault for non-support.
> 
> Here it is not possible for the VM to change XCR0 without the VMM knowing.
> 
> 2. Future Hardware that allows guests to write XCR0 w/o VMEXIT.
> 
> Not sure I follow your proposal.
> 
> Yes, the VM effectively has a mask of what is supported,
> because it can issue CPUID.
> 
> The VMM virtualizes CPUID, and needs to know it must not
> expose to the VM any state features it doesn't support.
> Also, the VMM needs to audit XCR0 before it uses XSAVE,
> else the guest could attack or crash the VMM through
> buffer overrun.

The VMM already needs to context switch XCR0 and XSS, so this is a non-issue.

> Is this what you suggest?

Yar.  In TDX, XSETBV exits, but only to the TDX module.  I.e. TDX solves the
problem in software by letting the VMM tell the TDX module what features the
guest can set in XCR0/XSS via the XFAM (Extended Features Allowed Mask).

But, that software "fix" can also be pushed into ucode, e.g. add an XFAM VMCS
field, the guest can set any XCR0 bits that are '1' in VMCS.XFAM without exiting.
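
In other words, the check that ucode would have to make is a single mask test. A sketch, with a made-up function name and no claim about how real hardware or the TDX module implements it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a guest XSETBV to XCR0 would still need to exit to the
 * VMM, i.e. if it tries to set any bit outside the allowed mask (XFAM).
 * Architectural validity checks on the new value are out of scope here. */
bool xsetbv_needs_exit(uint64_t new_xcr0, uint64_t xfam)
{
    return (new_xcr0 & ~xfam) != 0;
}
```

With xfam = 0xe7, a guest write of 0x7 or 0xe7 would complete without an exit, while 0x600e7 (AMX bits set) would still trap.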

Note, SGX has similar functionality in the form of XFRM (XSAVE-Feature Request
Mask).  The enclave author can specify what features will be enabled in XCR0
when the enclave is running.  Not that relevant, other than to reinforce that
this is a solvable problem.

> If yes, what do you suggest in the years between now and when
> that future hardware and VMM exist?

Burn some patch space? :-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

Thread overview: 36+ messages
     [not found] <CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@mail.gmail.com>
2021-03-26 23:18 ` Candidate Linux ABI for Intel AMX and hypothetical new related features Andy Lutomirski
2021-03-27  3:39   ` Len Brown
2021-03-27  9:14     ` Borislav Petkov
2021-03-27  9:58     ` Greg KH
2021-03-29 15:47       ` Len Brown
2021-03-29 16:38         ` Len Brown
2021-03-29 16:48           ` Florian Weimer
2021-03-29 18:14           ` Andy Lutomirski
2021-03-29 18:16         ` Andy Lutomirski
2021-03-29 22:38           ` Len Brown
2021-03-30  5:08             ` Andy Lutomirski
2021-03-30  5:50               ` Noah Goldstein
2021-03-30 17:01               ` Len Brown
2021-03-30 17:05                 ` Andy Lutomirski
2021-03-30 17:56                   ` Len Brown
2021-03-30 19:12                     ` Dave Hansen
2021-03-30 20:20                       ` Andy Lutomirski
2021-03-30 20:42                         ` Len Brown
2021-03-30 22:01                           ` David Laight
2021-03-31 16:31                             ` Len Brown
2021-03-31 16:53                               ` Andy Lutomirski
2021-03-31 21:42                                 ` Robert O'Callahan
2021-03-31 22:11                                   ` Len Brown
2021-03-31 22:28                                 ` Len Brown
2021-03-31 22:45                                   ` Andy Lutomirski
2021-04-09 20:52                                     ` Len Brown
2021-04-09 21:44                                       ` Andy Lutomirski
2021-04-11 19:07                                         ` Len Brown
2021-04-12  7:59                                           ` David Laight
2021-04-12 12:19                                           ` Borislav Petkov
2021-04-12 17:14                                           ` Sean Christopherson
2021-03-31 22:52                                   ` Borislav Petkov
2021-04-09 20:55                                     ` Len Brown
2021-03-28  0:53   ` Thomas Gleixner
2021-03-29  7:27     ` Peter Zijlstra
2021-03-29 15:06     ` Dave Hansen
