All of lore.kernel.org
* x86 CPU features detection for applications (and AMX)
@ 2021-06-23 15:04 Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Florian Weimer @ 2021-06-23 15:04 UTC (permalink / raw)
  To: libc-alpha, linux-api, x86, linux-arch; +Cc: H.J. Lu

We have an interface in glibc to query CPU features:

  X86-specific Facilities
  <https://www.gnu.org/software/libc/manual/html_node/X86.html>

CPU_FEATURE_USABLE means that all preconditions for a feature are met;
HAS_CPU_FEATURE means it's in silicon but possibly dormant.
CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
enabling the relevant bit (so it cannot pass through any unknown bits).

It turns out we screwed up in the glibc 2.33 release: the absolutely
required headers weren't actually installed:

  [PATCH] x86: Install <bits/platform/x86.h> [BZ #27958]
  <https://sourceware.org/pipermail/libc-alpha/2021-June/127215.html>

Given that the magic constants aren't available in any other way, this
feature was completely unusable, so we can perhaps revisit it and switch
to a different approach.

Previously kernel developers have expressed dismay that we didn't
coordinate the interface with them.  This is why I want to raise this now.

When we designed this glibc interface, we assumed that bits would be
static during the life-time of the process, initialized at process
start.  That follows the model of previous x86 CPU feature enablement.
In the background, CPU_FEATURE_USABLE/HAS_CPU_FEATURE calls a function
which returns a pointer to eight 32-bit words, based on the index passed
to the function (out-of-range indices return a pointer to zeros,
enabling forward compatibility).  The macros then use a magic constant
that encodes the lookup index and which of those 256 bits to extract to
find that bit, plus the feature/usable choice.  This means that we
*could* keep this interface unchanged if the kernel gives us a way to
read up-to-date feature state from a 256 bit area (or at least 32 bit
word) in thread-specific data.  Similar to what we have with
set_robust_list and rseq today.
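The mechanism described above can be sketched roughly like this; all names
and sizes here are made up for illustration, not glibc's actual internals:

```c
#include <stdint.h>

/* Each index selects a record of eight 32-bit words (256 bits);
   out-of-range indices return a pointer to zeros, which keeps old
   binaries forward compatible with feature bits they do not know. */
struct cpu_feature_rec { uint32_t word[8]; };

static const struct cpu_feature_rec zeros;   /* all-zero fallback      */
static struct cpu_feature_rec features[4];   /* filled at startup      */

static const struct cpu_feature_rec *feature_rec(unsigned idx)
{
    return idx < 4 ? &features[idx] : &zeros;
}

/* A "magic constant" packs: record index, word within the record,
   and bit within the word. */
#define FEATURE_BIT(idx, word, bit) (((idx) << 8) | ((word) << 5) | (bit))

static int feature_usable(unsigned magic)
{
    unsigned idx  = magic >> 8;
    unsigned word = (magic >> 5) & 7;
    unsigned bit  = magic & 31;
    return (feature_rec(idx)->word[word] >> bit) & 1;
}
```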

This still wouldn't cover the enable/disable side, but at least it would
work for CPU features which are modal and come and go.  The fact that we
tell GCC to cache the returned pointer from that internal function, but
not that the data is immutable, works to our advantage here.

On the other hand, maybe there is a way to give users a better
interface.  Obviously we want to avoid a syscall for a simple CPU
feature check.  And we also need something to enable/disable CPU
features.

Thanks,
Florian

PS: Is it true that there is no public mailing list for Linux
discussions specific to x86?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
@ 2021-06-23 15:32 ` Dave Hansen
  2021-07-08  6:05   ` Florian Weimer
  2021-06-25 23:31 ` Thiago Macieira
  2021-07-08 17:56 ` Mark Brown
  2 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2021-06-23 15:32 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha, linux-api, x86, linux-arch; +Cc: H.J. Lu

On 6/23/21 8:04 AM, Florian Weimer wrote:
> https://www.gnu.org/software/libc/manual/html_node/X86.html
...
> Previously kernel developers have expressed dismay that we didn't
> coordinate the interface with them.  This is why I want to raise this now.

This looks basically like someone dumped a bunch of CPUID bit values and
exposed them to applications without considering whether applications
would ever need them.  For instance, why would an app ever care about:

	PKS – Protection keys for supervisor-mode pages.

And how could glibc ever give applications accurate information about
whether PKS "is supported by the operating system"?  It just plain
doesn't know, or at least only knows from a really weak ABI like
/proc/cpuinfo.

It also doesn't seem to tell applications what they want to know, which
is: "can I, the application, *use* this feature?"

> PS: Is it true that there is no public mailing list for Linux
> discussions specific to x86?

Yes.  I've asked recently for something x86-related, but folks were
concerned that what I was asking for was too specific; what I wanted was
more of a brainstorming place to put x86-specific RFCs.

	https://subspace.kernel.org/lists.linux.dev.html


* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
@ 2021-06-25 23:31 ` Thiago Macieira
  2021-06-28 12:40   ` Enrico Weigelt, metux IT consult
  2021-07-08  7:08   ` Florian Weimer
  2021-07-08 17:56 ` Mark Brown
  2 siblings, 2 replies; 35+ messages in thread
From: Thiago Macieira @ 2021-06-25 23:31 UTC (permalink / raw)
  To: fweimer; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 23 Jun 2021 17:04:27 +0200, Florian Weimer wrote:
> We have an interface in glibc to query CPU features:
> X86-specific Facilities
> <https://www.gnu.org/software/libc/manual/html_node/X86.html>
>
> CPU_FEATURE_USABLE means that all preconditions for a feature are met;
> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
> enabling the relevant bit (so it cannot pass through any unknown bits).

It's a nice initiative, but it doesn't help libraries and applications that
need to be either cross-platform or backwards compatible.

The first problem is the cross-platformness need. Because we library and 
application developers need to support other OSes, we'll need to deploy our 
own CPUID-based detection. It's far better to use common code everywhere, 
where one developer working on Linux can fix bugs in FreeBSD, macOS or Windows 
or any of the permutations. Every platform-specific deviation adds to 
maintenance requirements and is a source of potential latent bugs, now or in 
the future due to refactoring. That is why doing everything in the form of 
instructions would be far better and easier, rather than system calls.

[Unless said system calls were standardised and actually deployed. Making this 
a cross-platform library that is not part of libc would be a major step in 
that direction]

The second problem is going to be backwards compatibility. Applications and 
libraries may want to ship precompiled binaries that make use of the new CPU 
features, whether they are open source or not. It comes as no surprise to 
anyone that we CPU makers will have made software that uses those features and 
want to have it ready on Day 1 of the HW being available for the market (if 
we're doing our jobs right). That often involves precompiling because everyone 
who installed their compilers more than one year ago will not have the 
necessary tools to build. That runs counter to the need to use a libc 
interface that didn't exist until recently.

And by "recently", I mean "anything since the glibc that came with Red Hat 
Enterprise Linux 7" (2.17).

So no, application and library developers will not use libc functions they 
don't need to, especially if it adds to their problems, unless there's no way 
around it.

> Previously kernel developers have expressed dismay that we didn't
> coordinate the interface with them.  This is why I want to raise this now.

You also need to coordinate with your users.

A platform-specific API for a problem that is already solved is met with 
"knock yourself out, we're not going to use this." So my first suggestion is 
to remove the "platform-specific" part and make this a cross-platform solution.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




* Re: x86 CPU features detection for applications (and AMX)
  2021-06-25 23:31 ` Thiago Macieira
@ 2021-06-28 12:40   ` Enrico Weigelt, metux IT consult
  2021-06-28 13:20     ` Peter Zijlstra
  2021-06-28 15:08     ` Thiago Macieira
  2021-07-08  7:08   ` Florian Weimer
  1 sibling, 2 replies; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-28 12:40 UTC (permalink / raw)
  To: Thiago Macieira, fweimer
  Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 26.06.21 01:31, Thiago Macieira wrote:

Hi folks,

> The first problem is the cross-platformness need. Because we library and
> application developers need to support other OSes, we'll need to deploy our
> own CPUID-based detection. It's far better to use common code everywhere,
> where one developer working on Linux can fix bugs in FreeBSD, macOS or Windows
> or any of the permutations. Every platform-specific deviation adds to
> maintenance requirements and is a source of potential latent bugs, now or in
> the future due to refactoring. That is why doing everything in the form of
> instructions would be far better and easier, rather than system calls.

hmm, maybe some libcpuid ?

Personally, I believe that glibc is already too big; too many things in
it belong in a separate library.

> The second problem is going to be backwards compatibility. Applications and
> libraries may want to ship precompiled binaries that make use of the new CPU
> features, whether they are open source or not. 

Since we're talking about GNU libc here, binary-only stuff is probably
out of scope. OTOH, using different libc versions in those special
cases isn't such a big deal.

OTOH, depending on the programming language, those things might be
better put into the compiler. (Of course that needs good coordination
between different compiler vendors.)

Why do I believe so?

a) If you really want optimal performance, it's not just about whether
    you can use a certain opcode (and have a fallback if the opcode is
    missing) - the order also matters.

    For example, back in P4 times, an old friend of mine (raytracing
    expert) did in-depth measurement of the P4's timing behaviour
    depending on opcode ordering, and he managed to achieve a 5..10x (!)
    performance improvement on raytracing workloads. But this was only
    possible with hand-written assembler code; a C compiler wouldn't be
    able to do that (we actually designed a DSL for that, so a special
    compiler could generate the optimal CPU-specific code).

b) In recent years we're moving more and more to higher level (but still
    compiling to machine code) languages, where writing such machine
    specific code is either very hard or practically impossible - it's
    simply the compiler's duty to handle such things.

> It comes as no surprise to anyone that we CPU makers will have made
> software that uses those features and want to have it ready on Day 1 of
> the HW being available for the market (if we're doing our jobs right).
> That often involves precompiling because everyone who installed their
> compilers more than one year ago will not have the necessary tools to
> build.

Actually, you should talk to the compiler folks much earlier, at the
point where you know what those features look like.

Of course, there are cases where folks can't upgrade their compilers
easily due to certain formal processes. But those are just special cases;
they've got tons of other problems for the same reason, and quite
frankly their processes are just broken to begin with. I've done several
projects in regulated areas (medical, automotive, etc) where folks use
ancient stuff with the excuse of too much paperwork and the silly idea
that a certain sheet of certification paper for some old piece of code
gives any clue about actual quality or helps with their own
qualification processes (which, when reading the regulations carefully,
quickly turns out to be nonsense).

For using certain new CPU specific features, the need for a compiler
upgrade really should be no excuse. And at least for the vast majority
of cases, a proper compiler could do it much better than the average
programmer.

> And by "recently", I mean "anything since the glibc that came with Red Hat
> Enterprise Linux 7" (2.17).

Uh, that's really ancient. Nobody can seriously expect modern features
on such an ancient distro. If people really insist on spending so much
money to run such old code, instead of just doing a dist upgrade,
then I can only reply with "not our problem".

In that case it's the duty of Red Hat to do their job right for once
and provide newer libc and gcc versions (and yes, it really isn't a big
deal to have multiple versions of them installed concurrently).

I'm really not a fan of running bleeding edge software, but doing a dist
upgrade every 1..2 years really isn't asking too much (except perhaps
for some very exceptional cases). Especially not for big organisations
with enough money to spend on such vendors. (Or are their pockets empty
because they already pay so much for RHEL?)

> A platform-specific API for a problem that is already solved is met with
> "knock yourself out, we're not going to use this." So my first suggestion
> is to remove the "platform-specific" part and make this a cross-platform
> solution.

Speaking of platform specific problems: that's where I need to tell you
that I'm *very* unhappy with you CPU vendor folks. Actually it's you who
created those - not just platform- but CPU-model-specific - problems.
The whole topic is anything but new; there could have been a long term
solution decades ago, at the silicon level. (I could say a lot about the
IMHO really weird opcode layout, but that'd be too far out of scope.)

What we SW engineers need is an easy and fast method to act depending on
whether some CPU supports some feature (e.g. a new opcode). Things like
cpuinfo are only a tiny piece of that. What we could really use is a
conditional jump/call based on whether feature X is supported - without
any kernel intervention. Then the machine code could easily be laid out
to support both cases, with or without some feature X. Alternatively we
could have fast trapping in userland - a hw-generated call already would
be a big help.

If we had something in that direction, we wouldn't have to have this
kind of discussion here anymore - it would be entirely up to compiler and
library folks, no need for any kernel support at all.

Going back to AMX - I just had a quick look at the spec (*1). Sorry, but
this thing is really weird and horrible to use. Come on, these chips
already have billions of transistors; it really can't hurt so much to
spend a few more to provide a clean and easy to use machine code
interface. Grmmpf! (This is a general problem we've got with so many
HW folks: why can't they just talk to us SW folks first, so we can find
a good solution for both sides before it goes into the field?)

And one point that immediately jumps into my mind (w/o looking deeper
into it): it introduces completely new registers - do we now need extra
code for task switching etc.?

Since this stuff seems to be for tensor operations, I really wonder why
that has to be inline with classic operations. Maybe I'm just lacking
imagination, but I don't see many use cases where the application would
not want to do those operations in larger batches - and we already have
auxiliary processors like TPUs or GPUs for that. What is the intended use
case? Who would actually benefit from it?


--mtx


*1) 
https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287


* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 12:40   ` Enrico Weigelt, metux IT consult
@ 2021-06-28 13:20     ` Peter Zijlstra
  2021-06-30 12:50       ` Enrico Weigelt, metux IT consult
  2021-06-28 15:08     ` Thiago Macieira
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-28 13:20 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, fweimer, hjl.tools, libc-alpha, linux-api,
	linux-arch, x86

On Mon, Jun 28, 2021 at 02:40:32PM +0200, Enrico Weigelt, metux IT consult wrote:

> Going back to AMX - just had a quick look at the spec (*1). Sorry, but
> this thing is really weird and horrible to use. Come on, these chips
> already have billions of transistors, it really can't hurt so much
> spending a few more to provide a clean and easy to use machine code
> interface. Grmmpf! (This is a general problem we've got with so many
> HW folks, why can't them just talk to us SW folks first so we can find
> a good solution for both sides, before that goes into the field ?)
> 
> And one point that immediately jumps into my mind (w/o looking deeper
> into it): it introduces completely new registers - do we now need extra
> code for tasks switching etc ?

No, but because it's register state and part of XSAVE, it has immediate
impact on the ABI. In particular, the signal stack layout includes XSAVE
state (as does ptrace()).

At the same time, 'legacy' applications (up until _very_ recently) had a
minimum signal stack size of 2K, which is already violated by the
addition of AVX512 (there's actual breakage due to that).

Adding the insane AMX state (8k+) into that is a complete trainwreck
waiting to happen. Not to mention that having !INIT AMX state has direct
consequences for P-state selection and thus performance.

For these reasons, we OS folks will mandate that you do a prctl() to
request/release AMX (and we get to say: no). If you use AMX without
this, the instruction will fault (because not set in XCR0) and we'll
SIGBUS or something.

Userspace will have to do something like:

 - check CPUID, if !AMX -> fail
 - issue prctl(), if error -> fail
 - issue XGETBV and check that the AMX bit is set, if not -> fail
 - request the signal stack size / spawn threads
 - use AMX

Spawning threads prior to enabling AMX will result in using the wrong
signal stack size and cause malfunction; you get to keep the pieces.
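The sequence above can be sketched in C. The prctl Peter describes was later
merged in Linux 5.16 as arch_prctl(ARCH_REQ_XCOMP_PERM); the constants below
come from that interface and are defined locally in case the installed
headers predate it. This is a sketch under those assumptions, not a
definitive implementation:

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023   /* from <asm/prctl.h>, Linux >= 5.16 */
#endif
#define XFEATURE_XTILEDATA 18        /* AMX tile data state component */

/* Step 1: CPUID.(EAX=7,ECX=0):EDX bit 24 = AMX-TILE in silicon. */
static int cpu_has_amx(void)
{
#if defined(__x86_64__)
    uint32_t eax, ebx, ecx, edx;
    __asm__ ("cpuid" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(7), "c"(0));
    return (edx >> 24) & 1;
#else
    return 0;
#endif
}

/* Step 3: XGETBV(0) reads XCR0; bit 18 = XTILEDATA enabled. */
static uint64_t read_xcr0(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;
#endif
}

/* Returns 1 only when every step of the sequence succeeded. */
int request_amx(void)
{
#if defined(__x86_64__) && defined(SYS_arch_prctl)
    if (!cpu_has_amx())
        return 0;                                  /* no AMX in silicon */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
        return 0;                                  /* kernel said no */
    if (!(read_xcr0() & (1ULL << XFEATURE_XTILEDATA)))
        return 0;                                  /* still not enabled */
    /* Only now: size signal stacks, spawn threads, then use AMX. */
    return 1;
#else
    return 0;
#endif
}
```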




* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 12:40   ` Enrico Weigelt, metux IT consult
  2021-06-28 13:20     ` Peter Zijlstra
@ 2021-06-28 15:08     ` Thiago Macieira
  2021-06-28 15:27       ` Peter Zijlstra
  2021-06-30 14:32       ` Enrico Weigelt, metux IT consult
  1 sibling, 2 replies; 35+ messages in thread
From: Thiago Macieira @ 2021-06-28 15:08 UTC (permalink / raw)
  To: fweimer, Enrico Weigelt, metux IT consult
  Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Monday, 28 June 2021 05:40:32 PDT Enrico Weigelt, metux IT consult wrote:
> > The first problem is the cross-platformness need. Because we library and
> > application developers need to support other OSes, we'll need to deploy
> > our own CPUID-based detection. It's far better to use common code
> > everywhere, where one developer working on Linux can fix bugs in FreeBSD,
> > macOS or Windows or any of the permutations. Every platform-specific
> > deviation adds to maintenance requirements and is a source of potential
> > latent bugs, now or in the future due to refactoring. That is why doing
> > everything in the form of instructions would be far better and easier,
> > rather than system calls.
> hmm, maybe some libcpuid ?

Indeed. I'm querying inside Intel to see if I can get buy-in to create such a 
library.

> > The second problem is going to be backwards compatibility. Applications
> > and libraries may want to ship precompiled binaries that make use of the
> > new CPU features, whether they are open source or not.
> 
> Since we're talking about GNU libc here, binary-only stuff is probably
> out of scope here. OTOH, using differnt libc versions in those special
> cases isn't such a big deal.

Shipping a libc is not trivial, either technically or due to licensing 
requirements. Most applications want to link against whatever libc the system 
already provides, if that's possible.

> > It comes as no surprise to anyone that we CPU makers will have made
> > software that uses those features and want to have it ready on Day 1 of
> > the HW being available for the market (if we're doing our jobs right).
> > That often involves precompiling because everyone who installed their
> > compilers more than one year ago will not have the necessary tools to
> > build.
> 
> Actually, you should talk to the compiler folks much earlier, at the
> point where you know what those features look like.

We do, but it's not enough.

GCC releases once a year, so it's easy to miss the feature freeze. Then there 
are Linux distros that do LTS every 2 years or so. Worse, those two are 
usually out of phase. For example, if you're using the current Ubuntu LTS 
today (almost July 2021), you're using 20.04, which was released one month 
before the GCC 10 release. So you're using GCC 9, released May 2019, which 
means its features were frozen on December 2018. That's an incredibly long 
lead time.

As a consequence, you will see precompiled binaries.

> For using certain new CPU specific features, the need for a compiler
> upgrade really should be no excuse. And at least for vast majority of
> cases, a proper compiler could do it much better than the average
> programmer.

To compile the software that uses those instructions, undoubtedly. But what
if I did that for you, and you could simply download the binary for the
library and/or plugins, such that you could slot them into your existing
systems and CI? This could make the difference between adoption or not.

> > And by "recently", I mean "anything since the glibc that came with Red Hat
> > Enterprise Linux 7" (2.17).
> 
> Uh, that's really ancient. Nobody can seriously expect modern features
> on such an ancient distro. If people really insist spending so much
> money for running such old code, instead of just doing a dist upgrade,
> then I can only reply with "not our problem".

Yes and no.

Red Hat has been incredibly successful in backporting kernel features to the 
old 3.10 that came with RHEL 7. Whether they will do that for AMX state saving 
and the system call that we're discussing here, I can't say. AFAIU, they did 
backport the AVX512 state-saving to that 3.10, so they may.

Even if they don't, the *software* that people deploy may be the same build 
for RHEL 7 and for a modern distro that will have a 5.14 kernel. That software 
may have non-AVX, AVX2, AVX512 and AMX-specific code paths and would do 
runtime detection of which one is best to use. If a system call is needed, the 
system call needs to be issued even on that 3.10 and if it responds with 
-ENOSYS or -EINVAL, then it will fall back to the next best option.

So my point is: this shouldn't be in glibc because the glibc will not have the 
new system call wrappers or TLS fields.

> What we SW engineers need is an easy and fast method to act depending on
> whether some CPU supports some feature (e.g. a new opcode). Things like
> cpuinfo are only a tiny piece of that. What we could really use is a
> conditional jump/call based on whether feature X is supported - without
> any kernel intervention. Then the machine code could easily be laid out
> to support both cases, with or without some feature X. Alternatively we
> could have fast trapping in userland - a hw-generated call already would
> be a big help.

That's what cpuid is for. With GCC function multi-versioning or equivalent 
manually-rolled solutions, you can get exactly what you're asking for.
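GCC's function multi-versioning can be as simple as the sketch below:
target_clones asks the compiler to emit both variants and pick one at load
time via an ifunc resolver, so the caller never tests anything explicitly
(a sketch; the function itself is just an arbitrary example kernel):

```c
/* GCC emits an AVX2 clone and a default clone of this function and
   selects between them at program load, based on the running CPU. */
__attribute__((target_clones("avx2", "default")))
int dot(const int *a, const int *b, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)   /* the AVX2 clone may auto-vectorize */
        s += a[i] * b[i];
    return s;
}
```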

Yes, the checking became far more complex with the need to check XCR0 after
AVX came along, but since the instruction itself is slow and serialising,
any library will just cache the results. And as a result, the level of CPU
features is not expected to change. It never has in the past, so this hasn't
been an issue.

> If we had something in that direction, we wouldn't have to have this
> kind discussion here anymore - it would be entirely up to compiler and
> library folks, no need for any kernel support at all.

For most features, there isn't. You don't see us discussing 
AVX512VP2INTERSECT, for example. This discussion only exists because AMX 
requires more state to be saved during context switches and signal delivery. 
See Peter's email.

> And one point that immediately jumps into my mind (w/o looking deeper
> into it): it introduces completely new registers - do we now need extra
> code for tasks switching etc ?

Yes, this is the crux of this discussion.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering





* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 15:08     ` Thiago Macieira
@ 2021-06-28 15:27       ` Peter Zijlstra
  2021-06-28 16:13         ` Thiago Macieira
  2021-06-30 14:32       ` Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-28 15:27 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 08:08:41AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 05:40:32 PDT Enrico Weigelt, metux IT consult wrote:

> > What we SW engineers need is an easy and fast method to act depending on
> > whether some CPU supports some feature (e.g. a new opcode). Things like
> > cpuinfo are only a tiny piece of that. What we could really use is a
> > conditional jump/call based on whether feature X is supported - without
> > any kernel intervention. Then the machine code could easily be laid out
> > to support both cases, with or without some feature X. Alternatively we
> > could have fast trapping in userland - a hw-generated call already would
> > be a big help.
> 
> That's what cpuid is for. With GCC function multi-versioning or equivalent 
> manually-rolled solutions, you can get exactly what you're asking for.

Right, lots of self-modifying code solutions there, some of which can be
linker driven, some not. In the kernel we use alternative() to replace
short code sequences depending on CPUID.

Userspace *could* do the same, rewriting code before first execution is
fairly straight forward.

> Yes, the checking became far more complex with the need to check XCR0 after
> AVX came along, but since the instruction itself is slow and serialising,
> any library will just cache the results. And as a result, the level of CPU
> features is not expected to change. It never has in the past, so this hasn't
> been an issue.

Arguably you should be checking XCR0 for any feature there, including
SSE/AVX/AVX512 and now AMX.

Ideally we'd do a prctl() for AVX512 too, except it's too late :-(
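The check described here, sketched for AVX: the CPUID feature bit alone is
not enough; the OS must have enabled XSAVE (OSXSAVE) and the XMM+YMM state
components in XCR0. A minimal x86-64 sketch:

```c
#include <stdint.h>

static int avx_usable(void)
{
#if defined(__x86_64__)
    uint32_t eax, ebx, ecx, edx;
    __asm__ ("cpuid" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(1), "c"(0));
    int osxsave = (ecx >> 27) & 1;   /* OS uses XSAVE/XRSTOR */
    int avx     = (ecx >> 28) & 1;   /* AVX present in silicon */
    if (!osxsave || !avx)
        return 0;
    uint32_t lo, hi;
    __asm__ ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));   /* read XCR0 */
    return (lo & 0x6) == 0x6;        /* XCR0: SSE (bit 1) and AVX (bit 2) */
#else
    return 0;
#endif
}
```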


* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 15:27       ` Peter Zijlstra
@ 2021-06-28 16:13         ` Thiago Macieira
  2021-06-28 17:11           ` Peter Zijlstra
  2021-06-28 17:43           ` Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Thiago Macieira @ 2021-06-28 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 08:27:24 PDT Peter Zijlstra wrote:
> > That's what cpuid is for. With GCC function multi-versioning or equivalent
> > manually-rolled solutions, you can get exactly what you're asking for.
> 
> Right, lots of self-modifying code solutions there, some of which can be
> linker driven, some not. In the kernel we use alternative() to replace
> short code sequences depending on CPUID.
> 
> Userspace *could* do the same, rewriting code before first execution is
> fairly straight forward.

Userspace shouldn't do SMC. It's bad enough that JITs without caching exist, 
but having pure paged code is better. Pure pages are shared as needed by the 
kernel.

All you need is a simple bit test. You can then either branch to different 
code paths or write to a function pointer so it'll go there directly the next 
time. You can also choose to load different plugins depending on what CPU 
features were found.
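The pattern described above can be sketched like this: the initial function
pointer targets a resolver that tests the CPU once and then overwrites the
pointer, so later calls go straight to the right code path (names and the
example workload are made up for illustration):

```c
#include <stddef.h>

static long sum_generic(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

__attribute__((target("avx2")))      /* only called if AVX2 is usable */
static long sum_avx2(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)   /* GCC may auto-vectorize this */
        s += v[i];
    return s;
}

static long sum_resolver(const long *v, size_t n);
static long (*sum)(const long *, size_t) = sum_resolver;

static long sum_resolver(const long *v, size_t n)
{
    __builtin_cpu_init();
    sum = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_generic;
    return sum(v, n);                /* only the first call pays for the test */
}
```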

Consequence: CPU feature checking is done *very* early, often before main().

> Arguably you should be checking XCR0 for any feature there, including
> SSE/AVX/AVX512 and now AMX.
> 
> Ideally we'd do a prctl() for AVX512 too, except it's too late :-(

Right.

But speaking of which, this library would deal with Apple having done the 
allocate-state-on-demand feature for AVX512 without XFD. See
https://github.com/qt/qtbase/blob/dev/src/corelib/global/qsimd.cpp#L346-L369

Anyway, what's the current thinking on what the arch_prctl() should be? Is 
that a per-thread state or will it affect the entire process group? And is it 
a sticky functionality, or are we talking about ref/deref?

Maybe in order to answer that, we need to understand what the worst-case
scenarios we need to support are. What are they?

1) alt-stack signal handlers, usually for crashing signals (to catch a stack 
overflow)

2) cooperative user-space task schedulers, e.g. coroutines

3) preemptive user-space task schedulers (I don't know if such software exists 
or even if it is possible)

4) combination of 1 and 3

5) #4, in which each part comes from a separate library with no knowledge
of the others, and is initialised concurrently in different threads

I'd *assume* that any user-space task scheduler is aware of XSAVE at the very 
least and will know how to allocate context-saving buffers of sufficient size 
for each task. I think this is a safe assumption because AVX is over 10 years 
old now and XSAVE is a required feature for enabling the AVX state. That is, 
any library that knows to save AVX state (the upper 128-bits of the YMM 
registers) is aware of XSAVE and the fact that the state size is dynamic.

Crash handlers are another story. Speaking from experience, my first attempt
at writing one simply used a global char array of MINSIGSTKSZ bytes, and the
signal failed to get delivered (note that such code will now fail to compile,
because MINSIGSTKSZ is no longer a constant expression). My code was
attempting to launch gdb on itself, so it wasn't even a SA_SIGINFO signal,
and therefore the failure was baffling. I had to read the kernel source code
to figure out that, regardless of SA_SIGINFO, the state is saved on the stack
anyway and therefore the stack needs to be big enough. So I simply increased
the global variable's size until delivery succeeded on my AVX512 machine. And
because it is no longer using MINSIGSTKSZ, it will not fail to compile after
the glibc upgrade, but it will fail to deliver with AMX state enabled.

[I've since learned to check the XSAVE state size in order to create the
alt-stack.]

How much do we need to worry about these crash handlers?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering





* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 16:13         ` Thiago Macieira
@ 2021-06-28 17:11           ` Peter Zijlstra
  2021-06-28 17:23             ` Thiago Macieira
  2021-06-28 17:43           ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-28 17:11 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 09:13:29AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 08:27:24 PDT Peter Zijlstra wrote:
> > > That's what cpuid is for. With GCC function multi-versioning or equivalent
> > > manually-rolled solutions, you can get exactly what you're asking for.
> > 
> > Right, lots of self-modifying code solutions there, some of which can be
> > linker driven, some not. In the kernel we use alternative() to replace
> > short code sequences depending on CPUID.
> > 
> > Userspace *could* do the same, rewriting code before first execution is
> > fairly straight forward.
> 
> Userspace shouldn't do SMC. It's bad enough that JITs without caching exist, 
> but having pure paged code is better. Pure pages are shared as needed by the 
> kernel.

I don't feel that strongly; if SMC gets you measurable performance
gains, go for it. If you're short on memory, buy more.

> All you need is a simple bit test. You can then either branch to different 
> code paths or write to a function pointer so it'll go there directly the next 
> time. You can also choose to load different plugins depending on what CPU 
> features were found.

Both bit tests and indirect function calls suffer the extra memory load,
which is not free.

> Consequence: CPU feature checking is done *very* early, often before main().

For the linker based ones, yes. IIRC the ifunc() attribute is
particularly useful here.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 17:11           ` Peter Zijlstra
@ 2021-06-28 17:23             ` Thiago Macieira
  2021-06-28 19:08               ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Thiago Macieira @ 2021-06-28 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 10:11:16 PDT Peter Zijlstra wrote:
> > Consequence: CPU feature checking is done *very* early, often before
> > main().
> For the linker based ones, yes. IIRC the ifunc() attribute is
> particularly useful here.

Exactly. ifunc was designed for this exact purpose, and hence CPUID 
initialisation is done very, very early.

Anyway, if the AMX state is a sticky "set once per process", it's likely going 
to get set early for every process that *may* use AMX. And this is assuming we 
do the library right and only set it if it has AMX code at all, instead of all 
the time.

On the other hand, if it's not set once and for all, we'll have to contend 
with the size changing. TBH, this is a lot more complicated to deal with. Take 
the hypothetical example of a preemptive user-space task scheduler that 
interrupts an AMX routine (let's say for the sake of the argument that it is 
an on-stack signal; I don't see why a scheduler would need to be alt-stack). 
It will record the state and then transition to another routine. And this 
routine may be resumed in another thread of the same process.

Will the kernel understand that the new routine does not need the AMX state? 
Will it understand that the *other* routine, in the other thread will? If this 
is not done automatically by the kernel, then the task scheduler will need to 
know to ask the kernel what the reference count for the AMX state is and will 
need a syscall to set it (not just increment/decrement, though one could 
implement that with a loop).

This applies differently in the case of cooperative scheduling. The SysV ABI 
will probably say that the AMX state is caller-save, so the function call from 
the AMX-using routine implies all its state has been saved somewhere. But what 
about the kernel-side AMX refcount? Is that part of the ABI?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 16:13         ` Thiago Macieira
  2021-06-28 17:11           ` Peter Zijlstra
@ 2021-06-28 17:43           ` Peter Zijlstra
  2021-06-28 19:05             ` Thiago Macieira
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-28 17:43 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 09:13:29AM -0700, Thiago Macieira wrote:

> Anyway, what's the current thinking on what the arch_prctl() should be? Is 
> that a per-thread state or will it affect the entire process group? And is it 
> a sticky functionality, or are we talking about ref/deref?

So I didn't follow the initial discussion too well; so I might be
getting this wrong. In which case I'm hoping Thomas and/or Andy will
correct me.

But I think the proposal was per process. Having this per thread would
be really unfortunate IMO.

> Maybe in order to answer that, we need to understand what the worst case 
> scenario we need to support is. What are they?
> 
> 1) alt-stack signal handlers, usually for crashing signals (to catch a stack 
> overflow)
> 
> 2) cooperative user-space task schedulers, e.g. coroutines
> 
> 3) preemptive user-space task schedulers (I don't know if such software exists 
> or even if it is possible)

I think it's been done; use sigsetmask()/pthread_sigmask() as 'IRQ'
disable, and run a preemption tick off of SIGALRM or something.

> 4) combination of 1 and 3

None of those I think. The worst case is old executables using
MINSIGSTKSZ and not using the magic signal context at all, just regular
old signals. If you run them on an AVX512 enabled machine, they overflow
their stack and cause memory corruption.

AFAICT the only feasible way forward with that is some sysctl which
default disables AVX512 and add the prctl() and have some unsafe wrapper
that enables AVX512 for select 'legacy' programs for as long as they
exist :/

That is, binaries/libraries compiled against a static (and small)
MINSIGSTKSZ are the enemy. Which brings us to:

> 5) #4, in which each part comes from a separate library with no knowledge 
> of each other, and initialised concurrently in different threads

That's terrible... library init should *NEVER* spawn threads (I know,
don't start).

Anything that does this is basically unfixable, because we can't
guarantee the AMX prctl() gets done before the first thread.

So yes, worst case I suppose...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 17:43           ` Peter Zijlstra
@ 2021-06-28 19:05             ` Thiago Macieira
  0 siblings, 0 replies; 35+ messages in thread
From: Thiago Macieira @ 2021-06-28 19:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 10:43:47 PDT Peter Zijlstra wrote:
> None of those I think. The worst case is old executables using
> MINSIGSTKSZ and not using the magic signal context at all, just regular
> old signals. If you run them on an AVX512 enabled machine, they overflow
> their stack and cause memory corruption.

Indeed, they are already broken today. Retroactively fixing them 5 years later 
can be an additional goal, but it shouldn't stop us from having an AMX 
solution.

BTW, an alt-stack of exactly MINSIGSTKSZ on an SKX overflows inside the kernel, 
so there's no 
memory corruption in userspace. The signal handler is not even invoked and the 
process is killed by a second signal delivery. But somewhere between 
MINSIGSTKSZ and SIGSTKSZ, it will invoke the signal handler and will overflow 
inside the user code. If that alt-stack wasn't designed with guard pages, it 
will corrupt memory, as you said.

> AFAICT the only feasible way forward with that is some sysctl which
> default disables AVX512 and add the prctl() and have some unsafe wrapper
> that enables AVX512 for select 'legacy' programs for as long as they
> exist :/
> 
> That is, binaries/libraries compiled against a static (and small)
> MINSIGSTKSZ are the enemy. Which brings us to:

> > 5) #4, in which each part comes from a separate library with no
> > knowledge of each other, and initialised concurrently in different
> > threads
> 
> That's terrible... library init should *NEVER* spawn threads (I know,
> don't start).

Indeed, but that wasn't even what I was suggesting. I was thinking that the 
application would have started threads and the library init was run on 
separate threads. This may happen with lazy initialisation on first use.

But threads aren't required for the problem to happen. If the crash-handling 
case runs first, before AMX state is enabled, it might decide that it doesn't 
need to allocate sufficient alt-stack space for AMX. I think the 
recommendation here for userspace is clear: allocate the maximum that XSAVE 
tells you that you'll need, regardless of what the ambient enabled feature set 
is.

[And pretty please set up guard pages. Given that the XSAVE state area for SPR 
appears to be too close to 12 kB (see below), I'd say they should mmap() 20 kB 
and then mprotect() the lowest page to PROT_NONE.]

> Anything that does this is basically unfixable, because we can't
> guarantee the AMX prctl() gets done before the first thread.
> 
> So yes, worst case I suppose...

To wit: the worst case is a static, small alt-stack *without* guard pages 
(e.g., a malloc()ed or even static buffer) of sufficient size to let the 
kernel transition back to userspace but not for the userspace routine to run.

Any code using the old MINSIGSTKSZ (2048) will fail to run for AVX512 in the 
first place. There will be no data corruption, the crash handler will not run 
and the application will simply crash. An out-of-process core dump handler (if 
any, like systemd-coredump) will still get run.

Code using SIGSTKSZ (8192) will run for AVX512 and there'll be about 5 kB left 
to the user routine. So if it doesn't have too deep a call stack, it will not 
corrupt memory. And this code will not run for AMX state, because it's smaller 
than the XSAVE state (see below), making that the same case as the MINSIGSTKSZ 
for AVX512 above.

It's possible someone would use something between those two values, but why? I 
expect that alt-stack handlers that use a constant value use either of the two 
constants or maybe some multiple of SIGSTKSZ, but nothing in-between them.

$ /opt/intel/sde-external-8.63.0-2021-01-18-lin/sde64 -spr -- cpuid -1 -l 0xd        
CPU:
   XSAVE features (0xd/0):
      XCR0 lower 32 bits valid bit field mask = 0x000600ff
      XCR0 upper 32 bits valid bit field mask = 0x00000000
         XCR0 supported: x87 state            = true
         XCR0 supported: SSE state            = true
         XCR0 supported: AVX state            = true
         XCR0 supported: MPX BNDREGS          = true
         XCR0 supported: MPX BNDCSR           = true
         XCR0 supported: AVX-512 opmask       = true
         XCR0 supported: AVX-512 ZMM_Hi256    = true
         XCR0 supported: AVX-512 Hi16_ZMM     = true
         IA32_XSS supported: PT state         = false
         XCR0 supported: PKRU state           = false
         XCR0 supported: CET_U state          = false
         XCR0 supported: CET_S state          = false
         IA32_XSS supported: HDC state        = false
         IA32_XSS supported: UINTR state      = false
         LBR supported                        = false
         IA32_XSS supported: HWP state        = false
         XTILECFG supported                   = true
         XTILEDATA supported                  = true
      bytes required by fields in XCR0        = 0x00002b00 (11008)
      bytes required by XSAVE/XRSTOR area     = 0x00002b00 (11008)

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 17:23             ` Thiago Macieira
@ 2021-06-28 19:08               ` Peter Zijlstra
  2021-06-28 19:26                 ` Thiago Macieira
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-28 19:08 UTC (permalink / raw)
  To: Thiago Macieira
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Mon, Jun 28, 2021 at 10:23:47AM -0700, Thiago Macieira wrote:
> On Monday, 28 June 2021 10:11:16 PDT Peter Zijlstra wrote:
> > > Consequence: CPU feature checking is done *very* early, often before
> > > main().
> > For the linker based ones, yes. IIRC the ifunc() attribute is
> > particularly useful here.
> 
> Exactly. ifunc was designed for this exact purpose, and hence CPUID 
> initialisation is done very, very early.
> 
> Anyway, if the AMX state is a sticky "set once per process", it's likely going 
> to get set early for every process that *may* use AMX. And this is assuming we 
> do the library right and only set it if has AMX code at all, instead of all 
> the time.

This, AFAIU. If the ifunc() resolver finds we haz AMX it can do the
prctl() and on success pick the AMX routine.

Assuming of course, that if a program links with a library that supports
AMX, it will actually end up using it.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 19:08               ` Peter Zijlstra
@ 2021-06-28 19:26                 ` Thiago Macieira
  0 siblings, 0 replies; 35+ messages in thread
From: Thiago Macieira @ 2021-06-28 19:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: fweimer, Enrico Weigelt, metux IT consult, hjl.tools, libc-alpha,
	linux-api, linux-arch, x86

On Monday, 28 June 2021 12:08:16 PDT Peter Zijlstra wrote:
> > Anyway, if the AMX state is a sticky "set once per process", it's likely
> > going to get set early for every process that *may* use AMX. And this is
> > assuming we do the library right and only set it if it has AMX code at all,
> > instead of all the time.
> 
> This, AFAIU. If the ifunc() resolver finds we haz AMX it can do the
> prctl() and on success pick the AMX routine.
> 
> Assuming of course, that if a program links with a library that supports
> AMX, it will actually end up using it.

That's what I meant and I agree. If it has an AMX function for *anything*, it 
will do the arch_prctl() and enable the state, even if said function is never 
called.

This is the good case. The bad case is that it does the arch_prctl() before it 
sees whether there is any AMX function.

Do we expect that the dynamic loader will have this code? It currently 
searches the multiple ABI levels (up to x86-64-v4 to include AVX512) and HW 
capabilities. I can readily see AMX being one of the capabilities, if not an 
ABI level. Though it should be trivial for it to call the arch_prctl() if and 
only if it is about to load an ELF module that declares use of AMX and also 
*not* load it if the syscall fails.

$ LD_DEBUG=libs /lib64/ld-linux-x86-64.so.2 --inhibit-cache /bin/ls 
      1620:     find library=librt.so.1 [0]; searching
      1620:      search path=.....
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v4/librt.so.1
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v3/librt.so.1
      1620:       trying file=/usr/lib64/glibc-hwcaps/x86-64-v2/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/haswell/librt.so.1
      1620:       trying file=/usr/lib64/tls/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/tls/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/tls/librt.so.1
      1620:       trying file=/usr/lib64/haswell/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/haswell/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/haswell/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/haswell/librt.so.1
      1620:       trying file=/usr/lib64/avx512_1/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/avx512_1/librt.so.1
      1620:       trying file=/usr/lib64/x86_64/librt.so.1
      1620:       trying file=/usr/lib64/librt.so.1

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 13:20     ` Peter Zijlstra
@ 2021-06-30 12:50       ` Enrico Weigelt, metux IT consult
  2021-06-30 15:36         ` Thiago Macieira
  0 siblings, 1 reply; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-30 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thiago Macieira, fweimer, hjl.tools, libc-alpha, linux-api,
	linux-arch, x86

On 28.06.21 15:20, Peter Zijlstra wrote:

>> And one point that immediately jumps into my mind (w/o looking deeper
>> into it): it introduces completely new registers - do we now need extra
>> code for tasks switching etc ?
> 
> No, but because it's register state and part of XSAVE, it has immediate
> impact in ABI. In particular, the signal stack layout includes XSAVE (as
> does ptrace()).

OMGs, I've already suspected such sickness. I don't even dare thinking
about consequences for compilers and library ABIs.

Does anyone here know why they designed this as inline operations ? This
thing seems to be pretty much what typical TPUs are doing (or a subset
of it). Why not just add a TPU next to the CPU on the same chip ?

We already have the same w/ GPUs, and I guess nobody seriously wants to
put GPU functionality directly into CPU.

> At the same time, 'legacy' applications (up until _very_ recently) had a
> minimum signal stack size of 2K, which is already violated by the
> addition of AVX512 (there's actual breakage due to that).

grmpf!

> Adding the insane AMX state (8k+) into that is a complete trainwreck
> waiting to happen. Not to mention that having !INIT AMX state has direct
> consequences for P-state selection and thus performance.

Uh, are those new registers retained in certain sleep states or do they
need to be saved somewhere ?

> For these reasons, us OS folks, will mandate you get to do a prctl() to
> request/release AMX (and we get to say: no). If you use AMX without
> this, the instruction will fault (because not set in XCR0) and we'll
> SIGBUS or something.
> 
> Userspace will have to do something like:
> 
>   - check CPUID, if !AMX -> fail
>   - issue prctl(), if error -> fail
>   - issue XGETBV and check the AMX bit is set, if not -> fail

Can't we do this just by a prctl() call ?
IOW: ask the kernel, who's gonna say yes or no.

Are there any situations where kernel says yes, but process still can't
use it ? Why so ?

>   - request the signal stack size / spawn threads

Signal stack is separate from the usual stack, right ?
Why can't this all be done in one shot ?

>   - use AMX
> 
> Spawning threads prior to enabling AMX will result in using the wrong
> signal stack size and result in malfunction, you get to keep the pieces.

No way of adjusting this once the threads are running ?
Or could we even do that on per-thread basis ?

A thread here always has a corresponding kernel task, correct ?

--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-28 15:08     ` Thiago Macieira
  2021-06-28 15:27       ` Peter Zijlstra
@ 2021-06-30 14:32       ` Enrico Weigelt, metux IT consult
  2021-06-30 14:34         ` Florian Weimer
  2021-06-30 15:29         ` Thiago Macieira
  1 sibling, 2 replies; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-30 14:32 UTC (permalink / raw)
  To: Thiago Macieira, fweimer
  Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 28.06.21 17:08, Thiago Macieira wrote:

>> hmm, maybe some libcpuid ?
> 
> Indeed. I'm querying inside Intel to see if I can get buy-in to create such a
> library.

What does "buy-in" mean in that context ? Some other departement ? Some
external developers ?

I tend to believe that those things should better be done by some
independent party, maybe GNU, FSF, etc, and cpu vendors should just
sponsor this work and provide necessary specs.

>> Since we're talking about GNU libc here, binary-only stuff is probably
>> out of scope here. OTOH, using different libc versions in those special
>> cases isn't such a big deal.
> 
> Shipping a libc is not trivial, either technically or due to licensing
> requirements. Most applications want to link against whatever libc the system
> already provides, if that's possible.

Shipping precompiled binaries and linking against system libraries is
always a risky game. The cleanest approach here IMHO would be building
packages for various distros (means: using their toolchains / libs).
This actually isn't as work intensive as it might sound - I'm doing this
all the day and have a bunch of helpful tools for that.

OTOH, if one really needs to be independent of distros, one should build
a complete nano-distro, where everything's installed under certain
prefix, including libc. Isn't a big deal at all - we have plenty tools
for that, daily practise in embedded world. The only difference would be
tweaking the ld scripts to set a different ld.so path.

Licensing with glibc also isn't a serious problem here. All you need to
do is be compliant with the LGPL. In short: publish all your patches to
glibc, offer the license and link dynamically. Already done that a
thousand times.

If there really is a serious demand for that, and you've got clients who
are really interested in doing that properly (a lot of companies out
there just duck and cover, and don't actually wanna talk about that :(),
let's sit together and create some consulting material on that.

>> Actually, you should talk to the compiler folks much more early, at the
>> point where you know how those features look like.
> 
> We do, but it's not enough.
> 
> GCC releases once a year, so it's easy to miss the feature freeze. 

Wait a minute ... how long does it take from the architectural design,
until the real silicon is out in the field ? I would be very surprised
if the whole process is done in a much shorter time frame.

Note: by "much more early", I meant already at the point where the spec
of the new feature exists, at least on paper.

 > Then there are Linux distros that do LTS every 2 years or so.

Why don't the few actually affected parties just upgrade their compiler
on their build machines when needed ?

Come on, on other languages like rust (often also w/ go), you're
expected to upgrade your compiler almost on weekly basis, or you can't
build the current master branches of many projects anymore :p

> Worse, those two are
> usually out of phase. For example, if you're using the current Ubuntu LTS
> today (almost July 2021), you're using 20.04, which was released one month
> before the GCC 10 release. So you're using GCC 9, released May 2019, which
> means its features were frozen on December 2018. That's an incredibly long
> lead time.

Just install the backport ?

And if the backport hasn't been done at that point in time yet, just do
that ?

>> For using certain new CPU specific features, the need for a compiler
>> upgrade really should be no excuse. And at least for vast majority of
>> cases, a proper compiler could do it much better than the average
>> programmer.
> 
> To compile the software that uses those instructions, undoubtedly. But what if
> I did that for you and you could simply download the binary for the library
> and/or plugins such that you could slot into your existing systems and CI?
> This could make a difference between adoption or not.

For me, it wouldn't, at all. I never download binaries from untrusted
sources. (except for forensic analysis).

And putting my build engineering consultant hat on (which actually is
huge part of my business), I'd clearly say NO to precompiled binaries,
unless there really is no other way - very rare cases, where one usually
has a lot of other trouble, too.

Such precompiled binaries (where you cannot control the whole process)
are dangerous in many ways, for security, as well as safety, as well
economically.

>> Uh, that's really ancient. Nobody can seriously expect modern features
>> on such an ancient distro. If people really insist spending so much
>> money for running such old code, instead of just doing a dist upgrade,
>> then I can only reply with "not our problem".
> Yes and no.
> 
> Red Hat has been incredibly successful in backporting kernel features to the
> old 3.10 that came with RHEL 7. Whether they will do that for AMX state saving
> and the system call that we're discussing here, I can't say. AFAIU, they did
> backport the AVX512 state-saving to that 3.10, so they may.

Indeed they did. No idea why they put so many resources into that,
instead of fixing their upgrade mechanisms once and for all, so a dist
upgrade is as simple and robust as e.g. on Debian. Or at least add newer
kernels to older distro bases - extra repos really dont hurt.

 From my operating perspective, I wouldn't actually call RHEL a very
professional distro, many simple things are just very complicated and
weird ... really no idea what these guys are doing there. Never could
find any serious technical explanation, must be some purely political
or ideological thing.

BUT: we're talking about about brand new silicon here. Why should
anybody - who really needs these new features - install such an ancient
OS on a brand new machine ?

 From my experience w/ larger clients, especially in public service, the
only reason is their own extremely lazy and overcomplicated bureaucratic
processes (problems that they've created on their own - the excuse that
this is due to regulatory obligations is just proven completely wrong;
I'm perfectly capable of setting up formalized management processes
that are much more agile and still conform to the usual regulations like
BSI, GDPO, 62304, etc, etc, and actually make them work as intended :p)
But exactly those organisations wouldn't be able to even buy this new
silicon anytime soon, since their own processes are just too slow in
updating their internal shopping basket.

Oh, and another interesting point: those organisations usually run all
their linux workloads in VMs (mostly ESXI, sometimes hyperv). Is AMX
already supported in these constellations ?

At that point I wonder whether the actual demand for this use case -
brand new silicon features && ancient distros && out-of-distro built
precompiled binaries && precompiled-binaries (not auditable) permitted -
is actually a sufficiently frequent use case that's worth taking care of
in such a general way at all ?

[ not talking about very specific scenarios like CERN, who're building
special machinery on their own ]

> Even if they don't, the *software* that people deploy may be the same build
> for RHEL 7 and for a modern distro that will have a 5.14 kernel. 

Now we're getting to the vital point: trying to make "universal"
binaries for various different distros. This is something I've strictly
advised against for 25 years, because with that you're putting
yourself into *a lot* of trouble (ABI compatibility between arbitrary
distros or even various distro releases always had been pretty much a
myth, only works for some specific cases). Just don't do it, unless you
*really* don't have any other chance.

What one could do is taking the distro's userland completely out of the
way by not using a single piece of it. Once upon a time, glibc could be
statically linked easily - not so easy anymore. But with a few bits of
patching you can have an entirely separate instance. (basically tweak
install paths, including the ldstub). Or pick another libc that supports
this better. Most of the other userland libs are quite easy to link
statically (yes, there're unfriendly exceptions) or at least can be
easily installed in a different location.

Now the questions of the build process for that. I'd recommend using
ptxdist for that. We have to twist a few knobs, but creating a
generic toolkit (some /opt/foo-install counterpart of DistroKit)
shouldn't take much more than 4..6 weeks, and the problem is solved
once and for all. Anybody can then just put in his own SW, fire the
build and gets a ready to use image/tarball. Pretty much like we're
doing it in embedded world for aeons.

 > That software
 > may have non-AVX, AVX2, AVX512 and AMX-specific code paths and would
 > do runtime detection of which one is best to use.

Yes, nothing new, not unusual. All the SW needs is a way to detect
whether feature X is there or not and then pick the right code paths.

> So my point is: this shouldn't be in glibc because the glibc will not have the
> new system call wrappers or TLS fields.

Yes, I'm fully on your side here. Glibc already is overloaded with too
much of those kind of things that shouldn't belong in there. Actually,
even stuff like DNS resolving IMHO doesn't belong in libc.

>> What we SW engineers need is an easy and fast method to act depending on
>> whether some CPU supports some feature (eg. a new opcode). Things like
>> cpuinfo are only a tiny piece of that. What we could really use is a
>> conditional jump/call based on whether feature X is supported - without
>> any kernel intervention. Then the machine code could be easily layed out
>> to support both cases with our without some feature X. Alternatively we
>> could have a fast trapping in useland - hw generated call already would
>> be a big help.
> 
> That's what cpuid is for. With GCC function multi-versioning or equivalent
> manually-rolled solutions, you can get exactly what you're asking for.

Not quite what I've been proposing.

My proposal would be a conditional jump opcode that directly checks for
specific features. If this is well designed, I believe that can be
resolved by the cpu's internal prefetcher unit. But for that we'd also
need some extra task status bit so the cpu knows it is enabled for the
current task.

Another approach (when we'd design a new cpu) would be a special memory
region for opcode emulation call vectors with enough room for dozens
of potentially upcoming new opcodes. If cpu encounters an unsupported
op, it jumps there. A bit similar to IRQs or exceptions, but designed in
a way that emulation can be implemented entirely in userspace, just like
if the compiler would have put an jump instead of the new op there.
With such an instruction architecture, those kind of feature upgrades
would be much much easier, w/o even touching the kernel.

This wasn't invented by me - ancient mainframes from the '70s have
prior art.

>> If we had something in that direction, we wouldn't have to have this
>> kind discussion here anymore - it would be entirely up to compiler and
>> library folks, no need for any kernel support at all.
> 
> For most features, there isn't. You don't see us discussing
> AVX512VP2INTERSECT, for example. This discussion only exists because AMX
> requires more state to be saved during context switches and signal delivery.

But over all these years, some new registers have been introduced.
I fail to imagine how context switches can be done properly w/o also
saving/restoring such new registers.

Actually I claim that (user-accessible) registers are the problem in
the first place. The good old Burroughs mainframes didn't have that problem,
since their IA doesn't even have registers - it all operates on a stack.
The hardware (or microcode) automatically caches the top-n stack elements
in internal registers, so it is as fast as explicit registers, but the
code gets much simpler and more robust (they also have fancy things like
typed memory, etc., but that's another story).

>> And one point that immediately jumps into my mind (w/o looking deeper
>> into it): it introduces completely new registers - do we now need extra
>> code for tasks switching etc ?
> 
> Yes, this is the crux of this discussion.

Yes, and that's the reason why I call that a misdesign. It shouldn't
have been a big deal to do it w/o new registers - we know from the
'70s that it's possible.


Finally, as this stuff changes the ABI completely, perhaps we should
treat it directly that way: call it a new architecture that has its
own ID in the ELF header. That way we'd have a clear border line and
wouldn't need to care about a dozen corner cases none of us has on
the radar yet.


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 14:32       ` Enrico Weigelt, metux IT consult
@ 2021-06-30 14:34         ` Florian Weimer
  2021-06-30 15:16           ` Enrico Weigelt, metux IT consult
  2021-06-30 15:29         ` Thiago Macieira
  1 sibling, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-06-30 14:34 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> OTOH, if one really needs to be independent of distros, one should build
> a complete nano-distro, where everything's installed under a certain
> prefix, including libc. Isn't a big deal at all - we have plenty of tools
> for that, daily practice in the embedded world. The only difference would be
> tweaking the ld scripts to set a different ld.so path.

It breaks integration with system-wide settings, such as user/group
databases, host name lookup, and cryptographic policies.  In many
environments, that is not really an option.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 14:34         ` Florian Weimer
@ 2021-06-30 15:16           ` Enrico Weigelt, metux IT consult
  2021-06-30 15:38             ` Florian Weimer
  0 siblings, 1 reply; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-30 15:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 30.06.21 16:34, Florian Weimer wrote:

> It breaks integration with system-wide settings, such as user/group
> databases, host name lookup, and cryptographic policies.  In many
> environments, that is not really an option.

Not necessarily; these can still be applied (and fairly simply).
You'd actually have to twist extra knobs if you wanted those weird
things to happen.

The only thing that won't work easily is when the operator forces some
custom libraries to be loaded arbitrarily into all processes. Yes,
somebody could write his own nss plugins, but that's exactly the kind
of audience that does NOT just use those (especially old) binary-only
distros. In over 20 years, inside dozens of corporations, I've seen
that exactly once. And it was me doing it.

I actually wonder what kind of binary-only application that would be
that's actually affected by that problem and actually used in the
field that way.

Do you have an actual practical (not theoretical) example ?

By the way: today's method of choice for delivering binary-only
software is containers (and I'd even count things like Steam into
that category).


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 14:32       ` Enrico Weigelt, metux IT consult
  2021-06-30 14:34         ` Florian Weimer
@ 2021-06-30 15:29         ` Thiago Macieira
  2021-07-01 11:57           ` Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 35+ messages in thread
From: Thiago Macieira @ 2021-06-30 15:29 UTC (permalink / raw)
  To: fweimer, Enrico Weigelt, metux IT consult
  Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Wednesday, 30 June 2021 07:32:29 PDT Enrico Weigelt, metux IT consult 
wrote:
> What does "buy-in" mean in that context ? Some other departement ? Some
> external developers ?
> 
> I tend to believe that those things should better be done by some
> independent party, maybe GNU, FSF, etc, and cpu vendors should just
> sponsor this work and provide necessary specs.

For me specifically, I need to identify some SW dept. that would take on the 
responsibility for it long-term. Abandonware would serve no one.

I wouldn't mind if this were a collaborative project under the auspices of 
freedesktop.org or similar, so long as it is cross-platform to at least macOS, 
FreeBSD and Windows. But given that it is in Intel's interest for this 
library to exist and to make it easy for people to use our CPU features, it 
seemed like a natural fit for Intel. And even if it isn't an Intel-owned 
project, we probably want to be contributors.

> Shipping precompiled binaries and linking against system libraries is
> always a risky game. The cleanest approach here IMHO would be building
> packages for the various distros (means: using their toolchains / libs).
> This actually isn't as work-intensive as it might sound - I'm doing this
> all day long and have a bunch of helpful tools for it.

I understand, but that it is easier and better for 99% of the cases does 
not mean it is so for 100%. And most especially it does not guarantee that it 
will be used by everyone. For reasons real or not, precompiled binaries 
exist. Just see Google Chrome, for example.

> Licensing with glibc also isn't a serious problem here. All you need to
> do is be compliant with the LGPL. In short: publish all your patches to
> glibc, offer the license and link dynamically. Already done that a
> thousand times.

We can agree it's an additional hurdle, which will likely cause people to 
investigate a solution that doesn't require that hurdle.

> Wait a minute ... how long does it take from the architectural design,
> until the real silicon is out in the field ? I would be very surprised
> if the whole process were done in a much shorter time frame.
> 
> Note: by "much earlier", I meant already at the point where the spec
> of the new feature exists, at least on paper.

I'm not going to comment on the timing of architectural decisions. But just 
from the example I gave: in order to be ready for a late 2021 or early 2022 
launch, we'd need to have the feature's specification published and the 
patches accepted by December 2018. That's about 3 years lead time.

How many software projects (let alone mixed software and hardware) do you know 
that know 3 years ahead of time what they will need?

>  > Then there are Linux distros that do LTS every 2 years or so.
> 
> Why don't the few actually affected parties just upgrade their compiler
> on their build machines when needed ?

Have you tried?

Besides, the whole problem here is the barrier to entry. If we don't make it 
easy for them to use the new features, they won't. And I was using this as an 
argument for why precompiled binaries will exist: the interested parties will 
take the pain to upgrade the compilers and other supporting software so that 
even the build of Open Source software is the most capable one, then release 
that binary for others who haven't. This lowers the barrier to entry 
significantly.

And this is all to justify that such functionality shouldn't be part of 
glibc, where it can't be used by those precompiled binaries which, for one 
reason or another, will exist.

It should be in a small, permissively-licensed library that will often get 
statically linked into the binary in question.

> > To compile the software that uses those instructions, undoubtedly. But
> > what if I did that for you and you could simply download the binary for
> > the library and/or plugins such that you could slot into your existing
> > systems and CI? This could make a difference between adoption or not.
> 
> For me, it wouldn't, at all. I never download binaries from untrusted
> sources. (except for forensic analysis).

I understand and I am, myself, almost like you. I do have some precompiled 
binaries (aforementioned Google Chrome), but as a rule I avoid them.

But not everyone is like the two of us.

> BUT: we're talking about brand new silicon here. Why should
> anybody - who really needs these new features - install such an ancient
> OS on a brand new machine ?

I don't know. It might be for fleet homogeneity: everything has the same SW 
installed, facilitating maintenance. Just coming up with reasons.

> > Even if they don't, the *software* that people deploy may be the same
> > build
> > for RHEL 7 and for a modern distro that will have a 5.14 kernel.
> 
> Now we're getting to the vital point: trying to make "universal"
> binaries for various different distros. This is something I've been strictly
> advising against for 25 years, because with that you're putting
> yourself into *a lot* of trouble (ABI compatibility between arbitrary
> distros or even various distro releases has always been pretty much a
> myth; it only works for some specific cases). Just don't do it, unless you
> *really* don't have any other choice.

Well, that's the point, isn't it? Are we ready to call this use-case not 
valid, so it can't be used to support the argument of a solution that needs to 
be deployable to old distros?

> > So my point is: this shouldn't be in glibc because the glibc will not have
> > the new system call wrappers or TLS fields.
> 
> Yes, I'm fully on your side here. Glibc already is overloaded with too
> many of those kinds of things that shouldn't belong in there. Actually,
> even stuff like DNS resolving IMHO doesn't belong in libc.

Thanks.

(name resolving is required by POSIX to be there, so it exists in every 
system; might as well be in every libc)

> My proposal would be a conditional jump opcode that directly checks for
> specific features. If this is well designed, I believe it can be
> resolved by the cpu's internal prefetcher unit. But for that we'd also
> need some extra task status bit so the cpu knows it is enabled for the
> current task.

That's more of a "can I use this now", instead of "can I use this ever". So 
far, the answer to the two has been the same. Therefore, there has been no 
need to have the functionality that you're describing.

> > For most features, there isn't. You don't see us discussing
> > AVX512VP2INTERSECT, for example. This discussion only exists because AMX
> > requires more state to be saved during context switches and signal
> > delivery.
> But over all these years, some new registers have been introduced.
> I fail to imagine how context switches can be done properly w/o also
> saving/restoring such new registers.

There have been a few small registers and bits of state that needed to be 
saved here and there, but the biggest blocks were:

- SSE state
- AVX state
- AVX512 state
- AMX state

The first two were small enough (and long enough ago) that the discussions 
were small and aren't relevant today. The AVX512 state was added in the past 
decade, and as you've seen from this thread, it is still a sticking point - 
and that was only about 1.5 kB.

However, the vast majority of CPU features do not add new context state.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 12:50       ` Enrico Weigelt, metux IT consult
@ 2021-06-30 15:36         ` Thiago Macieira
  2021-07-01  7:35           ` Enrico Weigelt, metux IT consult
  0 siblings, 1 reply; 35+ messages in thread
From: Thiago Macieira @ 2021-06-30 15:36 UTC (permalink / raw)
  To: Peter Zijlstra, Enrico Weigelt, metux IT consult
  Cc: fweimer, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Wednesday, 30 June 2021 05:50:30 PDT Enrico Weigelt, metux IT consult 
wrote:
> > No, but because it's register state and part of XSAVE, it has immediate
> > impact in ABI. In particular, the signal stack layout includes XSAVE (as
> > does ptrace()).
> 
> OMGs, I've already suspected such sickness. I don't even dare thinking
> about the consequences for compilers and library ABIs.
> 
> Does anyone here know why they designed this as inline operations ? This
> thing seems to be pretty much what typical TPUs are doing (or a subset
> of it). Why not just add a TPU next to the CPU on the same chip ?

To be clear: this is a SW ABI. It has nothing to do with the presence or 
absence of other processing units in the system.

The moment you receive a Unix signal with SA_SIGINFO, the mcontext state needs 
to be saved somewhere. Where would you save it? Please remember that:

- signal handlers can be called at any point in the execution, including
  in the middle of malloc()
- signal handlers can longjmp out of the handler back into non-handler code
- in a multithreaded application, each thread can be handling a signal 
  simultaneously

We could have the kernel hold on to that and add a system call to extract 
it, but that's an ABI change, and I think it won't work for the longjmp case.

> > Userspace will have to do something like:
> >   - check CPUID, if !AMX -> fail
> >   - issue prctl(), if error -> fail
> >   - issue XGETBV and check the AMX bit it set, if not -> fail
> 
> Can't we do this with just a prctl() call ?
> IOW: ask the kernel, who's gonna say yes or no.

That's possible. The kernel can't enable an AMX state on a system without AMX.

> Are there any situations where kernel says yes, but process still can't
> use it ? Why so ?

Today there is no such case that I can think of.

> >   - request the signal stack size / spawn threads
> 
> Signal stack is separate from the usual stack, right ?
> Why can't this all be done in one shot ?

Yes, we're talking about the sigaltstack() call.

What is "this all" in the sentence above?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 15:16           ` Enrico Weigelt, metux IT consult
@ 2021-06-30 15:38             ` Florian Weimer
  2021-07-01  8:08               ` Enrico Weigelt, metux IT consult
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-06-30 15:38 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> On 30.06.21 16:34, Florian Weimer wrote:
>
>> It breaks integration with system-wide settings, such as user/group
>> databases, host name lookup, and cryptographic policies.  In many
>> environments, that is not really an option.
>
> Not necessarily; these can still be applied (and fairly simply).
> You'd actually have to twist extra knobs if you wanted those weird
> things to happen.

Sorry, this is just not true.  You cannot load system libraries such as
NSS modules or cryptographic libraries with a custom glibc because the
system glibc could be newer, and glibc does not provide that kind of
compatibility (only the other way round).

Thanks,
Florian


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 15:36         ` Thiago Macieira
@ 2021-07-01  7:35           ` Enrico Weigelt, metux IT consult
  0 siblings, 0 replies; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-07-01  7:35 UTC (permalink / raw)
  To: Thiago Macieira, Peter Zijlstra
  Cc: fweimer, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 30.06.21 17:36, Thiago Macieira wrote:

Hi,

>> Does anyone here know why they designed this as inline operations ? This
>> thing seems to be pretty much what typical TPUs are doing (or a subset
>> of it). Why not just adding a TPU next to the CPU on the same chip ?
> 
> To be clear: this is a SW ABI. It has nothing to do the presence or absence of
> other processing units in the system.

Well, if I'm correct, it's needed because there is some additional unit
whose state needs to be saved. And that again is necessary because this
unit is controlled directly by the usual CPU instruction stream (in
contrast to separately programmed devices like a gpu, sdma, etc).

> The moment you receive a Unix signal with SA_SIGINFO, the mcontext state needs
> to be saved somewhere. Where would you save it? Please remember that:
> 
> - signal handlers can be called at any point in the execution, including
>    in the middle of malloc()
> - signal handlers can longjmp out of the handler back into non-handler code
> - in a multithreaded application, each thread can be handling a signal
>    simultaneously

Yes, the last part seems to be the most tricky point.

If we were only talking about kernel-controlled context switches (task
switches) and sighandlers always returned to the kernel, then the kernel
could handle all that internally, w/o userland ever knowing about it. But
unfortunately that's not the case :(

>>> Userspace will have to do something like:
>>>    - check CPUID, if !AMX -> fail
>>>    - issue prctl(), if error -> fail
>>>    - issue XGETBV and check the AMX bit it set, if not -> fail
>>
>> Can't we do this with just a prctl() call ?
>> IOW: ask the kernel, who's gonna say yes or no.
> 
> That's possible. The kernel can't enable an AMX state on a system without AMX.

Good, that could at least make the API somewhat simpler.

>>>    - request the signal stack size / spawn threads
>>
>> Signal stack is separate from the usual stack, right ?
>> Why can't this all be done in one shot ?
> 
> Yes, we're talking about the sigaltstack() call.
> 
> What is "this all" in the sentence above?

Taking care of a large enough signal stack along with enabling AMX in
one shot. This might not support all kinds of uses of sigaltstack(), but
do we really need to support all of that ?

IMHO, the whole AMX issue is just for *new* software (and I haven't seen
practical use of an alternative sighandler stack for aeons), so it's not
about compatibility with existing software. Theoretically we could declare
that the combination of AMX and sigaltstack() just isn't supported. (Of
course, some combinations of using old libraries might break - but even if
old library code is reused, it's still new software.)

Maybe not a completely satisfying idea, but perhaps something that's
much easier to achieve and gets the actual problem solved.


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 15:38             ` Florian Weimer
@ 2021-07-01  8:08               ` Enrico Weigelt, metux IT consult
  2021-07-01  8:21                 ` Florian Weimer
  0 siblings, 1 reply; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-07-01  8:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 30.06.21 17:38, Florian Weimer wrote:

>> Not necessarily, these can still be applied (and fairly simple).
>> You actually have to twist more extra knobs if to wanted those weird
>> things to happen.
> 
> Sorry, this is just not true.  You cannot load system libraries such as
> NSS modules or cryptographic libraries with a custom glibc because the
> system glibc could be newer, and glibc does not provide that kind of
> compatibility (only the other way round).

Yes, such glibc "plugins" specifically need to be built for the same
glibc version.

I've already mentioned that in 25yrs I've had such a scenario - where some
operator actually wants to load *3rdparty* nss modules (ones that are *not*
included in upstream glibc) - just *once*, and that time it was myself doing
that funny stuff (and also writing that module).

And I'm repeating my previous questions: can you name some actual real
world (not hypothetical or academical) scenarios where:

somebody really needs some binary-only application &&
needs those extra modules *into that* application &&
cannot recompile these modules into the applications's prefix &&
needs AMX in that application &&
cannot just use chroot &&
cannot put it into container ?

I happen to be one exacly those folks whose better part of the daily
business is dealing with really crazy scenarios, most people don't
even dare thinking about (also dealing with horrible stuff like having
link in binary-only objects into embedded applications which had been
compiled for different ABIs, etc) - and I can only conclude that the
above kind of scenario is really, really rare, usually caused by
completely non-technical factors (e.g. beaurocrazy) and there're always
other options.


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-07-01  8:08               ` Enrico Weigelt, metux IT consult
@ 2021-07-01  8:21                 ` Florian Weimer
  2021-07-01 11:59                   ` Enrico Weigelt, metux IT consult
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-07-01  8:21 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> And I'm repeating my previous questions: can you name some actual real
> world (not hypothetical or academical) scenarios where:
>
> somebody really needs some binary-only application &&
> needs those extra modules *into that* application &&
> cannot recompile these modules into the applications's prefix &&
> needs AMX in that application &&
> cannot just use chroot &&
> cannot put it into container ?

There are no real-world scenarios yet which involve AMX, so I'm not sure
what you are after with this question.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-06-30 15:29         ` Thiago Macieira
@ 2021-07-01 11:57           ` Enrico Weigelt, metux IT consult
  0 siblings, 0 replies; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-07-01 11:57 UTC (permalink / raw)
  To: Thiago Macieira, fweimer
  Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 30.06.21 17:29, Thiago Macieira wrote:

Hi,

> For me specifically, I need to identify some SW dept. that would take on the
> responsibility for it long-term. Abandonware would serve no one.

Right. And from what I've learned in such large organisations, finding
someone who's willing to take such responsibility isn't an easy task :o

> I wouldn't mind this were a collaborative project under the auspices of
> freedesktop.org or similar, so long as it is cross-platform to at least macOS,
> FreeBSD and Windows.

hmm, this also highly depends on what that library is actually supposed
to be doing. If we're talking just about cpuid parsing (and providing some
comfortable API), that isn't a big deal at all. (Personally, I can't
contribute much on the actual backend side on Windows or macOS, but it
shouldn't be so hard to find some people here.)

But for AMX, unfortunately, that's just a tiny fraction. Actually, it
seems that this specific information is useless on its own, since we need
to explicitly ask the kernel to set things up, and it can just say no.

> But given that it is in Intel's interest for this library to exist and
> to make it easy for people to use our CPU features, it seemed like a
> natural fit for Intel. And even if it isn't an Intel-owned project, we
> probably want to be contributors.

Great :)

> I understand, but that it is easier and better for 99% of the cases does
> not mean it is so for 100%. And most especially it does not guarantee that it
> will be used by everyone. For reasons real or not, precompiled binaries
> exist. Just see Google Chrome, for example.

Yes, there is a lot of such crap out there, and Chrome is a good bad
example. But that doesn't mean that we here should ever care about it.
Let'em deal with their weird stuff on their own.

Specifically on such applications like Chrome, their official excuses
for doing this are pretty funny, some examples:

#1. "oh, it is so hard building native packages for many distros."

     No, it's not (especially not for such huge sw development companies
     like Google, who do have the right people for that). I'm doing that
     all day long. I've even got my own tools for that, freely available
     on github. This can be done 99% fully automatically.

     The actual problem is: SW like Chrome is just a monster, which is
     very hard to build from source. This problem was created by nobody
     else than the vendor.

#2: "we need our own install/update mechanism, to make it easier for
     users to install"

     No, we've had an easy install/update mechanism for 25 years. It's
     called package management. Just make your SW easy to build and
     customize, and distros take care of it - just like w/ 100s of 1000s
     of other packages. Oh, BTW, this even exists on Windows.

#3: "we need some fast update for security fixes"

     Yes, we have that: package management. See #2.

Chrome is an excellent example of doing so many things so fundamentally
wrong. (BTW: a large portion of the security problems also comes from that
crazy SW architecture - e.g. putting so many different things into one
program that could easily be done in several entirely separate ones.)
They just made simple things extremely complicated.

I could start a rant on even worse examples like NVidia or NI here, but
out of respect for my blood pressure I'll stop here :p

>> Licensing with glibc also isn't a serious problem here. All you need to
>> do is be compliant with the LGPL. In short: publish all your patches to
>> glibc, offer the license and link dynamically. Already done that a
>> thousand times.
> 
> We can agree it's an additional hurdle, which will likely cause people to
> investigate a solution that doesn't require that hurdle.

Yes, it is *some* hurdle, but I doubt that it is a show stopper. When
you're already doing Linux SW development, you almost certainly have to
cope with such things anyway (unless you really don't use anything
of the existing ecosystem - which, in most cases, is very bad
design in the first place).

The *extra* hurdle here in that case is:

*if* you patch glibc, you have to publish the patches; if you don't
patch it, there's nothing to do.

> I'm not going to comment on the timing of architectural decisions. But just
> from the example I gave: in order to be ready for a late 2021 or early 2022
> launch, we'd need to have the feature's specification published and the
> patches accepted by December 2018. That's about 3 years lead time.

Sorry, but I still don't get why that 3-year window. We're talking about
entirely new features, where SW actually *using* them doesn't even exist.
I wonder where you get the idea that *old* SW shall magically
support something entirely new, w/o being rewritten.

Since much of the discussion here was about large and bureaucratic
organisations (which I'm coping with on a daily basis), with lots of formal
processes - those who still run the ancient distros we've been talking
about:

For those, if some software already in use by them comes around with a
new release that actually uses that new feature, it's not a minor patch
anymore, but a major jump that implies migration or a new installation.
They have to go through the whole evaluation, testing, qualification and
integration process from the start. From a process perspective, it's
basically a new product. And most likely, it's not even a standard
product (one that some department could just order centrally, like a
laptop or a standard VM or some piece of storage). It will take quite
some time before they're even able to order this new hardware.

There are basically two chances for getting such installations into those
organisations (remember: exactly the users of such old distros):

a) bring the new product into the organisation-wide standard process.
    Usually takes *at least* a whole year (often much longer). And also
    add the new hardware (complete machines that already come with the
    new CPU types) to the standard catalog (similar time frame).

    If the SW is run in the data center, which almost certainly means
    running in a VM (ESXi or HyperV), then this VM infrastructure needs
    to fully support this, too. Add another year.

b) the whole installation is completely taken out of standard processes
    (often called an "appliance"). Then all the factors that led to
    sticking to the old distro version in the first place vanish into
    thin air. (The whole reason for choosing those distros mostly comes
    from the management processes, not actually technical reasons.)
    IT-ops managers usually don't like those out-of-process "appliances"
    very much, but they're quite common.

> How many software projects (let alone mixed software and hardware) do you know
> that know 3 years ahead of time what they will need?

In the embedded world (a large portion of my business), that's the usual
case. Once you go into regulated fields like automotive or medical, that's
even required by the usual processes, and the time frames are much
longer.

But we'd been talking about the timeline of CPU feature development and
corresponding SW support. I haven't done actual CPU development myself (my
self-soldered transputer experiment from the mid '90s wouldn't count
here ;-)), but I'd suspect that from the point in time where the CPU
designers create a new feature on the sketchboard and decide how the IA
should look, until the point where the public can actually buy
standard machines with these new chips, several years pass. Let's
just randomly assume three years. Now if these CPU designers published
the new IA spec at that point, we'd have three years until
we might see the first actual machines in the field (as outlined above,
wide deployment in large organisations will take even much longer).

>>   > Then there are Linux distros that do LTS every 2 years or so.
>>
>> Why don't the few actually affected parties just upgrade their compiler
>> on their build machines when needed ?
> 
> Have you tried?

Actually, yes.

And when setting up CI infrastructures, I usually let them build
the SW with various toolchain versions and combinations. Once you've got
the basic infrastructure set up, adding more combinations is just a
matter of more iron.

> Besides, the whole problem here is barrier of entry. If we don't make it easy
> for them to use the new features, they won't. 

Let's look at it from a completely different angle:

In order to actually make use of those new features, several things have
to happen (with different stakeholders):

1. education: developers need to know how to write code that
    uses these features in a good way.
2. tech: OS and tooling support
3. customers need to be able to actually buy and deploy the new HW,
    as well as new SW that actually uses these features
    (internal purchasing, qualification processes, support, ...)

The main point we're discussing right now is #2. In this discussion
we've now learned:

1. it certainly needs new versions of OS components and tooling,
    and SW vendors need to build and deliver new SW versions for that.
2. many folks out there have problems with upgrading these things
3. the above problems aren't specific to CPU features; we encounter
    them in a wide range of other scenarios, too.

So, why don't we focus on solving these problems once and for all?

Let's try to give customers the necessary means so that they just don't
need to stick with old compilers, libraries, etc. anymore.

An important part of that story is building and shipping software for
various distros. We have the tools and methods for doing the vast
majority of that fully automatically. Basically it needs: putting the
tools together, lots of iron, and training + consulting. Been there a
lot of times.

(I actually started creating an out-of-the-box solution for that - I'm
just lacking the marketing to make it economically feasible for me.)

> And I was using this as an
> argument for why precompiled binaries will exist: the interested parties will
> take the pain to upgrade the compilers and other supporting software so that
> the build even of Open Source software is the most capable one, then release
> that binary for others who haven't. This lowers the barrier of entry
> significantly.

As mentioned, the root cause is that upgrading tooling and building
certain software isn't easy, which is also one of the main reasons
why certain SW isn't up to date in many distros. Let's solve the root
problem instead of just cleaning up its fallout again and again.

> And this is all to justify that such a functionality shouldn't be part of
> glibc, where it can't be used by those precompiled binaries which, for one
> reason or another, will exist.

I fully agree that those things don't belong in glibc (along with
many other things).

> It should be in a small, permissively-licensed library that will often get
> statically linked into the binary in question.

ACK.

>> For me, it wouldn't, at all. I never download binaries from untrusted
>> sources. (except for forensic analysis).
> 
> I understand and I am, myself, almost like you. I do have some precompiled
> binaries (aforementioned Google Chrome), but as a rule I avoid them.
> 
> But not everyone is like the two of us.

The really interesting question is: why do people want to do that at all?

IMHO, the most plausible reason is: they can't get that SW (or a release
thereof) from their distro and also can't build it on their own.

So, that's the main problem here. We should focus on eliminating the
need for such out-of-distro precompiled binaries (even for proprietary
software - make sure it is built for the distros, using the
distro's build/deployment toolchain).

>> BUT: we're talking about about brand new silicon here. Why should
>> anybody - who really needs these new features - install such an ancient
>> OS on a brand new machine ?
> 
> I don't know. It might be for fleet homogeneity: everything has the same SW
> installed, facilitating maintenance. Just coming up with reasons.

Yes, and where's that actually coming from? Bureaucratic processes that
have reached a state where they exist mostly for their own sake. Not a
technical problem. Nothing anybody in the role of software engineer
should care about - the best thing one can do at that point is to ignore
the bureaucracy and focus on technological excellence.

I'm not saying we IT folks / SW engineers shouldn't take care of
management processes - on the contrary, we should actively work on
improving them. But we can only do that when we're asked to do so and
put in control of these processes - which, unfortunately, rarely
happens.

>> Now we're getting to the vital point: trying to make "universal"
>> binaries for various different distros. This is something I've been
>> strictly advising against for 25 years, because with that you're
>> putting yourself into *a lot* of trouble (ABI compatibility between
>> arbitrary distros or even various distro releases has always been
>> pretty much a myth; it only works in some specific cases). Just don't
>> do it, unless you *really* don't have any other choice.
> 
> Well, that's the point, isn't it? Are we ready to call this use-case not
> valid, so it can't be used to support the argument of a solution that needs to
> be deployable to old distros?

Almost. The solution is simple: software needs to be built for the
distro, using the distro's build/packaging toolchain. And if some
particular distro is still missing something (e.g. a more recent
compiler), just add it.

It's actually not hard, as soon as we get out of our in-house bubbles
and act as a community, actively cooperating with each other. Yet
another problem that's not technical but social.

>> My proposal would be a conditional jump opcode that directly checks
>> for specific features. If this is well designed, I believe it can be
>> resolved by the CPU's internal prefetcher unit. But for that we'd also
>> need some extra task status bit so the CPU knows it is enabled for the
>> current task.
> 
> That's more of a "can I use this now", instead of "can I use this ever". So
> far, the answer to the two has been the same. Therefore, there has been no
> need to have the functionality that you're describing.

Not quite. There's a wide range between "now" and "ever". Yes, AMX is a
much harder problem since extra state needs to be stored on context
switches. If I were designing a completely new CPU, I'd make sure
that those things don't cause any trouble at all (as Burroughs already
did in the '70s).

But yes, that's quite academic, unless we had a chance to actually
discuss with the CPU designers and influence their decisions.

>> But over all these years, some new registers have been introduced.
>> I fail to imagine how context switches can be done properly w/o also
>> saving/restoring such new registers.
> 
> There have been a few small registers and state that need to be saved here and
> there, but the biggest blocks were:
> 
> - SSE state
> - AVX state
> - AVX512 state
> - AMX state
> 
> The first two were small enough (and long enough ago) that the discussions
> were small and aren't relevant today. The AVX512 state was added in the past
> decade. And as you've seen from this thread, that is still a sticky point, and
> that was only about 1.5 kB.

Yes, so we see the problem isn't new. I feel very sad about the
CPU designers not just repeating those mistakes but now making even
worse ones. And they don't even discuss with us SW engineers first :(

If I had been asked by them, I'd basically have recommended a dedicated
TPU (along with a very long list of other points).


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: x86 CPU features detection for applications (and AMX)
  2021-07-01  8:21                 ` Florian Weimer
@ 2021-07-01 11:59                   ` Enrico Weigelt, metux IT consult
  2021-07-06 12:57                     ` Florian Weimer
  0 siblings, 1 reply; 35+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-07-01 11:59 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

On 01.07.21 10:21, Florian Weimer wrote:
> * Enrico Weigelt:
> 
>> And I'm repeating my previous questions: can you name some actual real
>> world (not hypothetical or academical) scenarios where:
>>
>> somebody really needs some binary-only application &&
>> needs those extra modules *into that* application &&
>> cannot recompile these modules into the applications's prefix &&
>> needs AMX in that application &&
>> cannot just use chroot &&
>> cannot put it into container ?
> 
> There are no real-world scenarios yet which involve AMX, so I'm not sure
> what you are after with this question.

Okay, let's take AMX out of the equation (until it actually arrives
in the field). What does it look like then?


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-01 11:59                   ` Enrico Weigelt, metux IT consult
@ 2021-07-06 12:57                     ` Florian Weimer
  0 siblings, 0 replies; 35+ messages in thread
From: Florian Weimer @ 2021-07-06 12:57 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult
  Cc: Thiago Macieira, hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Enrico Weigelt:

> On 01.07.21 10:21, Florian Weimer wrote:
>> * Enrico Weigelt:
>> 
>>> And I'm repeating my previous questions: can you name some actual real
>>> world (not hypothetical or academical) scenarios where:
>>>
>>> somebody really needs some binary-only application &&
>>> needs those extra modules *into that* application &&
>>> cannot recompile these modules into the applications's prefix &&
>>> needs AMX in that application &&
>>> cannot just use chroot &&
>>> cannot put it into container ?
>> There are no real-world scenarios yet which involve AMX, so I'm not
>> sure
>> what you are after with this question.
>
> Okay, let's take AMX out of the equation (until it actually arrives
> in the field). How does it look like then ?

We have customers that want to use name service switch (NSS) plugins in
proprietary software and who do not want to distribute the (GNU)
toolchain with their application.  The latter excludes
chroot/containers.  Some applications more or less have to run directly
on the host (e.g., if they have some system monitoring aspect).

Thanks,
Florian



* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:32 ` Dave Hansen
@ 2021-07-08  6:05   ` Florian Weimer
  2021-07-08 14:19     ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-07-08  6:05 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> On 6/23/21 8:04 AM, Florian Weimer wrote:
>> https://www.gnu.org/software/libc/manual/html_node/X86.html
> ...
>> Previously kernel developers have expressed dismay that we didn't
>> coordinate the interface with them.  This is why I want to raise this now.
>
> This looks basically like someone dumped a bunch of CPUID bit values and
> exposed them to applications without considering whether applications
> would ever need them.  For instance, why would an app ever care about:
>
> 	PKS – Protection keys for supervisor-mode pages.
>
> And how could glibc ever give applications accurate information about
> whether PKS "is supported by the operating system"?  It just plain
> doesn't know, or at least only knows from a really weak ABI like
> /proc/cpuinfo.

glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
have unknown semantics (to glibc).

They are still exposed via HAS_CPU_FEATURE.

I argued against HAS_CPU_FEATURE because the mere presence of this
interface will introduce application bugs: applications really
must use CPU_FEATURE_USABLE instead.

I wanted to go with a curated set of bits, but we couldn't get consensus
around that.  Curiously, the present interface can expose changing CPU
state (if the kernel updates some fixed memory region accordingly), my
preferred interface would not have supported that.

> It also doesn't seem to tell applications what they want which is, "can
> I, the application, *use* this feature?"

CPU_FEATURE_USABLE is supposed to be that interface.

Thanks,
Florian



* Re: x86 CPU features detection for applications (and AMX)
  2021-06-25 23:31 ` Thiago Macieira
  2021-06-28 12:40   ` Enrico Weigelt, metux IT consult
@ 2021-07-08  7:08   ` Florian Weimer
  2021-07-08 15:13     ` Thiago Macieira
  1 sibling, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-07-08  7:08 UTC (permalink / raw)
  To: Thiago Macieira; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

* Thiago Macieira:

> On 23 Jun 2021 17:04:27 +0200, Florian Weimer wrote:
>> We have an interface in glibc to query CPU features:
>> X86-specific Facilities
>> <https://www.gnu.org/software/libc/manual/html_node/X86.html>
>>
>> CPU_FEATURE_USABLE all preconditions for a feature are met,
>> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
>> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
>> enabling the relevant bit (so it cannot pass through any unknown bits).
>
> It's a nice initiative, but it doesn't help library and applications that need 
> to be either cross-platform or backwards compatible.
>
> The first problem is the cross-platformness need. Because we library and 
> application developers need to support other OSes, we'll need to deploy our 
> own CPUID-based detection. It's far better to use common code everywhere, 
> where one developer working on Linux can fix bugs in FreeBSD, macOS or Windows 
> or any of the permutations. Every platform-specific deviation adds to 
> maintenance requirements and is a source of potential latent bugs, now or in 
> the future due to refactoring. That is why doing everything in the form of 
> instructions would be far better and easier, rather than system calls.

I must say this is a rather application-specific view.  Sure, you get
consistency within the application across different targets, but for
those who work on multiple applications (but perhaps on a single
distribution/OS), things are very inconsistent.

And the reason why I started this is that CPUID-based feature detection
is dead anyway (assuming the kernel developers do not implement lazy
initialization of the AMX state).  CPUID (and ancillary data such as
XCR0) will say that AMX support is there, but it will not work unless
some (yet to be decided) steps are executed by the userspace thread.

While I consider the CPUID-based model a success (and the cross-OS
consistency may have contributed to that), its days seem to be over.
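(For reference, the dying model is the classic three-step check sketched
below - feature bit, OSXSAVE, XCR0 state bits.  This is an illustrative
sketch for GCC/Clang on x86-64, not any particular library's code.  AMX
breaks exactly this pattern: the bits can all be set while use still
requires an explicit per-thread request.)

```c
/* Classic user-space feature detection (the pre-AMX model).  */
#include <cpuid.h>

static unsigned long long
read_xcr0 (void)
{
  unsigned int eax, edx;
  __asm__ ("xgetbv" : "=a" (eax), "=d" (edx) : "c" (0));
  return ((unsigned long long) edx << 32) | eax;
}

/* Returns 1 if AVX2 is present *and* the OS context-switches the
   required register state; 0 otherwise.  */
int
avx2_usable (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* Leaf 1, ECX bit 27: OSXSAVE, i.e. XGETBV usable from user space.  */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
    return 0;
  /* XCR0 bits 1 and 2: the OS saves/restores SSE and AVX state.  */
  if ((read_xcr0 () & 0x6) != 0x6)
    return 0;
  /* Leaf 7, subleaf 0, EBX bit 5: AVX2 in silicon.  */
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return (ebx & (1u << 5)) != 0;
}
```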

> [Unless said system calls were standardised and actually
> deployed. Making this a cross-platform library that is not part of
> libc would be a major step in that direction]

That won't help with AMX, as far as I can tell.

Thanks,
Florian



* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08  6:05   ` Florian Weimer
@ 2021-07-08 14:19     ` Dave Hansen
  2021-07-08 14:31       ` Florian Weimer
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2021-07-08 14:19 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

On 7/7/21 11:05 PM, Florian Weimer wrote:
>> This looks basically like someone dumped a bunch of CPUID bit values and
>> exposed them to applications without considering whether applications
>> would ever need them.  For instance, why would an app ever care about:
>>
>> 	PKS – Protection keys for supervisor-mode pages.
>>
>> And how could glibc ever give applications accurate information about
>> whether PKS "is supported by the operating system"?  It just plain
>> doesn't know, or at least only knows from a really weak ABI like
>> /proc/cpuinfo.
> glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
> have unknown semantics (to glibc).

OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS
supported in the operating system, I'll get false from an interface that
claims to be:

> This macro returns a nonzero value (true) if the processor has the
> feature name and the feature is supported by the operating system.

The interface just seems buggy by *design*.


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:19     ` Dave Hansen
@ 2021-07-08 14:31       ` Florian Weimer
  2021-07-08 14:36         ` Dave Hansen
  0 siblings, 1 reply; 35+ messages in thread
From: Florian Weimer @ 2021-07-08 14:31 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> On 7/7/21 11:05 PM, Florian Weimer wrote:
>>> This looks basically like someone dumped a bunch of CPUID bit values and
>>> exposed them to applications without considering whether applications
>>> would ever need them.  For instance, why would an app ever care about:
>>>
>>> 	PKS – Protection keys for supervisor-mode pages.
>>>
>>> And how could glibc ever give applications accurate information about
>>> whether PKS "is supported by the operating system"?  It just plain
>>> doesn't know, or at least only knows from a really weak ABI like
>>> /proc/cpuinfo.
>> glibc is expected to mask these bits for CPU_FEATURE_USABLE because they
>> have unknown semantics (to glibc).
>
> OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS
> supported in the operating system, I'll get false from an interface that
> claims to be:
>
>> This macro returns a nonzero value (true) if the processor has the
>> feature name and the feature is supported by the operating system.
>
> The interface just seems buggy by *design*.

Yes, but that is largely a documentation matter.  We should have said
something about “userspace” there, and that the bit needs to be known to
glibc.  There is another exception: FSGSBASE, and that's a real bug we
need to fix (it has to go through AT_HWCAP2).

If we want to avoid that, we need to go down the road of a curated set
of CPUID bits, where a bit only exists if we have taught glibc its
semantics.  You still might get a false negative by running against an
older glibc than the application was built for.  (We are not going to
force applications that e.g. look for FSGSBASE only run with a glibc
that is at least of that version which implemented semantics for the
FSGSBASE bit.)
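(For illustration, the AT_HWCAP2 path that a correct FSGSBASE check has
to take on Linux/x86 looks like this; the fallback define mirrors the
kernel's uapi header:)

```c
/* FSGSBASE must be checked via AT_HWCAP2, not CPUID, because the
   kernel gates user-space use of the instructions.  Linux/x86 only.  */
#include <sys/auxv.h>

#ifndef HWCAP2_FSGSBASE
/* Value from the kernel's arch/x86/include/uapi/asm/hwcap2.h.  */
# define HWCAP2_FSGSBASE (1 << 1)
#endif

int
fsgsbase_usable (void)
{
  return (getauxval (AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
}
```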

Thanks,
Florian



* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:31       ` Florian Weimer
@ 2021-07-08 14:36         ` Dave Hansen
  2021-07-08 14:41           ` Florian Weimer
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2021-07-08 14:36 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

On 7/8/21 7:31 AM, Florian Weimer wrote:
>> OK, so if I call CPU_FEATURE_USABLE(PKS) on a system *WITH* PKS
>> supported in the operating system, I'll get false from an interface that
>> claims to be:
>>
>>> This macro returns a nonzero value (true) if the processor has the
>>> feature name and the feature is supported by the operating system.
>> The interface just seems buggy by *design*.
> Yes, but that is largely a documentation matter.  We should have said
> something about “userspace” there, and that the bit needs to be known to
> glibc.  There is another exception: FSGSBASE, and that's a real bug we
> need to fix (it has to go through AT_HWCAP2).
> 
> If we want to avoid that, we need to go down the road of a curated set
> of CPUID bits, where a bit only exists if we have taught glibc its
> semantics.  You still might get a false negative by running against an
> older glibc than the application was built for.  (We are not going to
> force applications that e.g. look for FSGSBASE only run with a glibc
> that is at least of that version which implemented semantics for the
> FSGSBASE bit.)

That's kinda my whole point.

These *MUST* be curated to be meaningful.  Right now, someone just
dumped a set of CPUID bits into the documentation.

The interface really needs *three* modes:

1. Yes, the CPU/OS supports this feature
2. No, the CPU/OS doesn't support this feature
3. Hell if I know, never heard of this feature
	
The interface really conflates 2 and 3.  To me, that makes it
fundamentally flawed.
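(As a purely hypothetical sketch - all names here are invented, this is
not glibc's interface - the tristate would look something like:)

```c
/* Hypothetical tristate query that distinguishes "not supported" from
   "never heard of this feature".  Names invented for illustration.  */
#include <stddef.h>

enum feature_status
{
  FEATURE_SUPPORTED,    /* CPU and OS both support it            */
  FEATURE_UNSUPPORTED,  /* known feature, CPU or OS lacks it     */
  FEATURE_UNKNOWN       /* this library has never heard of it    */
};

enum feature_status
query_feature (unsigned int index)
{
  /* A curated table: only features with known semantics appear.  */
  static const struct { unsigned int index; int supported; } known[] =
    {
      { 0, 1 },  /* e.g. a baseline feature, always present */
      { 1, 0 },  /* e.g. a feature probed and found absent  */
    };
  for (size_t i = 0; i < sizeof known / sizeof known[0]; i++)
    if (known[i].index == index)
      return known[i].supported ? FEATURE_SUPPORTED : FEATURE_UNSUPPORTED;
  return FEATURE_UNKNOWN;  /* never conflated with UNSUPPORTED */
}
```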


* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08 14:36         ` Dave Hansen
@ 2021-07-08 14:41           ` Florian Weimer
  0 siblings, 0 replies; 35+ messages in thread
From: Florian Weimer @ 2021-07-08 14:41 UTC (permalink / raw)
  To: Dave Hansen; +Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, linux-kernel

* Dave Hansen:

> That's kinda my whole point.
>
> These *MUST* be curated to be meaningful.  Right now, someone just
> dumped a set of CPUID bits into the documentation.
>
> The interface really needs *three* modes:
>
> 1. Yes, the CPU/OS supports this feature
> 2. No, the CPU/OS doesn't support this feature
> 3. Hell if I know, never heard of this feature
> 	
> The interface really conflates 2 and 3.  To me, that makes it
> fundamentally flawed.

That's an interesting point.

3 looks potentially more useful than the feature/usable distinction to
me.

The recent RTM change suggests that there are more states, but we
probably can't do much about such soft-disable changes.

Thanks,
Florian



* Re: x86 CPU features detection for applications (and AMX)
  2021-07-08  7:08   ` Florian Weimer
@ 2021-07-08 15:13     ` Thiago Macieira
  0 siblings, 0 replies; 35+ messages in thread
From: Thiago Macieira @ 2021-07-08 15:13 UTC (permalink / raw)
  To: Florian Weimer; +Cc: hjl.tools, libc-alpha, linux-api, linux-arch, x86

On Thursday, 8 July 2021 00:08:16 PDT Florian Weimer wrote:
> > The first problem is the cross-platformness need. Because we library and
> > application developers need to support other OSes, we'll need to deploy
> > our
> > own CPUID-based detection. It's far better to use common code everywhere,
> > where one developer working on Linux can fix bugs in FreeBSD, macOS or
> > Windows or any of the permutations. Every platform-specific deviation
> > adds to maintenance requirements and is a source of potential latent
> > bugs, now or in the future due to refactoring. That is why doing
> > everything in the form of instructions would be far better and easier,
> > rather than system calls.
> I must say this is a rather application-specific view.  Sure, you get
> consistency within the application across different targets, but for
> those who work on multiple applications (but perhaps on a single
> distribution/OS), things are very inconsistent.

Why would they be inconsistent, if the library is cross-platform?

> And the reason why I started this is that CPUID-based feature detection
> is dead anyway (assuming the kernel developers do not implement lazy
> initialization of the AMX state).  CPUID (and ancillary data such as
> XCR0) will say that AMX support is there, but it will not work unless
> some (yet to decided) steps are executed by the userspace thread.
> 
> While I consider the CPUID-based model a success (and the cross-OS
> consistency may have contributed to that), its days seem to be over.

Well, we need to design the API of this library such that we can
accommodate the various possibilities. For every CPU feature, the
library needs to be able to tell what the state of support is: "already
enabled", "possible but not enabled", or "impossible", along with a call
to enable a feature. The latter should be supported at least for the
AVX512 and AMX states. On Linux, only AMX will be tristate, but on macOS
we need the tristate for AVX512 too.

This library would then wrap all the necessary checking for OSXSAVE and XCR0, 
so the user doesn't need to worry about them or how the OS enables them, only 
the features they're interested in.

Additionally, I'd like the library to also have constant expression paths that 
evaluate to constant true if the feature was already enabled at compile time 
(e.g., -march=x86-64-v3 sets __AVX2__ and __FMA__, so you can always run AVX2 
and FMA code, without checking). But that's just icing on top.
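(A sketch of that compile-time path - helper names invented here; under
-mavx2 or -march=x86-64-v3 the compiler defines __AVX2__, the condition
folds to a constant, and the runtime probe becomes dead code:)

```c
/* Constant-expression fast path, sketched for GCC/Clang.  If the
   compilation target already guarantees AVX2, the check is a constant
   1 and the runtime probe is compiled out.  Helper names invented.  */
static int
runtime_probe_avx2 (void)
{
  return __builtin_cpu_supports ("avx2");  /* GCC/Clang builtin */
}

static inline int
cpu_has_avx2 (void)
{
#if defined (__AVX2__)
  return 1;                    /* guaranteed by the compilation target */
#else
  return runtime_probe_avx2 ();
#endif
}
```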

(it won't come as a surprise that I already have code for most of this)

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering





* Re: x86 CPU features detection for applications (and AMX)
  2021-06-23 15:04 x86 CPU features detection for applications (and AMX) Florian Weimer
  2021-06-23 15:32 ` Dave Hansen
  2021-06-25 23:31 ` Thiago Macieira
@ 2021-07-08 17:56 ` Mark Brown
  2 siblings, 0 replies; 35+ messages in thread
From: Mark Brown @ 2021-07-08 17:56 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, linux-api, x86, linux-arch, H.J. Lu, Catalin Marinas,
	Will Deacon


On Wed, Jun 23, 2021 at 05:04:27PM +0200, Florian Weimer wrote:

Copying in Catalin & Will.

> We have an interface in glibc to query CPU features:

>   X86-specific Facilities
>   <https://www.gnu.org/software/libc/manual/html_node/X86.html>

> CPU_FEATURE_USABLE all preconditions for a feature are met,
> HAS_CPU_FEATURE means it's in silicon but possibly dormant.
> CPU_FEATURE_USABLE is supposed to look at XCR0, AT_HWCAP2 etc. before
> enabling the relevant bit (so it cannot pass through any unknown bits).

...

> When we designed this glibc interface, we assumed that bits would be
> static during the life-time of the process, initialized at process
> start.  That follows the model of previous x86 CPU feature enablement.

...

> This still wouldn't cover the enable/disable side, but at least it would
> work for CPU features which are modal and come and go.  The fact that we
> tell GCC to cache the returned pointer from that internal function, but
> not that the data is immutable works to our advantage here.

> On the other hand, maybe there is a way to give users a better
> interface.  Obviously we want to avoid a syscall for a simple CPU
> feature check.  And we also need something to enable/disable CPU
> features.

This enabling and disabling of CPU features sounds like something that
might also become relevant for arm64, for example I can see a use case
for having something that allows some of the more expensive features
to be masked from some userspace processes for resource management
purposes.  This sounds like a bit of a different use case to x86 AIUI
but I think there's overlap in the actual operations that would be
needed.


