[PATCH] arm64: Add support for Half precision floating point

* [PATCH] arm64: Add support for Half precision floating point
@ 2016-01-26 15:52 Suzuki K Poulose
  2016-01-26 16:02 ` Will Deacon
  2016-02-26 15:37 ` Catalin Marinas
  0 siblings, 2 replies; 20+ messages in thread
From: Suzuki K Poulose @ 2016-01-26 15:52 UTC (permalink / raw)
  To: linux-arm-kernel

ARMv8.2 extensions [1] include an optional feature that supports
half-precision (16-bit) floating point/ASIMD data processing
instructions. This patch adds support for detecting the feature and
exposing it to userspace via HWCAPs.

[1] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution

Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
---
 arch/arm64/include/uapi/asm/hwcap.h |    2 ++
 arch/arm64/kernel/cpufeature.c      |    2 ++
 arch/arm64/kernel/cpuinfo.c         |    2 ++
 3 files changed, 6 insertions(+)

diff --git a/arch/arm64/include/uapi/asm/hwcap.h b/arch/arm64/include/uapi/asm/hwcap.h
index 361c8a8..a739287 100644
--- a/arch/arm64/include/uapi/asm/hwcap.h
+++ b/arch/arm64/include/uapi/asm/hwcap.h
@@ -28,5 +28,7 @@
 #define HWCAP_SHA2		(1 << 6)
 #define HWCAP_CRC32		(1 << 7)
 #define HWCAP_ATOMICS		(1 << 8)
+#define HWCAP_FPHP		(1 << 9)
+#define HWCAP_ASIMDHP		(1 << 10)
 
 #endif /* _UAPI__ASM_HWCAP_H */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 5c90aa4..798898e 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -673,7 +673,9 @@ static const struct arm64_cpu_capabilities arm64_hwcaps[] = {
 	HWCAP_CAP(SYS_ID_AA64ISAR0_EL1, ID_AA64ISAR0_CRC32_SHIFT, 1, CAP_HWCAP, HWCAP_CRC32),
 	HWCAP_CAP(SYS_ID_AA64ISAR0_EL1, ID_AA64ISAR0_ATOMICS_SHIFT, 2, CAP_HWCAP, HWCAP_ATOMICS),
 	HWCAP_CAP(SYS_ID_AA64PFR0_EL1, ID_AA64PFR0_FP_SHIFT, 0, CAP_HWCAP, HWCAP_FP),
+	HWCAP_CAP(SYS_ID_AA64PFR0_EL1, ID_AA64PFR0_FP_SHIFT, 1, CAP_HWCAP, HWCAP_FPHP),
 	HWCAP_CAP(SYS_ID_AA64PFR0_EL1, ID_AA64PFR0_ASIMD_SHIFT, 0, CAP_HWCAP, HWCAP_ASIMD),
+	HWCAP_CAP(SYS_ID_AA64PFR0_EL1, ID_AA64PFR0_ASIMD_SHIFT, 1, CAP_HWCAP, HWCAP_ASIMDHP),
 #ifdef CONFIG_COMPAT
 	HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_AES_SHIFT, 2, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_PMULL),
 	HWCAP_CAP(SYS_ID_ISAR5_EL1, ID_ISAR5_AES_SHIFT, 1, CAP_COMPAT_HWCAP2, COMPAT_HWCAP2_AES),
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 212ae63..05523f0 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -59,6 +59,8 @@ static const char *const hwcap_str[] = {
 	"sha2",
 	"crc32",
 	"atomics",
+	"fphp",
+	"asimdhp",
 	NULL
 };
 
-- 
1.7.9.5
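
For illustration, a minimal sketch of how userspace might test the two new
bits through the ELF auxiliary vector. The bit values are copied from the
hunk above; the helper names (hwcap_has, cpu_has_fphp, cpu_has_asimdhp) are
hypothetical, and the result is only meaningful on an arm64 kernel that
carries this patch.

```c
#include <stdbool.h>
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */

/* Bit values taken from the patch; guarded in case the libc headers
 * already define them. */
#ifndef HWCAP_FPHP
#define HWCAP_FPHP    (1UL << 9)
#endif
#ifndef HWCAP_ASIMDHP
#define HWCAP_ASIMDHP (1UL << 10)
#endif

/* Pure helper (hypothetical name): test one capability bit in a hwcap word. */
static inline bool hwcap_has(unsigned long hwcap, unsigned long bit)
{
    return (hwcap & bit) != 0;
}

/* Query the AT_HWCAP word of the running process. */
static inline bool cpu_has_fphp(void)
{
    return hwcap_has(getauxval(AT_HWCAP), HWCAP_FPHP);
}

static inline bool cpu_has_asimdhp(void)
{
    return hwcap_has(getauxval(AT_HWCAP), HWCAP_ASIMDHP);
}
```

A caller would typically evaluate cpu_has_fphp() once at start-up and pick
a half-precision code path only if it returns true.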

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-26 15:52 [PATCH] arm64: Add support for Half precision floating point Suzuki K Poulose
@ 2016-01-26 16:02 ` Will Deacon
  2016-01-26 16:11   ` Catalin Marinas
  2016-01-26 16:21   ` Suzuki K. Poulose
  2016-02-26 15:37 ` Catalin Marinas
  1 sibling, 2 replies; 20+ messages in thread
From: Will Deacon @ 2016-01-26 16:02 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Suzuki,

On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
> ARMv8.2 extensions [1] include an optional feature, which supports
> half precision(16bit) floating point/asimd data processing
> instructions. This patch adds support for detecting and exposing
> the same to the userspace via HWCAPs
> 
> [1] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution
> 
> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm64/include/uapi/asm/hwcap.h |    2 ++
>  arch/arm64/kernel/cpufeature.c      |    2 ++
>  arch/arm64/kernel/cpuinfo.c         |    2 ++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/arch/arm64/include/uapi/asm/hwcap.h b/arch/arm64/include/uapi/asm/hwcap.h
> index 361c8a8..a739287 100644
> --- a/arch/arm64/include/uapi/asm/hwcap.h
> +++ b/arch/arm64/include/uapi/asm/hwcap.h
> @@ -28,5 +28,7 @@
>  #define HWCAP_SHA2		(1 << 6)
>  #define HWCAP_CRC32		(1 << 7)
>  #define HWCAP_ATOMICS		(1 << 8)
> +#define HWCAP_FPHP		(1 << 9)
> +#define HWCAP_ASIMDHP		(1 << 10)

Where did we get to with the mrs trapping you proposed here?

  http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html

At some point, we need to consider whether or not we want to continue
adding new HWCAPs or whether your suggestion above is actually useful
to userspace.

Did the libc guys get anywhere with a prototype? What do we need to do
to make progress with it?

Will

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-26 16:02 ` Will Deacon
@ 2016-01-26 16:11   ` Catalin Marinas
  2016-01-28 16:00     ` Will Deacon
  2016-01-26 16:21   ` Suzuki K. Poulose
  1 sibling, 1 reply; 20+ messages in thread
From: Catalin Marinas @ 2016-01-26 16:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 26, 2016 at 04:02:58PM +0000, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
> > ARMv8.2 extensions [1] include an optional feature, which supports
> > half precision(16bit) floating point/asimd data processing
> > instructions. This patch adds support for detecting and exposing
> > the same to the userspace via HWCAPs
> > 
> > [1] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution
> > 
> > Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
> > ---
> >  arch/arm64/include/uapi/asm/hwcap.h |    2 ++
> >  arch/arm64/kernel/cpufeature.c      |    2 ++
> >  arch/arm64/kernel/cpuinfo.c         |    2 ++
> >  3 files changed, 6 insertions(+)
> > 
> > diff --git a/arch/arm64/include/uapi/asm/hwcap.h b/arch/arm64/include/uapi/asm/hwcap.h
> > index 361c8a8..a739287 100644
> > --- a/arch/arm64/include/uapi/asm/hwcap.h
> > +++ b/arch/arm64/include/uapi/asm/hwcap.h
> > @@ -28,5 +28,7 @@
> >  #define HWCAP_SHA2		(1 << 6)
> >  #define HWCAP_CRC32		(1 << 7)
> >  #define HWCAP_ATOMICS		(1 << 8)
> > +#define HWCAP_FPHP		(1 << 9)
> > +#define HWCAP_ASIMDHP		(1 << 10)
> 
> Where did we get to with the mrs trapping you proposed here?
> 
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
> 
> At some point, we need to consider whether or not we want to continue
> adding new HWCAPs or whether your suggestion above is actually useful
> to userspace.

IMO, even if we merge the MRS emulation, I would still like to see
HWCAPs exported. We are not short on bits yet (53 to go ;)).

> Did the libc guys get anywhere with a prototype? What do we need to do
> to make progress with it?

This investigation should indeed continue but I think it is orthogonal.

-- 
Catalin
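
The "53 to go" figure can be checked with a little arithmetic, assuming the
hwcap word is 64 bits wide on arm64 and that bits 0 through 10 are taken
once this patch is applied (HWCAP_ASIMDHP being bit 10). The names below
are hypothetical and only serve the calculation.

```c
/* Highest HWCAP bit allocated once this patch is applied (HWCAP_ASIMDHP). */
enum { HWCAP_HIGHEST_USED_BIT = 10 };

/* Number of unallocated bits in a hwcap word that is 'word_bits' wide,
 * assuming bits 0..HWCAP_HIGHEST_USED_BIT are all in use. */
static inline unsigned int hwcap_bits_free(unsigned int word_bits)
{
    return word_bits - (HWCAP_HIGHEST_USED_BIT + 1);
}
```

With a 64-bit word this gives 64 - 11 = 53 remaining bits.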

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-26 16:02 ` Will Deacon
  2016-01-26 16:11   ` Catalin Marinas
@ 2016-01-26 16:21   ` Suzuki K. Poulose
  2016-01-26 16:55     ` Siddhesh Poyarekar
  1 sibling, 1 reply; 20+ messages in thread
From: Suzuki K. Poulose @ 2016-01-26 16:21 UTC (permalink / raw)
  To: linux-arm-kernel

On 26/01/16 16:02, Will Deacon wrote:
> Hi Suzuki,
>
> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>> ARMv8.2 extensions [1] include an optional feature, which supports
>> half precision(16bit) floating point/asimd data processing
>> instructions. This patch adds support for detecting and exposing
>> the same to the userspace via HWCAPs


>> +#define HWCAP_FPHP		(1 << 9)
>> +#define HWCAP_ASIMDHP		(1 << 10)
>
> Where did we get to with the mrs trapping you proposed here?
>
>    http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html

We have yet to get feedback from the glibc/gcc folks. Siddhesh was looking
to make use of it [2], but we haven't heard anything back. Ramana mentioned
(in private) that they had some plans to take a look at it.

>
> At some point, we need to consider whether or not we want to continue
> adding new HWCAPs or whether your suggestion above is actually useful
> to userspace.

Definitely.


> Did the libc guys get anywhere with a prototype? What do we need to do
> to make progress with it?

I am not sure.

Siddhesh, Ramana,

Could you please let us know your plans?

[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/381422.html

Thanks
Suzuki

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-26 16:21   ` Suzuki K. Poulose
@ 2016-01-26 16:55     ` Siddhesh Poyarekar
  2016-01-28 16:07       ` Will Deacon
  0 siblings, 1 reply; 20+ messages in thread
From: Siddhesh Poyarekar @ 2016-01-26 16:55 UTC (permalink / raw)
  To: linux-arm-kernel

Adding Adhemerval to cc since he had volunteered to follow up on this,
mainly because he had a couple of additional ideas on the kernel
front.

On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
> On 26/01/16 16:02, Will Deacon wrote:
> >Hi Suzuki,
> >
> >On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
> >>ARMv8.2 extensions [1] include an optional feature, which supports
> >>half precision(16bit) floating point/asimd data processing
> >>instructions. This patch adds support for detecting and exposing
> >>the same to the userspace via HWCAPs
> 
> 
> >>+#define HWCAP_FPHP		(1 << 9)
> >>+#define HWCAP_ASIMDHP		(1 << 10)
> >
> >Where did we get to with the mrs trapping you proposed here?
> >
> >   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
> 
> We are yet to get some feedback from glibc/gcc folks. Siddhesh was looking
> to make use of it [2]. But haven't heard anything back. Ramana mentioned
> (in private) that they had some plans to take a look at it.

I believe one of Adhemerval's ideas was similar to what I had
mentioned back then, which was to provide all of the CPU information
in a single file instead of having to traverse a directory structure.
The other idea was to add a vDSO function that returns this data so as
to avoid (or at least reduce) the context switch latency.

This still does not solve the problem of CPUs coming online later, but
that problem shouldn't be a showstopper.  The effect of a CPU coming
online later is limited to processes that are already running, and it
is not confined to microarchitecture selection: it also affects the
performance of other code within glibc.

The other aspect on which I am waiting for feedback from ARM is the
property of the MIDR value.  If it can be ascertained that a core
with a specific MIDR value will always only be in a homogeneous
configuration, we could bypass the directory traversal and just stick
to the value returned from midr_el1.  This is likely vendor-specific
and I'm waiting to know if the ARM toolchain hackers would be
comfortable with baking in such assumptions into glibc.  Extra marks
for making such a requirement explicit in future specifications.

> >At some point, we need to consider whether or not we want to continue
> >adding new HWCAPs or whether your suggestion above is actually useful
> >to userspace.
> 
> Definitely.
> 
> 
> >Did the libc guys get anywhere with a prototype? What do we need to do
> >to make progress with it?
> 
> I am not sure.
> 
> Siddesh, Ramana,
> 
> Could you please let us know your plans ?
> 
> [2] http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/381422.html

I had hacked at some code with directory traversal on top of your
patch and it works fine as a PoC, but until we get consensus on how
we want to handle things like big.LITTLE, there can't be much
progress.

Siddhesh
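
As an illustration of the directory traversal being discussed, here is a
rough sketch that reads one MIDR per CPU from sysfs and checks whether all
of them match. The path format is an assumption based on the layout
proposed alongside the MRS emulation patches, and every function name is
hypothetical; this is not the PoC code referred to above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed sysfs layout; adjust if the kernel exposes it elsewhere. */
#define MIDR_PATH_FMT \
    "/sys/devices/system/cpu/cpu%u/regs/identification/midr_el1"

/* Read one hexadecimal value (with or without a 0x prefix) from a file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static inline int read_hex_file(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%llx", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Collect MIDRs for cpu0..max_cpus-1, skipping CPUs whose file cannot be
 * read. Returns the number of values stored in midr[]. */
static inline unsigned int collect_midrs(unsigned long long *midr,
                                         unsigned int max_cpus)
{
    char path[128];
    unsigned int found = 0;

    for (unsigned int cpu = 0; cpu < max_cpus; cpu++) {
        snprintf(path, sizeof(path), MIDR_PATH_FMT, cpu);
        if (read_hex_file(path, &midr[found]) == 0)
            found++;
    }
    return found;
}

/* True if every entry in midr[0..n-1] is identical. */
static inline bool midrs_homogeneous(const unsigned long long *midr,
                                     unsigned int n)
{
    for (unsigned int i = 1; i < n; i++)
        if (midr[i] != midr[0])
            return false;
    return true;
}
```

The cost here is one open/read/close per CPU, which is the overhead a
single-file interface would avoid.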

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-26 16:11   ` Catalin Marinas
@ 2016-01-28 16:00     ` Will Deacon
  2016-02-16 11:48       ` Szabolcs Nagy
  0 siblings, 1 reply; 20+ messages in thread
From: Will Deacon @ 2016-01-28 16:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 26, 2016 at 04:11:51PM +0000, Catalin Marinas wrote:
> On Tue, Jan 26, 2016 at 04:02:58PM +0000, Will Deacon wrote:
> > On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
> > > ARMv8.2 extensions [1] include an optional feature, which supports
> > > half precision(16bit) floating point/asimd data processing
> > > instructions. This patch adds support for detecting and exposing
> > > the same to the userspace via HWCAPs
> > > 
> > > [1] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution
> > > 
> > > Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
> > > ---
> > >  arch/arm64/include/uapi/asm/hwcap.h |    2 ++
> > >  arch/arm64/kernel/cpufeature.c      |    2 ++
> > >  arch/arm64/kernel/cpuinfo.c         |    2 ++
> > >  3 files changed, 6 insertions(+)
> > > 
> > > diff --git a/arch/arm64/include/uapi/asm/hwcap.h b/arch/arm64/include/uapi/asm/hwcap.h
> > > index 361c8a8..a739287 100644
> > > --- a/arch/arm64/include/uapi/asm/hwcap.h
> > > +++ b/arch/arm64/include/uapi/asm/hwcap.h
> > > @@ -28,5 +28,7 @@
> > >  #define HWCAP_SHA2		(1 << 6)
> > >  #define HWCAP_CRC32		(1 << 7)
> > >  #define HWCAP_ATOMICS		(1 << 8)
> > > +#define HWCAP_FPHP		(1 << 9)
> > > +#define HWCAP_ASIMDHP		(1 << 10)
> > 
> > Where did we get to with the mrs trapping you proposed here?
> > 
> >   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
> > 
> > At some point, we need to consider whether or not we want to continue
> > adding new HWCAPs or whether your suggestion above is actually useful
> > to userspace.
> 
> IMO, even if we merge the MRS emulation, I would still like to see
> HWCAPs exported. We are not short on bits yet (53 to go ;)).

I'm less keen. HWCAPs don't align well with the way that the ARM
architecture versions features and we should be encouraging people to
use the MRS emulation if it exists.

> > Did the libc guys get anywhere with a prototype? What do we need to do
> > to make progress with it?
> 
> This investigation should indeed continue but I think it is orthogonal.

Not if it's intended to replace HWCAPs in the long term.

Will

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-26 16:55     ` Siddhesh Poyarekar
@ 2016-01-28 16:07       ` Will Deacon
  2016-01-28 16:46         ` Siddhesh Poyarekar
  2016-01-28 16:51         ` Adhemerval Zanella
  0 siblings, 2 replies; 20+ messages in thread
From: Will Deacon @ 2016-01-28 16:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 26, 2016 at 10:25:38PM +0530, Siddhesh Poyarekar wrote:
> Adding Adhemerval to cc since he had volunteered to follow up on this,
> mainly because he had a couple of additional ideas on the kernel
> front.
> 
> On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
> > On 26/01/16 16:02, Will Deacon wrote:
> > >Hi Suzuki,
> > >
> > >On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
> > >>ARMv8.2 extensions [1] include an optional feature, which supports
> > >>half precision(16bit) floating point/asimd data processing
> > >>instructions. This patch adds support for detecting and exposing
> > >>the same to the userspace via HWCAPs
> > 
> > 
> > >>+#define HWCAP_FPHP		(1 << 9)
> > >>+#define HWCAP_ASIMDHP		(1 << 10)
> > >
> > >Where did we get to with the mrs trapping you proposed here?
> > >
> > >   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
> > 
> > We are yet to get some feedback from glibc/gcc folks. Siddhesh was looking
> > to make use of it [2]. But haven't heard anything back. Ramana mentioned
> > (in private) that they had some plans to take a look at it.
> 
> I believe one of Adhemerval's ideas was similar to what I had
> mentioned back then, which was to provide all of the CPU information
> in a single file instead of having to traverse a directory structure.

My understanding was that libc needed this information extremely early
on (i.e. before it could even issue system calls), and therefore such
an approach would be in addition to the proposal here. Am I mistaken?

> The other idea was to add a vDSO function that returns this data so as
> to avoid (or at least reduce) the context switch latency.

I'm not at all keen on adding a data ABI to the vDSO. I think people tried
similar things in the past (something on PPC?) and have horror stories
from that.

> The other aspect that I am waiting for feedback from ARM for is about
> the property of the MIDR value.  If it can be ascertained that a core
> with a specific MIDR value will always only be in a homogeneous
> configuration, we could bypass the directory traversal and just stick
> to the value returned from midr_el1.  This is likely vendor-specific
> and I'm waiting to know if the ARM toolchain hackers would be
> comfortable with baking in such assumptions into glibc.  Extra marks
> for making such a requirement explicit in future specifications.

The architecture makes no guarantees about what will and won't be used
in different configurations, so we shouldn't try to derive this from the
MIDR. Even if you figure out a heuristic for today's platforms, it won't
necessarily hold true in the future.

> I had hacked at some code with directory traversal on top of your
> patch and it works fine as far as doing a PoC, but until we get
> consensus on how we want to handle things like BIG.little, there can't
> be much progress.

By "directory traversal" are you only referring to the /sys portions
of this? I'm *much* more interested in the utility of the MRS emulation
part, since that's what could effectively replace HWCAPs in the future.

As for big/little, the kernel view has been pretty consistent on that:
we will expose a "sanitised" view of the registers (as described in the
Documentation along with the patch) where we can, and for the per-CPU
registers such as MIDR, you will read the current CPU register (which
is why those registers are also exposed in sysfs).

Will

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-28 16:07       ` Will Deacon
@ 2016-01-28 16:46         ` Siddhesh Poyarekar
  2016-01-28 17:27           ` Catalin Marinas
  2016-01-28 16:51         ` Adhemerval Zanella
  1 sibling, 1 reply; 20+ messages in thread
From: Siddhesh Poyarekar @ 2016-01-28 16:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 28, 2016 at 04:07:48PM +0000, Will Deacon wrote:
> My understanding was that libc needed this information extremely early
> on (i.e. before it could even issue system calls), and therefore such
> an approach would be in addition to the proposal here. Am I mistaken?

Not really.  glibc will need this information before it can call any of
the ifunc-selected functions, i.e. typically string and some math
functions.  System calls are not an issue since we don't have
microarchitecture-specific system calls.  Suzuki's patch works just
fine; it is just that, to make sure we're selecting the correct routine,
we may (in the worst case) have to traverse the /sys directories to
get information from all of the cpu files.  A single file with all
that information would be much better performance-wise.

> I'm not at all keen on adding a data ABI to the vDSO. I think people
> tried similar things in the past (something on PPC?) and have horror
> stories from that.

It does not have to be a data ABI, it could be a set of functions that
initialize an opaque context and iterate through the cpu data for each
call, something that would allow me to do this:

    cpu_info_context_t *ctx = cpu_info_init_context ();
    unsigned long midr;

    while ((midr = cpu_info_next_midr (ctx)) != 0)
      {
        /* Do stuff.  */
      }

> The architecture makes no guarantees about what will and won't be used
> in different configurations, so we shouldn't try to derive this from the
> MIDR. Even if you figure out a heuristic for today's platforms, it won't
> necessarily hold true in the future.

Another approach could be vendor confirmation that they would never
release cores with the same MIDR value in different configurations.
That is to say, a PE with a specific MIDR value will always be in a
homogeneous system and will never be part of a big.LITTLE
configuration.  The microarchitecture routines are essentially
vendor-specific, so getting this assurance from them should be
sufficient, shouldn't it?

> By "directory traversal" are you only referring to the /sys portions
> of this? I'm *much* more interested in the utility of the MRS emulation
> part, since that's what could effectively replace HWCAPs in the future.

The MRS emulation may be sufficient for the case where the system is
homogeneous and the vendor states that the MIDR will never be used in a
heterogeneous configuration.  If not, we will have no choice but to
traverse the /sys directories to find the MIDR for all online cpus and
make that decision.

> As for big/little, the kernel view has been pretty consistent on that:
> we will expose a "sanitised" view of the registers (as described in the
> Documentation along with the patch) where we can, and for the per-CPU
> registers such as MIDR, you will read the current CPU register (which
> is why those registers are also exposed in sysfs).

That's a reasonable approach, my only point of contention was to find
a faster alternative to the directory traversal.

Siddhesh

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-28 16:07       ` Will Deacon
  2016-01-28 16:46         ` Siddhesh Poyarekar
@ 2016-01-28 16:51         ` Adhemerval Zanella
  2016-02-02 17:31           ` Szabolcs Nagy
  1 sibling, 1 reply; 20+ messages in thread
From: Adhemerval Zanella @ 2016-01-28 16:51 UTC (permalink / raw)
  To: linux-arm-kernel



On 28-01-2016 14:07, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 10:25:38PM +0530, Siddhesh Poyarekar wrote:
>> Adding Adhemerval to cc since he had volunteered to follow up on this,
>> mainly because he had a couple of additional ideas on the kernel
>> front.
>>
>> On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
>>> On 26/01/16 16:02, Will Deacon wrote:
>>>> Hi Suzuki,
>>>>
>>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>>> ARMv8.2 extensions [1] include an optional feature, which supports
>>>>> half precision(16bit) floating point/asimd data processing
>>>>> instructions. This patch adds support for detecting and exposing
>>>>> the same to the userspace via HWCAPs
>>>
>>>
>>>>> +#define HWCAP_FPHP		(1 << 9)
>>>>> +#define HWCAP_ASIMDHP		(1 << 10)
>>>>
>>>> Where did we get to with the mrs trapping you proposed here?
>>>>
>>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>
>>> We are yet to get some feedback from glibc/gcc folks. Siddhesh was looking
>>> to make use of it [2]. But haven't heard anything back. Ramana mentioned
>>> (in private) that they had some plans to take a look at it.
>>
>> I believe one of Adhemerval's ideas was similar to what I had
>> mentioned back then, which was to provide all of the CPU information
>> in a single file instead of having to traverse a directory structure.
> 
> My understanding was that libc needed this information extremely early
> on (i.e. before it could even issue system calls), and therefore such
> an approach would be in addition to the proposal here. Am I mistaken?

If the idea is to use these instructions for function implementation
selection (IFUNC), the information has to be available at PLT resolution,
either by accessing it directly or through a caching mechanism. x86_64
does something similar with cacheline information: it issues a single
cpuid and creates a processor information table from the result (this is
also what __builtin_cpu_supports() does).
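
The caching idea can be approximated in plain C with a function pointer
that is resolved on first use. This is only a sketch of the pattern, not
glibc's IFUNC machinery: the feature bit, the two stand-in implementations
and all names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

/* Two stand-in implementations; a real libc would provide tuned variants. */
static size_t strlen_generic(const char *s) { return strlen(s); }
static size_t strlen_simd_stub(const char *s) { return strlen(s); }

/* Hypothetical feature bit used only for this example. */
#define FEATURE_SIMD (1UL << 1)

/* Pick an implementation from a feature word. */
static inline strlen_fn select_strlen(unsigned long features)
{
    return (features & FEATURE_SIMD) ? strlen_simd_stub : strlen_generic;
}

/* Cached choice: NULL until the first call. Not thread-safe as written. */
static strlen_fn resolved_strlen;

/* Resolve once, then reuse the cached pointer on every later call. */
static inline size_t dispatch_strlen(const char *s, unsigned long features)
{
    if (!resolved_strlen)
        resolved_strlen = select_strlen(features);
    return resolved_strlen(s);
}
```

After the first call the feature word is no longer consulted, which is the
property that makes the cost of the initial detection less critical.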

> 
>> The other idea was to add a vDSO function that returns this data so as
>> to avoid (or at least reduce) the context switch latency.
> 
> I'm not at all keen on adding a data ABI to the vDSO. I think people tried
> similar things in the past (something on PPC?) and have horror stories
> from that.

In fact ppc still exports it in vDSO (include/asm/vdso_datapage.h), with
information like the LPAR cfg, platform, processor, {d,i}cache, etc.
I recall that I have seen some code back at IBM that tried to use these
fields directly, but indeed it is not recommended.

What I have in mind is something like what ppc does with
__kernel_get_syscall_map.  It is a vDSO function that returns vDSO-internal
data describing which syscalls are implemented in the running kernel
(through a bitmap field).

> 
>> The other aspect that I am waiting for feedback from ARM for is about
>> the property of the MIDR value.  If it can be ascertained that a core
>> with a specific MIDR value will always only be in a homogeneous
>> configuration, we could bypass the directory traversal and just stick
>> to the value returned from midr_el1.  This is likely vendor-specific
>> and I'm waiting to know if the ARM toolchain hackers would be
>> comfortable with baking in such assumptions into glibc.  Extra marks
>> for making such a requirement explicit in future specifications.
> 
> The architecture makes no guarantees about what will and won't be used
> in different configurations, so we shouldn't try to derive this from the
> MIDR. Even if you figure out a heuristic for today's platforms, it won't
> necessarily hold true in the future.
> 
>> I had hacked at some code with directory traversal on top of your
>> patch and it works fine as far as doing a PoC, but until we get
>> consensus on how we want to handle things like BIG.little, there can't
>> be much progress.
> 
> By "directory traversal" are you only referring to the /sys portions
> of this? I'm *much* more interested in the utility of the MRS emulation
> part, since that's what could effectively replace HWCAPs in the future.
> 
> As for big/little, the kernel view has been pretty consistent on that:
> we will expose a "sanitised" view of the registers (as described in the
> Documentation along with the patch) where we can, and for the per-CPU
> registers such as MIDR, you will read the current CPU register (which
> is why those registers are also exposed in sysfs).
> 
> Will
> 

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-28 16:46         ` Siddhesh Poyarekar
@ 2016-01-28 17:27           ` Catalin Marinas
  2016-01-28 17:44             ` Siddhesh Poyarekar
  0 siblings, 1 reply; 20+ messages in thread
From: Catalin Marinas @ 2016-01-28 17:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 28, 2016 at 10:16:31PM +0530, Siddhesh Poyarekar wrote:
> On Thu, Jan 28, 2016 at 04:07:48PM +0000, Will Deacon wrote:
> > My understanding was that libc needed this information extremely early
> > on (i.e. before it could even issue system calls), and therefore such
> > an approach would be in addition to the proposal here. Am I mistaken?
> 
> Not really, glibc will need this information before it can call any of
> the ifunc-selected functions, i.e. typically string and some math
> functions.  System calls are not an issue since we don't have
> microarchitecture-specific system calls.  Suzuki's patch works just
> fine, just that to make sure that we're selecting the correct routine,
> we may (in the worst case) have to traverse the /sysfs directories to
> get information from all of the cpu files.  A single file with all
> that information would be much better performance-wise.

Suzuki's MRS emulation only exposes the CPU feature registers and not
the MIDR. So this would help with choosing implementation based on
features (e.g. crypto) but not for micro-architecture tuning.

> > The architecture makes no guarantees about what will and won't be used
> > in different configurations, so we shouldn't try to derive this from the
> > MIDR. Even if you figure out a heuristic for today's platforms, it won't
> > necessarily hold true in the future.
> 
> Another approach could be vendor confirmation that they would never
> release cores with the same MIDR value in different configurations.
> That is to say, a PE with a specific MIDR value will always be in a
> homogenous system and will never be part of a big.little
> configuration.  The microarchitecture routines are essentially
> vendor-specific, so getting this assurance from them should be
> sufficient, shouldn't it?

I don't think you would get such assurance. Basically how the CPUs are
connected on a system is the property of a SoC and not of the CPU (nor
MIDR). I know it's not helpful but we don't really have an option (other
than using made-up MIDR values with some reserved vendor id field and a
central, OS-agnostic database to describe the real MIDRs).

> > As for big/little, the kernel view has been pretty consistent on that:
> > we will expose a "sanitised" view of the registers (as described in the
> > Documentation along with the patch) where we can, and for the per-CPU
> > registers such as MIDR, you will read the current CPU register (which
> > is why those registers are also exposed in sysfs).
> 
> That's a reasonable approach, my only point of contention was to find
> a faster alternative to the directory traversal.

So are you only interested in MIDR for microarchitecture tuning? Would
glibc make any use of the feature registers exposed via MRS emulation
(and so far mirrored by HWCAP bits, well, as long as we can still do
this sanely)?

-- 
Catalin

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-28 17:27           ` Catalin Marinas
@ 2016-01-28 17:44             ` Siddhesh Poyarekar
  2016-01-28 17:55               ` Suzuki K. Poulose
  0 siblings, 1 reply; 20+ messages in thread
From: Siddhesh Poyarekar @ 2016-01-28 17:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 28, 2016 at 05:27:05PM +0000, Catalin Marinas wrote:
> Suzuki's MRS emulation only exposes the CPU feature registers and not
> the MIDR. So this would help with choosing implementation based on
> features (e.g. crypto) but not for micro-architecture tuning.

Umm, I'm pretty sure it does, at least the patchset I tested back in
October did.  I think you may be confusing it with revidr.

> I don't think you would get such assurance. Basically how the CPUs are
> connected on a system is the property of a SoC and not of the CPU (nor
> MIDR). I know it's not helpful but we don't really have an option (other
> than using made-up MIDR values with some reserved vendor id field and a
> central, OS-agnostic database to describe the real MIDRs).

... which means we would end up traversing all of /sys and/or checking
CPU feature registers for this information, which makes a more optimal
API even more important.

> So are you only interested in MIDR for microarchitecture tuning? Would
> glibc make any use of the feature registers exposed via MRS emulation
> (and so far mirrored by HWCAP bits, well, as long as we can still do
> this sanely)?

My primary interest is exploring the ability to identify CPUs of
specific vendors (and specific make/models) based on their MIDR.
However, we would definitely also like to use the CPU feature bits to
enhance specific parts of glibc, like vectorized string/memory
routines for processors that support them.

Siddhesh
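
To make the MIDR-based identification concrete, the register can be split
into its architectural fields: implementer in bits [31:24], variant in
[23:20], architecture in [19:16], part number in [15:4] and revision in
[3:0]. The helper names below are hypothetical; 0x410fd034 is used as a
sample value (implementer 0x41, part 0xd03).

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 field accessors, following the ARMv8-A register layout. */
static inline unsigned int midr_implementer(uint64_t midr)
{
    return (unsigned int)((midr >> 24) & 0xff);
}

static inline unsigned int midr_variant(uint64_t midr)
{
    return (unsigned int)((midr >> 20) & 0xf);
}

static inline unsigned int midr_architecture(uint64_t midr)
{
    return (unsigned int)((midr >> 16) & 0xf);
}

static inline unsigned int midr_partnum(uint64_t midr)
{
    return (unsigned int)((midr >> 4) & 0xfff);
}

static inline unsigned int midr_revision(uint64_t midr)
{
    return (unsigned int)(midr & 0xf);
}

/* Same vendor and model, ignoring variant and revision. */
static inline bool midr_same_model(uint64_t a, uint64_t b)
{
    return midr_implementer(a) == midr_implementer(b) &&
           midr_partnum(a) == midr_partnum(b);
}
```

A vendor-specific routine would then be keyed on the implementer and part
number, optionally refined by variant and revision.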

* [PATCH] arm64: Add support for Half precision floating point
  2016-01-28 17:44             ` Siddhesh Poyarekar
@ 2016-01-28 17:55               ` Suzuki K. Poulose
  0 siblings, 0 replies; 20+ messages in thread
From: Suzuki K. Poulose @ 2016-01-28 17:55 UTC (permalink / raw)
  To: linux-arm-kernel

On 28/01/16 17:44, Siddhesh Poyarekar wrote:
> On Thu, Jan 28, 2016 at 05:27:05PM +0000, Catalin Marinas wrote:
>> Suzuki's MRS emulation only exposes the CPU feature registers and not
>> the MIDR. So this would help with choosing implementation based on
>> features (e.g. crypto) but not for micro-architecture tuning.
>
> Umm, I'm pretty sure it does, at least the patchset I tested back in
> October did.  I think you may be confusing with revidr.

It does expose MIDR, with the caveat that it returns the value for
the CPU on which the mrs was executed (as documented), unlike the CPU
feature registers, which guarantee a system-wide view.

Thanks
Suzuki
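
For reference, a minimal sketch of how a raw MIDR value splits into its
architectural fields, which is what per-vendor/per-model tuning would key
on. The struct and the midr_decode() helper are hypothetical names used
only for illustration; how the raw value reaches userspace (e.g. via the
proposed mrs emulation) is outside the sketch.

```c
#include <stdint.h>

/* Fields of MIDR_EL1 as laid out by the ARMv8-A architecture. */
struct midr_fields {
	unsigned implementer;	/* bits [31:24], e.g. 0x41 for ARM Ltd. */
	unsigned variant;	/* bits [23:20] */
	unsigned architecture;	/* bits [19:16] */
	unsigned partnum;	/* bits [15:4]  */
	unsigned revision;	/* bits [3:0]   */
};

/* Hypothetical helper: split a raw MIDR value into its fields. */
static struct midr_fields midr_decode(uint32_t midr)
{
	struct midr_fields f;

	f.implementer  = (midr >> 24) & 0xff;
	f.variant      = (midr >> 20) & 0xf;
	f.architecture = (midr >> 16) & 0xf;
	f.partnum      = (midr >> 4)  & 0xfff;
	f.revision     = midr & 0xf;
	return f;
}
```

Since the value describes only the CPU that executed the mrs, a caller on
a heterogeneous system cannot assume the decoded fields apply to every
core.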


* [PATCH] arm64: Add support for Half precision floating point
  2016-01-28 16:51         ` Adhemerval Zanella
@ 2016-02-02 17:31           ` Szabolcs Nagy
  2016-02-02 18:12             ` Adhemerval Zanella
  0 siblings, 1 reply; 20+ messages in thread
From: Szabolcs Nagy @ 2016-02-02 17:31 UTC (permalink / raw)
  To: linux-arm-kernel

On 28/01/16 16:51, Adhemerval Zanella wrote:
> On 28-01-2016 14:07, Will Deacon wrote:
>> On Tue, Jan 26, 2016 at 10:25:38PM +0530, Siddhesh Poyarekar wrote:
>>> Adding Adhemerval to cc since he had volunteered to follow up on this,
>>> mainly because he had a couple of additional ideas on the kernel
>>> front.
>>>
>>> On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
>>>> On 26/01/16 16:02, Will Deacon wrote:
>>>>> Hi Suzuki,
>>>>>
>>>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>>>> ARMv8.2 extensions [1] include an optional feature, which supports
>>>>>> half precision(16bit) floating point/asimd data processing
>>>>>> instructions. This patch adds support for detecting and exposing
>>>>>> the same to the userspace via HWCAPs
>>>>
>>>>
>>>>>> +#define HWCAP_FPHP		(1 << 9)
>>>>>> +#define HWCAP_ASIMDHP		(1 << 10)
>>>>>
>>>>> Where did we get to with the mrs trapping you proposed here?
>>>>>
>>>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>>
>>>> We are yet to get some feedback from glibc/gcc folks. Siddhesh was looking
>>>> to make use of it [2]. But haven't heard anything back. Ramana mentioned
>>>> (in private) that they had some plans to take a look at it.
>>>
>>> I believe one of Adhemerval's ideas was similar to what I had
>>> mentioned back then, which was to provide all of the CPU information
>>> in a single file instead of having to traverse a directory structure.
>>
>> My understanding was that libc needed this information extremely early
>> on (i.e. before it could even issue system calls), and therefore such
>> an approach would be in addition to the proposal here. Am I mistaken?
> 
> If the idea is to use these instruction for function implementation selection
> (iFUNC) the idea is to have on PLT resolution either by accessing it directly
> or using a caching mechanism. x86_64 does something similar with cacheline
> information: it issues a single cpuid and create processor information table
> based on its information (it is also what the __builtin_supports() also
> does).
> 

__builtin_cpu_supports is not a single cpuid on x86, it is
a cpuid per dso with one cache per dso.

(gcc-5 used a single cache in libgcc_s.so.1 and that
turned out to be broken because ifunc in other dsos
could not reliably access it.)

>>> The other idea was to add a vDSO function that returns this data so as
>>> to avoid (or at least reduce) the context switch latency.
>>
>> I'm not at all keen on adding a data ABI to the vDSO. I think people tried
>> similar things in the past (something on PPC?) and have horror stories
>> from that.
> 
> In fact ppc still exports it in vDSO (include/asm/vdso_datapage.h), with
> information like the LPAR cfg, platform, processor, {d,i}cache, etc.
> I recall that I have see some code back at IBM that tried to use these
> fields directly, but indeed it is not recommended.
> 
> What I have in mind is something what ppc does with __kernel_get_syscall_map.
> It is vDSO function that returns a vDSO internal data related to which
> syscalls are implemented in the running kernel (through a bitmap field).
> 

fs access or vdso does not work for ifunc based dispatch
(assuming the current ifunc implementation in glibc).

(for vdso you need the AT_SYSINFO_EHDR auxval somehow and
then implement elf symbol lookup in the ifunc resolver
without calling any libc function. passing auxvals to the
ifunc resolver can be done by changing the ifunc abi, but
doing symbol lookups there is unrealistic.)

in the libc (e.g. for memcpy) ifunc is a bit easier to use,
but in user code (function-multi-versioning) ifunc is very
limited.

i wrote about the ifunc limitations here:
https://sourceware.org/ml/libc-alpha/2015-11/msg00108.html
see point (4) and (5).
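
To make the first step of the vdso route concrete, here is a rough sketch
of locating the vDSO image from ordinary code. is_elf_image() and
vdso_base() are hypothetical helper names. Note that it calls getauxval()
and memcmp(), i.e. exactly the kind of libc calls that are not safely
available inside an ifunc resolver, and it stops before the ELF symbol
lookup, which is the hard part.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Hypothetical helper: does this address start with the ELF magic? */
static int is_elf_image(const void *base)
{
	return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}

/*
 * Hypothetical helper: base address of the vDSO as advertised by the
 * kernel in the auxiliary vector, or NULL if no vDSO is mapped.
 */
static const void *vdso_base(void)
{
	return (const void *)(uintptr_t)getauxval(AT_SYSINFO_EHDR);
}
```

A caller would still have to walk the dynamic section and symbol tables
of that image by hand to find a function such as a hypothetical CPU-info
entry point.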


* [PATCH] arm64: Add support for Half precision floating point
  2016-02-02 17:31           ` Szabolcs Nagy
@ 2016-02-02 18:12             ` Adhemerval Zanella
  2016-02-02 18:25               ` Szabolcs Nagy
  0 siblings, 1 reply; 20+ messages in thread
From: Adhemerval Zanella @ 2016-02-02 18:12 UTC (permalink / raw)
  To: linux-arm-kernel



On 02-02-2016 15:31, Szabolcs Nagy wrote:
> On 28/01/16 16:51, Adhemerval Zanella wrote:
>> On 28-01-2016 14:07, Will Deacon wrote:
>>> On Tue, Jan 26, 2016 at 10:25:38PM +0530, Siddhesh Poyarekar wrote:
>>>> Adding Adhemerval to cc since he had volunteered to follow up on this,
>>>> mainly because he had a couple of additional ideas on the kernel
>>>> front.
>>>>
>>>> On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
>>>>> On 26/01/16 16:02, Will Deacon wrote:
>>>>>> Hi Suzuki,
>>>>>>
>>>>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>>>>> ARMv8.2 extensions [1] include an optional feature, which supports
>>>>>>> half precision(16bit) floating point/asimd data processing
>>>>>>> instructions. This patch adds support for detecting and exposing
>>>>>>> the same to the userspace via HWCAPs
>>>>>
>>>>>
>>>>>>> +#define HWCAP_FPHP		(1 << 9)
>>>>>>> +#define HWCAP_ASIMDHP		(1 << 10)
>>>>>>
>>>>>> Where did we get to with the mrs trapping you proposed here?
>>>>>>
>>>>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>>>
>>>>> We are yet to get some feedback from glibc/gcc folks. Siddhesh was looking
>>>>> to make use of it [2]. But haven't heard anything back. Ramana mentioned
>>>>> (in private) that they had some plans to take a look at it.
>>>>
>>>> I believe one of Adhemerval's ideas was similar to what I had
>>>> mentioned back then, which was to provide all of the CPU information
>>>> in a single file instead of having to traverse a directory structure.
>>>
>>> My understanding was that libc needed this information extremely early
>>> on (i.e. before it could even issue system calls), and therefore such
>>> an approach would be in addition to the proposal here. Am I mistaken?
>>
>> If the idea is to use these instruction for function implementation selection
>> (iFUNC) the idea is to have on PLT resolution either by accessing it directly
>> or using a caching mechanism. x86_64 does something similar with cacheline
>> information: it issues a single cpuid and create processor information table
>> based on its information (it is also what the __builtin_supports() also
>> does).
>>
> 
> __builtin_supports is not a single cpuid on x86, it is
> a cpuid per dso with one cache per dso.
> 
> (gcc-5 used a single cache in libgcc_s.so.1 and that
> turned out to be broken because ifunc in other dsos
> could not reliably access it.)

It is with static libgcc (the default), but if you use -shared-libgcc only one
__cpu_model (used by __builtin_cpu_supports) will be linked.  But since
static libgcc is the default, it will indeed be one per DSO.

> 
>>>> The other idea was to add a vDSO function that returns this data so as
>>>> to avoid (or at least reduce) the context switch latency.
>>>
>>> I'm not at all keen on adding a data ABI to the vDSO. I think people tried
>>> similar things in the past (something on PPC?) and have horror stories
>>> from that.
>>
>> In fact ppc still exports it in vDSO (include/asm/vdso_datapage.h), with
>> information like the LPAR cfg, platform, processor, {d,i}cache, etc.
>> I recall that I have see some code back at IBM that tried to use these
>> fields directly, but indeed it is not recommended.
>>
>> What I have in mind is something what ppc does with __kernel_get_syscall_map.
>> It is vDSO function that returns a vDSO internal data related to which
>> syscalls are implemented in the running kernel (through a bitmap field).
>>
> 
> fs access or vdso does not work for ifunc based dispatch
> (assuming the current ifunc implementation in glibc).
> 
> (for vdso you need the AT_SYSINFO_EHDR auxval somehow and
> then implement elf symbol lookup in the ifunc resolver
> without calling any libc function. passing auxvals to the
> ifunc resolver can be done by changing the ifunc abi, but
> doing symbol lookups there is unrealistic.)
> 
> in the libc (e.g. for memcpy) ifunc is a bit easier to use,
> but in user code (function-multi-versioning) ifunc is very
> limited.
> 
> i wrote about the ifunc limitations here:
> https://sourceware.org/ml/libc-alpha/2015-11/msg00108.html
> see point (4) and (5).
> 

I recall this thread, and indeed iFUNC has a set of limitations.  For use
within libc itself, though, it might be safe under the constraints you have described.

As for vDSO usage, I think it might be safe to use within GLIBC
with the correct vDSO pointer initialization order. At least that is what
GLIBC does for gettimeofday on x86_64 and powerpc (the iFUNC returns
the vDSO function pointer).


* [PATCH] arm64: Add support for Half precision floating point
  2016-02-02 18:12             ` Adhemerval Zanella
@ 2016-02-02 18:25               ` Szabolcs Nagy
  2016-02-02 18:28                 ` Adhemerval Zanella
  0 siblings, 1 reply; 20+ messages in thread
From: Szabolcs Nagy @ 2016-02-02 18:25 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/02/16 18:12, Adhemerval Zanella wrote:
> On 02-02-2016 15:31, Szabolcs Nagy wrote:
>> On 28/01/16 16:51, Adhemerval Zanella wrote:
>>> On 28-01-2016 14:07, Will Deacon wrote:
>>>> On Tue, Jan 26, 2016 at 10:25:38PM +0530, Siddhesh Poyarekar wrote:
>>>>> Adding Adhemerval to cc since he had volunteered to follow up on this,
>>>>> mainly because he had a couple of additional ideas on the kernel
>>>>> front.
>>>>>
>>>>> On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
>>>>>> On 26/01/16 16:02, Will Deacon wrote:
>>>>>>> Hi Suzuki,
>>>>>>>
>>>>>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>>>>>> ARMv8.2 extensions [1] include an optional feature, which supports
>>>>>>>> half precision(16bit) floating point/asimd data processing
>>>>>>>> instructions. This patch adds support for detecting and exposing
>>>>>>>> the same to the userspace via HWCAPs
>>>>>>
>>>>>>
>>>>>>>> +#define HWCAP_FPHP		(1 << 9)
>>>>>>>> +#define HWCAP_ASIMDHP		(1 << 10)
>>>>>>>
>>>>>>> Where did we get to with the mrs trapping you proposed here?
>>>>>>>
>>>>>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>>>>
>>>>>> We are yet to get some feedback from glibc/gcc folks. Siddhesh was looking
>>>>>> to make use of it [2]. But haven't heard anything back. Ramana mentioned
>>>>>> (in private) that they had some plans to take a look at it.
>>>>>
>>>>> I believe one of Adhemerval's ideas was similar to what I had
>>>>> mentioned back then, which was to provide all of the CPU information
>>>>> in a single file instead of having to traverse a directory structure.
>>>>
>>>> My understanding was that libc needed this information extremely early
>>>> on (i.e. before it could even issue system calls), and therefore such
>>>> an approach would be in addition to the proposal here. Am I mistaken?
>>>
>>> If the idea is to use these instruction for function implementation selection
>>> (iFUNC) the idea is to have on PLT resolution either by accessing it directly
>>> or using a caching mechanism. x86_64 does something similar with cacheline
>>> information: it issues a single cpuid and create processor information table
>>> based on its information (it is also what the __builtin_supports() also
>>> does).
>>>
>>
>> __builtin_supports is not a single cpuid on x86, it is
>> a cpuid per dso with one cache per dso.
>>
>> (gcc-5 used a single cache in libgcc_s.so.1 and that
>> turned out to be broken because ifunc in other dsos
>> could not reliably access it.)
> 
> It is with static libgcc (default), but if you use -shared-gcc only one
> __cpu_model (used by __builtin_cpu_supports) will be linked.  But since
> static libgcc is default it will be indeed one per DSO.

with shared libgcc x86 fmv is broken, the ifunc
resolver may run before libgcc gets relocated.

fwiw shared libgcc is also broken on arm with old kernels.
(because it aborts if 64bit atomics is not supported,
the check assumes it only gets linked in if user code
uses 64bit atomics, but with shared libgcc the check
is always done.)

so i don't think shared libgcc is well supported.

>>>>> The other idea was to add a vDSO function that returns this data so as
>>>>> to avoid (or at least reduce) the context switch latency.
>>>>
>>>> I'm not at all keen on adding a data ABI to the vDSO. I think people tried
>>>> similar things in the past (something on PPC?) and have horror stories
>>>> from that.
>>>
>>> In fact ppc still exports it in vDSO (include/asm/vdso_datapage.h), with
>>> information like the LPAR cfg, platform, processor, {d,i}cache, etc.
>>> I recall that I have see some code back at IBM that tried to use these
>>> fields directly, but indeed it is not recommended.
>>>
>>> What I have in mind is something what ppc does with __kernel_get_syscall_map.
>>> It is vDSO function that returns a vDSO internal data related to which
>>> syscalls are implemented in the running kernel (through a bitmap field).
>>>
>>
>> fs access or vdso does not work for ifunc based dispatch
>> (assuming the current ifunc implementation in glibc).
>>
>> (for vdso you need the AT_SYSINFO_EHDR auxval somehow and
>> then implement elf symbol lookup in the ifunc resolver
>> without calling any libc function. passing auxvals to the
>> ifunc resolver can be done by changing the ifunc abi, but
>> doing symbol lookups there is unrealistic.)
>>
>> in the libc (e.g. for memcpy) ifunc is a bit easier to use,
>> but in user code (function-multi-versioning) ifunc is very
>> limited.
>>
>> i wrote about the ifunc limitations here:
>> https://sourceware.org/ml/libc-alpha/2015-11/msg00108.html
>> see point (4) and (5).
>>
> 
> I recall this thread and indeed iFUNC have a set of limitations.  Although for
> use within libc itself it might be safe with the constraints you have described.
> 
> Now for vDSO usage I think it might be safe to use within GLIBC
> with correct vDSO pointers initialization order. At least it is done
> on GLIBC for gettimeofday for x86_64 and powerpc (the iFUNC returns
> the vDSO function pointer).
> 

i don't see how that can work with static linking.
(vdso setup happens after ifunc resolvers are run)


* [PATCH] arm64: Add support for Half precision floating point
  2016-02-02 18:25               ` Szabolcs Nagy
@ 2016-02-02 18:28                 ` Adhemerval Zanella
  0 siblings, 0 replies; 20+ messages in thread
From: Adhemerval Zanella @ 2016-02-02 18:28 UTC (permalink / raw)
  To: linux-arm-kernel



On 02-02-2016 16:25, Szabolcs Nagy wrote:
> On 02/02/16 18:12, Adhemerval Zanella wrote:
>> On 02-02-2016 15:31, Szabolcs Nagy wrote:
>>> On 28/01/16 16:51, Adhemerval Zanella wrote:
>>>> On 28-01-2016 14:07, Will Deacon wrote:
>>>>> On Tue, Jan 26, 2016 at 10:25:38PM +0530, Siddhesh Poyarekar wrote:
>>>>>> Adding Adhemerval to cc since he had volunteered to follow up on this,
>>>>>> mainly because he had a couple of additional ideas on the kernel
>>>>>> front.
>>>>>>
>>>>>> On Tue, Jan 26, 2016 at 04:21:43PM +0000, Suzuki K. Poulose wrote:
>>>>>>> On 26/01/16 16:02, Will Deacon wrote:
>>>>>>>> Hi Suzuki,
>>>>>>>>
>>>>>>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>>>>>>> ARMv8.2 extensions [1] include an optional feature, which supports
>>>>>>>>> half precision(16bit) floating point/asimd data processing
>>>>>>>>> instructions. This patch adds support for detecting and exposing
>>>>>>>>> the same to the userspace via HWCAPs
>>>>>>>
>>>>>>>
>>>>>>>>> +#define HWCAP_FPHP		(1 << 9)
>>>>>>>>> +#define HWCAP_ASIMDHP		(1 << 10)
>>>>>>>>
>>>>>>>> Where did we get to with the mrs trapping you proposed here?
>>>>>>>>
>>>>>>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>>>>>
>>>>>>> We are yet to get some feedback from glibc/gcc folks. Siddhesh was looking
>>>>>>> to make use of it [2]. But haven't heard anything back. Ramana mentioned
>>>>>>> (in private) that they had some plans to take a look at it.
>>>>>>
>>>>>> I believe one of Adhemerval's ideas was similar to what I had
>>>>>> mentioned back then, which was to provide all of the CPU information
>>>>>> in a single file instead of having to traverse a directory structure.
>>>>>
>>>>> My understanding was that libc needed this information extremely early
>>>>> on (i.e. before it could even issue system calls), and therefore such
>>>>> an approach would be in addition to the proposal here. Am I mistaken?
>>>>
>>>> If the idea is to use these instruction for function implementation selection
>>>> (iFUNC) the idea is to have on PLT resolution either by accessing it directly
>>>> or using a caching mechanism. x86_64 does something similar with cacheline
>>>> information: it issues a single cpuid and create processor information table
>>>> based on its information (it is also what the __builtin_supports() also
>>>> does).
>>>>
>>>
>>> __builtin_supports is not a single cpuid on x86, it is
>>> a cpuid per dso with one cache per dso.
>>>
>>> (gcc-5 used a single cache in libgcc_s.so.1 and that
>>> turned out to be broken because ifunc in other dsos
>>> could not reliably access it.)
>>
>> It is with static libgcc (default), but if you use -shared-gcc only one
>> __cpu_model (used by __builtin_cpu_supports) will be linked.  But since
>> static libgcc is default it will be indeed one per DSO.
> 
> with shared libgcc x86 fmv is broken, the ifunc
> resolver may run before libgcc gets relocated.
> 
> fwiw shared libgcc is also broken on arm with old kernels.
> (because it aborts if 64bit atomics is not supported,
> the check assumes it only gets linked in if user code
> uses 64bit atomics, but with shared libgcc the check
> is always done.)
> 
> so i dont think shared libgcc is well supported..
> 
>>>>>> The other idea was to add a vDSO function that returns this data so as
>>>>>> to avoid (or at least reduce) the context switch latency.
>>>>>
>>>>> I'm not at all keen on adding a data ABI to the vDSO. I think people tried
>>>>> similar things in the past (something on PPC?) and have horror stories
>>>>> from that.
>>>>
>>>> In fact ppc still exports it in vDSO (include/asm/vdso_datapage.h), with
>>>> information like the LPAR cfg, platform, processor, {d,i}cache, etc.
>>>> I recall that I have see some code back at IBM that tried to use these
>>>> fields directly, but indeed it is not recommended.
>>>>
>>>> What I have in mind is something what ppc does with __kernel_get_syscall_map.
>>>> It is vDSO function that returns a vDSO internal data related to which
>>>> syscalls are implemented in the running kernel (through a bitmap field).
>>>>
>>>
>>> fs access or vdso does not work for ifunc based dispatch
>>> (assuming the current ifunc implementation in glibc).
>>>
>>> (for vdso you need the AT_SYSINFO_EHDR auxval somehow and
>>> then implement elf symbol lookup in the ifunc resolver
>>> without calling any libc function. passing auxvals to the
>>> ifunc resolver can be done by changing the ifunc abi, but
>>> doing symbol lookups there is unrealistic.)
>>>
>>> in the libc (e.g. for memcpy) ifunc is a bit easier to use,
>>> but in user code (function-multi-versioning) ifunc is very
>>> limited.
>>>
>>> i wrote about the ifunc limitations here:
>>> https://sourceware.org/ml/libc-alpha/2015-11/msg00108.html
>>> see point (4) and (5).
>>>
>>
>> I recall this thread and indeed iFUNC have a set of limitations.  Although for
>> use within libc itself it might be safe with the constraints you have described.
>>
>> Now for vDSO usage I think it might be safe to use within GLIBC
>> with correct vDSO pointers initialization order. At least it is done
>> on GLIBC for gettimeofday for x86_64 and powerpc (the iFUNC returns
>> the vDSO function pointer).
>>
> 
> i don't see how that can work with static linking.
> (vdso setup happens after ifunc resolvers are run)

Direct syscalls are used for the static case. I haven't yet dug into why exactly
vDSO setup happens after ifunc resolution, or whether it is possible to change it to
enable this for static linking as well.


* [PATCH] arm64: Add support for Half precision floating point
  2016-01-28 16:00     ` Will Deacon
@ 2016-02-16 11:48       ` Szabolcs Nagy
  2016-02-16 11:53         ` Will Deacon
  0 siblings, 1 reply; 20+ messages in thread
From: Szabolcs Nagy @ 2016-02-16 11:48 UTC (permalink / raw)
  To: linux-arm-kernel

On 28/01/16 16:00, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 04:11:51PM +0000, Catalin Marinas wrote:
>> On Tue, Jan 26, 2016 at 04:02:58PM +0000, Will Deacon wrote:
>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>> +#define HWCAP_FPHP		(1 << 9)
>>>> +#define HWCAP_ASIMDHP		(1 << 10)
>>>
>>> Where did we get to with the mrs trapping you proposed here?
>>>
>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>
>>> At some point, we need to consider whether or not we want to continue
>>> adding new HWCAPs or whether your suggestion above is actually useful
>>> to userspace.
>>
>> IMO, even if we merge the MRS emulation, I would still like to see
>> HWCAPs exported. We are not short on bits yet (53 to go ;)).
> 
> I'm less keen. HWCAPs don't align well with the way that the ARM
> architecture versions features and we should be encouraging people to
> use the MRS emulation if it exists.
> 
>>> Did the libc guys get anywhere with a prototype? What do we need to do
>>> to make progress with it?
>>
>> This investigation should indeed continue but I think it is orthogonal.
> 
> Not if its intended to replace HWCAPs in the longterm.
> 

userspace needs HWCAP bits independently of the MIDR emulation.

MIDR cannot replace HWCAP:

- MIDR does not map to features in a future-proof way
  (we don't know which MIDR will indicate fp16 availability)

- MIDR would be useful for more fine-grained uarch specific
  tuning decisions, HWCAP is for arch extensions like fp16.

- We don't yet know if the proposed MIDR emulation solves
  all userspace issues we want it to solve. I'll try to
  investigate that as well as the alternative VDSO based
  approach (if VDSO can work, that would be better for
  userspace). There are nasty issues here so the conclusion
  might take a while, but that should not hold up HWCAPs.

thanks


* [PATCH] arm64: Add support for Half precision floating point
  2016-02-16 11:48       ` Szabolcs Nagy
@ 2016-02-16 11:53         ` Will Deacon
  2016-02-16 12:57           ` Szabolcs Nagy
  0 siblings, 1 reply; 20+ messages in thread
From: Will Deacon @ 2016-02-16 11:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 16, 2016 at 11:48:14AM +0000, Szabolcs Nagy wrote:
> On 28/01/16 16:00, Will Deacon wrote:
> > On Tue, Jan 26, 2016 at 04:11:51PM +0000, Catalin Marinas wrote:
> >> On Tue, Jan 26, 2016 at 04:02:58PM +0000, Will Deacon wrote:
> >>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
> >>>> +#define HWCAP_FPHP		(1 << 9)
> >>>> +#define HWCAP_ASIMDHP		(1 << 10)
> >>>
> >>> Where did we get to with the mrs trapping you proposed here?
> >>>
> >>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
> >>>
> >>> At some point, we need to consider whether or not we want to continue
> >>> adding new HWCAPs or whether your suggestion above is actually useful
> >>> to userspace.
> >>
> >> IMO, even if we merge the MRS emulation, I would still like to see
> >> HWCAPs exported. We are not short on bits yet (53 to go ;)).
> > 
> > I'm less keen. HWCAPs don't align well with the way that the ARM
> > architecture versions features and we should be encouraging people to
> > use the MRS emulation if it exists.
> > 
> >>> Did the libc guys get anywhere with a prototype? What do we need to do
> >>> to make progress with it?
> >>
> >> This investigation should indeed continue but I think it is orthogonal.
> > 
> > Not if its intended to replace HWCAPs in the longterm.
> > 
> 
> userspace needs HWCAP bits independently of the MIDR emulation.
> 
> MIDR cannot replace HWCAP:
> 
> - MIDR does not map to features in a future-proof way
>   (we don't know which MIDR will indicate fp16 availability)
> 
> - MIDR would be useful for more fine-grained uarch specific
>   tuning decisions, HWCAP is for arch extensions like fp16.
> 
> - We don't yet know if the proposed MIDR emulation solves
>   all userspace issues we want it to solve. I'll try to
>   investigate that as well as the alternative VDSO based
>   approach (if VDSO can work, that would be better for
>   userspace). There are nasty issues here so the conclusion
>   might take a while, but that should not hold up HWCAPs.

I'm not solely proposing MIDR as an alternative to HWCAP. I'm proposing
that the feature registers, e.g. ID_AA64MMFR0_EL1 are used instead.

Will
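
As an illustration of the feature-register approach, a minimal sketch of
decoding the FP field of ID_AA64PFR0_EL1, mirroring the thresholds in the
patch's HWCAP_CAP entries (0 for HWCAP_FP, 1 for HWCAP_FPHP). The shift
values follow the architectural field positions; the helper names are
hypothetical, and how the raw register value is obtained in userspace is
assumed to be handled elsewhere (e.g. by the mrs emulation).

```c
#include <stdint.h>

/* Architectural positions of the 4-bit fields in ID_AA64PFR0_EL1. */
#define ID_AA64PFR0_FP_SHIFT	16
#define ID_AA64PFR0_ASIMD_SHIFT	20

/*
 * Hypothetical helper: extract a 4-bit ID field as a signed value.
 * The FP and AdvSIMD fields are signed: 0xf (-1) means "not implemented".
 */
static int id_field_signed(uint64_t reg, unsigned shift)
{
	int field = (int)((reg >> shift) & 0xf);

	return (field & 0x8) ? field - 16 : field;
}

/* Base floating point is present when the FP field is >= 0. */
static int has_fp(uint64_t pfr0)
{
	return id_field_signed(pfr0, ID_AA64PFR0_FP_SHIFT) >= 0;
}

/* Half-precision FP is present when the FP field is >= 1. */
static int has_fphp(uint64_t pfr0)
{
	return id_field_signed(pfr0, ID_AA64PFR0_FP_SHIFT) >= 1;
}
```

The "greater than or equal" comparison is what lets a later field value
imply the earlier features, which is the versioning scheme HWCAP bits do
not express directly.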


* [PATCH] arm64: Add support for Half precision floating point
  2016-02-16 11:53         ` Will Deacon
@ 2016-02-16 12:57           ` Szabolcs Nagy
  0 siblings, 0 replies; 20+ messages in thread
From: Szabolcs Nagy @ 2016-02-16 12:57 UTC (permalink / raw)
  To: linux-arm-kernel

On 16/02/16 11:53, Will Deacon wrote:
> On Tue, Feb 16, 2016 at 11:48:14AM +0000, Szabolcs Nagy wrote:
>> On 28/01/16 16:00, Will Deacon wrote:
>>> On Tue, Jan 26, 2016 at 04:11:51PM +0000, Catalin Marinas wrote:
>>>> On Tue, Jan 26, 2016 at 04:02:58PM +0000, Will Deacon wrote:
>>>>> On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K Poulose wrote:
>>>>>> +#define HWCAP_FPHP		(1 << 9)
>>>>>> +#define HWCAP_ASIMDHP		(1 << 10)
>>>>>
>>>>> Where did we get to with the mrs trapping you proposed here?
>>>>>
>>>>>   http://lists.infradead.org/pipermail/linux-arm-kernel/2015-October/374609.html
>>>>>
>>>>> At some point, we need to consider whether or not we want to continue
>>>>> adding new HWCAPs or whether your suggestion above is actually useful
>>>>> to userspace.
>>>>
>>>> IMO, even if we merge the MRS emulation, I would still like to see
>>>> HWCAPs exported. We are not short on bits yet (53 to go ;)).
>>>
>>> I'm less keen. HWCAPs don't align well with the way that the ARM
>>> architecture versions features and we should be encouraging people to
>>> use the MRS emulation if it exists.
>>>
>>>>> Did the libc guys get anywhere with a prototype? What do we need to do
>>>>> to make progress with it?
>>>>
>>>> This investigation should indeed continue but I think it is orthogonal.
>>>
>>> Not if its intended to replace HWCAPs in the longterm.
>>>
>>
>> userspace needs HWCAP bits independently of the MIDR emulation.
>>
>> MIDR cannot replace HWCAP:
>>
>> - MIDR does not map to features in a future-proof way
>>   (we don't know which MIDR will indicate fp16 availability)
>>
>> - MIDR would be useful for more fine-grained uarch specific
>>   tuning decisions, HWCAP is for arch extensions like fp16.
>>
>> - We don't yet know if the proposed MIDR emulation solves
>>   all userspace issues we want it to solve. I'll try to
>>   investigate that as well as the alternative VDSO based
>>   approach (if VDSO can work, that would be better for
>>   userspace). There are nasty issues here so the conclusion
>>   might take a while, but that should not hold up HWCAPs.
> 
> I'm not solely proposing MIDR as an alternative to HWCAP. I'm proposing
> that the feature registers, e.g. ID_AA64MMFR0_EL1 are used instead.
> 

if the ID_* system registers are made available to userspace
(with fixup for heterogeneous systems), that works too.

hwcap is still more accessible in userspace:

- libc gets it without additional syscall/trap
- user code can get it with getauxval(AT_HWCAP)
- ifunc resolvers take a hwcap argument.
- HWCAP_ bits are already exposed in uapi headers.

i'm not familiar with the ID_ regs though.
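
For comparison, a minimal sketch of the hwcap route from user code. The
bit values are the ones proposed in this patch and are only defined here
as a fallback when the headers do not provide them; hwcap_has() and
cpu_has_fp16() are hypothetical helper names. The bits are only
meaningful on arm64: on other architectures AT_HWCAP carries unrelated
flags.

```c
#include <sys/auxv.h>

/* Fallback definitions, using the bit positions from this patch. */
#ifndef HWCAP_FPHP
#define HWCAP_FPHP	(1UL << 9)
#endif
#ifndef HWCAP_ASIMDHP
#define HWCAP_ASIMDHP	(1UL << 10)
#endif

/* Hypothetical helper: test one capability bit in a hwcap word. */
static int hwcap_has(unsigned long hwcap, unsigned long bit)
{
	return (hwcap & bit) != 0;
}

/*
 * Hypothetical helper: true when the kernel advertises both scalar and
 * AdvSIMD half-precision support. Only valid on arm64.
 */
static int cpu_has_fp16(void)
{
	unsigned long hwcap = getauxval(AT_HWCAP);

	return hwcap_has(hwcap, HWCAP_FPHP) && hwcap_has(hwcap, HWCAP_ASIMDHP);
}
```

An ifunc resolver would use the same bit test on the hwcap value it is
handed, without calling getauxval() itself.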


* [PATCH] arm64: Add support for Half precision floating point
  2016-01-26 15:52 [PATCH] arm64: Add support for Half precision floating point Suzuki K Poulose
  2016-01-26 16:02 ` Will Deacon
@ 2016-02-26 15:37 ` Catalin Marinas
  1 sibling, 0 replies; 20+ messages in thread
From: Catalin Marinas @ 2016-02-26 15:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jan 26, 2016 at 03:52:46PM +0000, Suzuki K. Poulose wrote:
> ARMv8.2 extensions [1] include an optional feature, which supports
> half precision(16bit) floating point/asimd data processing
> instructions. This patch adds support for detecting and exposing
> the same to the userspace via HWCAPs
> 
> [1] https://community.arm.com/groups/processors/blog/2016/01/05/armv8-a-architecture-evolution
> 
> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>

Applied. Thanks.

-- 
Catalin


Thread overview: 20+ messages
2016-01-26 15:52 [PATCH] arm64: Add support for Half precision floating point Suzuki K Poulose
2016-01-26 16:02 ` Will Deacon
2016-01-26 16:11   ` Catalin Marinas
2016-01-28 16:00     ` Will Deacon
2016-02-16 11:48       ` Szabolcs Nagy
2016-02-16 11:53         ` Will Deacon
2016-02-16 12:57           ` Szabolcs Nagy
2016-01-26 16:21   ` Suzuki K. Poulose
2016-01-26 16:55     ` Siddhesh Poyarekar
2016-01-28 16:07       ` Will Deacon
2016-01-28 16:46         ` Siddhesh Poyarekar
2016-01-28 17:27           ` Catalin Marinas
2016-01-28 17:44             ` Siddhesh Poyarekar
2016-01-28 17:55               ` Suzuki K. Poulose
2016-01-28 16:51         ` Adhemerval Zanella
2016-02-02 17:31           ` Szabolcs Nagy
2016-02-02 18:12             ` Adhemerval Zanella
2016-02-02 18:25               ` Szabolcs Nagy
2016-02-02 18:28                 ` Adhemerval Zanella
2016-02-26 15:37 ` Catalin Marinas
