All of lore.kernel.org
 help / color / mirror / Atom feed
* CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-29 13:07 George Spelvin
  2019-03-29 20:00   ` Michael Cree
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-29 13:07 UTC (permalink / raw)
  To: linux-alpha, linux-mips, linux-s390, linuxppc-dev; +Cc: lkml

(Cross-posted in case there are generic issues; please trim if
discussion wanders into single-architecture details.)

I was working on some scaling code that can benefit from 64x64->128-bit
multiplies.  GCC supports an __int128 type on processors with hardware
support (including z/Arch and MIPS64), but the support was broken on
early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.

Currently, of the ten 64-bit architectures Linux supports, that's
only enabled on x86, ARM, and RISC-V.

SPARC and HP-PA don't have support.

But that leaves Alpha, Mips, PowerPC, and S/390x.

Current mips64, powerpc64, and s390x gcc seems to generate sensible code
for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.

I don't have easy access to an Alpha cross-compiler to test, but
as it has UMULH, I suspect it would work, too.

Is there a reason it hasn't been enabled on these platforms?

There might be a MIPS64r6 issue, since r6 changed from DMULTU
writing the lo and hi registers to DMULU/DMUHU, and gcc 8.3, at
least, doesn't know how to generate inline code for the latter.

(Note that users *also* check __INT128__, which is defined if GCC
claims to support __int128, so you don't have to worry about 32-bit
compiles or ancient compilers.  It only has to be conditional on
*broken* support.)


FWIW, the code I'm working on has this inner loop:
(https://arxiv.org/abs/1805.10941 for details)

u64 get_random_u64(void);
u64 get_random_max64(u64 range, u64 lim)
{
	unsigned __int128 prod;
	do {
		prod = (unsigned __int128)get_random_u64() * range;
	} while (unlikely((u64)prod < lim));
	return prod >> 64;
}

Which turns into these inner loops:
MIPS:
.L7:
	jal	get_random_u64
	nop
	dmultu $2,$17
	mflo	$3
	sltu	$4,$3,$16
	bne	$4,$0,.L7
	mfhi	$2

PowerPC:
.L9:
	bl get_random_u64
	nop
	mulld 9,3,31
	mulhdu 3,3,31
	cmpld 7,30,9
	bgt 7,.L9

s/390:
.L13:
	brasl	%r14,get_random_u64@PLT
	lgr	%r5,%r2
	mlgr	%r4,%r10
	lgr	%r2,%r4
	clgr	%r11,%r5
	jh	.L13

I like that the MIPS code leaves the high half of the product in
the hi register until it tests the low half; I wish PowerPC would
similarly move the mulhdu *after* the loop, like the following
hypothetical MIPS R6 code:

.L7:
	balc	get_random_u64
	dmulu	$3, $2, $17
	sltu	$3, $3, $16
	bnezc	$3, .L7
	dmuhu	$2, $2, $17

Or this handwritten Alpha code:
1:
	bsr	$26, get_random_u64
	mulq	$0, $9, $1	# $9 is range
	cmpult	$1, $10, $1	# $10 is lim
	bne	$1, 1b
	umulh	$0, $9, $0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-29 13:07 CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha? George Spelvin
@ 2019-03-29 20:00   ` Michael Cree
  2019-03-29 20:25   ` Segher Boessenkool
  2019-03-30  8:43   ` Heiko Carstens
  2 siblings, 0 replies; 19+ messages in thread
From: Michael Cree @ 2019-03-29 20:00 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-alpha, linux-mips, linux-s390, linuxppc-dev

On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
[snip] 
> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.

On Debian/Ubuntu it is just a matter of:
apt-get install gcc-alpha-linux-gnu

> Or this handwritten Alpha code:
> 1:
> 	bsr	$26, get_random_u64
> 	mulq	$0, $9, $1	# $9 is range
> 	cmpult	$1, $10, $1	# $10 is lim
> 	bne	$1, 1b
> 	umulh	$0, $9, $0

The compiler produces:

$L2:
	ldq $27,get_random_u64($29)		!literal!2
	jsr $26,($27),get_random_u64		!lituse_jsr!2
	ldah $29,0($26)		!gpdisp!3
	mulq $0,$9,$1
	lda $29,0($29)		!gpdisp!3
	umulh $0,$9,$0
	cmpule $10,$1,$1
	beq $1,$L2

It does move the umulh inside the loop but that seems sensible since
the use of unlikely() implies that the loop is unlikely to be taken
so on average it would be a good bet to start the calculation of
umulh earlier since it has a few cycles latency to get the result,
and it is pipelined so it can be calculated in the shadow of the
mulq instruction on the same execution unit.  On the older CPUs
(before EV6 which are not out-of-order execution) having the umulh
inside the loop may be a net gain.

Cheers,
Michael.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-29 20:00   ` Michael Cree
  0 siblings, 0 replies; 19+ messages in thread
From: Michael Cree @ 2019-03-29 20:00 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-s390, linuxppc-dev, linux-mips, linux-alpha

On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
[snip] 
> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.

On Debian/Ubuntu it is just a matter of:
apt-get install gcc-alpha-linux-gnu

> Or this handwritten Alpha code:
> 1:
> 	bsr	$26, get_random_u64
> 	mulq	$0, $9, $1	# $9 is range
> 	cmpult	$1, $10, $1	# $10 is lim
> 	bne	$1, 1b
> 	umulh	$0, $9, $0

The compiler produces:

$L2:
	ldq $27,get_random_u64($29)		!literal!2
	jsr $26,($27),get_random_u64		!lituse_jsr!2
	ldah $29,0($26)		!gpdisp!3
	mulq $0,$9,$1
	lda $29,0($29)		!gpdisp!3
	umulh $0,$9,$0
	cmpule $10,$1,$1
	beq $1,$L2

It does move the umulh inside the loop but that seems sensible since
the use of unlikely() implies that the loop is unlikely to be taken
so on average it would be a good bet to start the calculation of
umulh earlier since it has a few cycles latency to get the result,
and it is pipelined so it can be calculated in the shadow of the
mulq instruction on the same execution unit.  On the older CPUs
(before EV6 which are not out-of-order execution) having the umulh
inside the loop may be a net gain.

Cheers,
Michael.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-29 13:07 CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha? George Spelvin
@ 2019-03-29 20:25   ` Segher Boessenkool
  2019-03-29 20:25   ` Segher Boessenkool
  2019-03-30  8:43   ` Heiko Carstens
  2 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2019-03-29 20:25 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-alpha, linux-mips, linux-s390, linuxppc-dev

Hi!

On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
> 
> Currently, of the ten 64-bit architectures Linux supports, that's
> only enabled on x86, ARM, and RISC-V.
> 
> SPARC and HP-PA don't have support.
> 
> But that leaves Alpha, Mips, PowerPC, and S/390x.
> 
> Current mips64, powerpc64, and s390x gcc seems to generate sensible code
> for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.

Yup.

> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.

https://mirrors.edge.kernel.org/pub/tools/crosstool/

> u64 get_random_u64(void);
> u64 get_random_max64(u64 range, u64 lim)
> {
> 	unsigned __int128 prod;
> 	do {
> 		prod = (unsigned __int128)get_random_u64() * range;
> 	} while (unlikely((u64)prod < lim));
> 	return prod >> 64;
> }

> Which turns into these inner loops:
> MIPS:
> .L7:
> 	jal	get_random_u64
> 	nop
> 	dmultu $2,$17
> 	mflo	$3
> 	sltu	$4,$3,$16
> 	bne	$4,$0,.L7
> 	mfhi	$2
> 
> PowerPC:
> .L9:
> 	bl get_random_u64
> 	nop
> 	mulld 9,3,31
> 	mulhdu 3,3,31
> 	cmpld 7,30,9
> 	bgt 7,.L9
> 
> s/390:
> .L13:
> 	brasl	%r14,get_random_u64@PLT
> 	lgr	%r5,%r2
> 	mlgr	%r4,%r10
> 	lgr	%r2,%r4
> 	clgr	%r11,%r5
> 	jh	.L13
> 
> I like that the MIPS code leaves the high half of the product in
> the hi register until it tests the low half; I wish PowerPC would
> similarly move the mulhdu *after* the loop,

The MIPS code has the multiplication inside the loop as well, and even
the mfhi I think: MIPS has delay slots.

GCC treats the int128 as one register until it has expanded to RTL, and it
does not do such loop optimisations after that, apparently.

File a PR please?  https://gcc.gnu.org/bugzilla/


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-29 20:25   ` Segher Boessenkool
  0 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2019-03-29 20:25 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-s390, linuxppc-dev, linux-mips, linux-alpha

Hi!

On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
> 
> Currently, of the ten 64-bit architectures Linux supports, that's
> only enabled on x86, ARM, and RISC-V.
> 
> SPARC and HP-PA don't have support.
> 
> But that leaves Alpha, Mips, PowerPC, and S/390x.
> 
> Current mips64, powerpc64, and s390x gcc seems to generate sensible code
> for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.

Yup.

> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.

https://mirrors.edge.kernel.org/pub/tools/crosstool/

> u64 get_random_u64(void);
> u64 get_random_max64(u64 range, u64 lim)
> {
> 	unsigned __int128 prod;
> 	do {
> 		prod = (unsigned __int128)get_random_u64() * range;
> 	} while (unlikely((u64)prod < lim));
> 	return prod >> 64;
> }

> Which turns into these inner loops:
> MIPS:
> .L7:
> 	jal	get_random_u64
> 	nop
> 	dmultu $2,$17
> 	mflo	$3
> 	sltu	$4,$3,$16
> 	bne	$4,$0,.L7
> 	mfhi	$2
> 
> PowerPC:
> .L9:
> 	bl get_random_u64
> 	nop
> 	mulld 9,3,31
> 	mulhdu 3,3,31
> 	cmpld 7,30,9
> 	bgt 7,.L9
> 
> s/390:
> .L13:
> 	brasl	%r14,get_random_u64@PLT
> 	lgr	%r5,%r2
> 	mlgr	%r4,%r10
> 	lgr	%r2,%r4
> 	clgr	%r11,%r5
> 	jh	.L13
> 
> I like that the MIPS code leaves the high half of the product in
> the hi register until it tests the low half; I wish PowerPC would
> similarly move the mulhdu *after* the loop,

The MIPS code has the multiplication inside the loop as well, and even
the mfhi I think: MIPS has delay slots.

GCC treats the int128 as one register until it has expanded to RTL, and it
does not do such loop optimisations after that, apparently.

File a PR please?  https://gcc.gnu.org/bugzilla/


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-29 13:07 CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha? George Spelvin
@ 2019-03-30  8:43   ` Heiko Carstens
  2019-03-29 20:25   ` Segher Boessenkool
  2019-03-30  8:43   ` Heiko Carstens
  2 siblings, 0 replies; 19+ messages in thread
From: Heiko Carstens @ 2019-03-30  8:43 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-alpha, linux-mips, linux-s390, linuxppc-dev

On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
> (Cross-posted in case there are generic issues; please trim if
> discussion wanders into single-architecture details.)
> 
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
> 
> Currently, of the ten 64-bit architectures Linux supports, that's
> only enabled on x86, ARM, and RISC-V.
> 
> SPARC and HP-PA don't have support.
> 
> But that leaves Alpha, Mips, PowerPC, and S/390x.
> 
> Current mips64, powerpc64, and s390x gcc seems to generate sensible code
> for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.
> 
> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.
> 
> Is there a reason it hasn't been enabled on these platforms?

It hasn't been enabled on s390 simply because at least I wasn't aware
of this config option. Feel free to send a patch, otherwise I will
enable this. Whatever you prefer.

Thanks for pointing this out!


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-30  8:43   ` Heiko Carstens
  0 siblings, 0 replies; 19+ messages in thread
From: Heiko Carstens @ 2019-03-30  8:43 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-s390, linuxppc-dev, linux-mips, linux-alpha

On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
> (Cross-posted in case there are generic issues; please trim if
> discussion wanders into single-architecture details.)
> 
> I was working on some scaling code that can benefit from 64x64->128-bit
> multiplies.  GCC supports an __int128 type on processors with hardware
> support (including z/Arch and MIPS64), but the support was broken on
> early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.
> 
> Currently, of the ten 64-bit architectures Linux supports, that's
> only enabled on x86, ARM, and RISC-V.
> 
> SPARC and HP-PA don't have support.
> 
> But that leaves Alpha, Mips, PowerPC, and S/390x.
> 
> Current mips64, powerpc64, and s390x gcc seems to generate sensible code
> for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.
> 
> I don't have easy access to an Alpha cross-compiler to test, but
> as it has UMULH, I suspect it would work, too.
> 
> Is there a reason it hasn't been enabled on these platforms?

It hasn't been enabled on s390 simply because at least I wasn't aware
of this config option. Feel free to send a patch, otherwise I will
enable this. Whatever you prefer.

Thanks for pointing this out!


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-30  8:43   ` Heiko Carstens
@ 2019-03-30 10:30     ` George Spelvin
  -1 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-30 10:30 UTC (permalink / raw)
  To: heiko.carstens, lkml; +Cc: linux-alpha, linux-mips, linux-s390, linuxppc-dev

On Sat, 30 Mar 2019 at 09:43:47 +0100, Heiko Carstens wrote:
> It hasn't been enabled on s390 simply because at least I wasn't aware
> of this config option. Feel free to send a patch, otherwise I will
> enable this. Whatever you prefer.
> 
> Thanks for pointing this out!

Here's a draft patch, but obviously it should be tested!

From 6f3cc608c49dba33a38e81232a2fd46e8fa8dbcd Mon Sep 17 00:00:00 2001
From: George Spelvin <lkml@sdf.org>
Date: Sat, 30 Mar 2019 10:27:14 +0000
Subject: [PATCH] s390: Enable CONFIG_ARCH_SUPPORTS_INT128 on 64-bit builds

If a platform supports a 64x64->128-bit widening multiply,
that allows more efficient scaling of 64-bit values in various
parts of the kernel.  GCC advertises __int128 support with the
__INT128__ #define, but we care about efficient inline
support, so this is a separate flag.

For s390, that was added on 24 March 2017 by
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=246457
which is part of GCC 7.

It also only applies to TARGET_ARCH12, which I am guessing
corresponds to HAVE_MARCH_ZEC12_FEATURES.  clang support is
pure guesswork.

Signed-off-by: George Spelvin <lkml@sdf.org>
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ed554b09eb3f..43e6dc677f7d 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -372,6 +372,7 @@ endchoice
 
 config 64BIT
 	def_bool y
+	select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 70000 && HAVE_MARCH_ZEC12_FEATURES || CC_IS_CLANG
 
 config COMPAT
 	def_bool y
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-30 10:30     ` George Spelvin
  0 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-30 10:30 UTC (permalink / raw)
  To: heiko.carstens, lkml; +Cc: linux-s390, linuxppc-dev, linux-mips, linux-alpha

On Sat, 30 Mar 2019 at 09:43:47 +0100, Heiko Carstens wrote:
> It hasn't been enabled on s390 simply because at least I wasn't aware
> of this config option. Feel free to send a patch, otherwise I will
> enable this. Whatever you prefer.
> 
> Thanks for pointing this out!

Here's a draft patch, but obviously it should be tested!

From 6f3cc608c49dba33a38e81232a2fd46e8fa8dbcd Mon Sep 17 00:00:00 2001
From: George Spelvin <lkml@sdf.org>
Date: Sat, 30 Mar 2019 10:27:14 +0000
Subject: [PATCH] s390: Enable CONFIG_ARCH_SUPPORTS_INT128 on 64-bit builds

If a platform supports a 64x64->128-bit widening multiply,
that allows more efficient scaling of 64-bit values in various
parts of the kernel.  GCC advertises __int128 support with the
__INT128__ #define, but we care about efficient inline
support, so this is a separate flag.

For s390, that was added on 24 March 2017 by
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=246457
which is part of GCC 7.

It also only applies to TARGET_ARCH12, which I am guessing
corresponds to HAVE_MARCH_ZEC12_FEATURES.  clang support is
pure guesswork.

Signed-off-by: George Spelvin <lkml@sdf.org>
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ed554b09eb3f..43e6dc677f7d 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -372,6 +372,7 @@ endchoice
 
 config 64BIT
 	def_bool y
+	select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 70000 && HAVE_MARCH_ZEC12_FEATURES || CC_IS_CLANG
 
 config COMPAT
 	def_bool y
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-29 20:25   ` Segher Boessenkool
@ 2019-03-30 11:28     ` George Spelvin
  -1 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-30 11:28 UTC (permalink / raw)
  To: lkml, segher; +Cc: linux-alpha, linux-mips, linux-s390, linuxppc-dev

General update:

I've since found the reason for the GCC version check.
It's not *broken* support (apologies for impugning GCC),
but *inefficient* support via a muditi3 helper function.

The version check is the version which added in-line code generation.
For example, s390 support was added in March of 2017 and is part
of the 7.1 release.

I hvave to do some digging through gcc version history to
find when it was added to various architectures.
(And MIPS64v6 support is still lacking in gcc 9.)


On Fri, 29 Mar 2019 at 15:25:58 -0500, Segher Boessenkool wrote:
> On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
>> I don't have easy access to an Alpha cross-compiler to test, but
>> as it has UMULH, I suspect it would work, too.
> 
> https://mirrors.edge.kernel.org/pub/tools/crosstool/

Thanks for the pointer; I have a working Alpha cross-compiler now.
 
>> u64 get_random_u64(void);
>> u64 get_random_max64(u64 range, u64 lim)
>> {
>> 	unsigned __int128 prod;
>> 	do {
>> 		prod = (unsigned __int128)get_random_u64() * range;
>> 	} while (unlikely((u64)prod < lim));
>> 	return prod >> 64;
>> }
> 
>> MIPS:
>> .L7:
>> 	jal	get_random_u64
>> 	nop
>> 	dmultu $2,$17
>> 	mflo	$3
>> 	sltu	$4,$3,$16
>> 	bne	$4,$0,.L7
>> 	mfhi	$2
>> 
>> PowerPC:
>> .L9:
>> 	bl get_random_u64
>> 	nop
>> 	mulld 9,3,31
>> 	mulhdu 3,3,31
>> 	cmpld 7,30,9
>> 	bgt 7,.L9
>>
>> I like that the MIPS code leaves the high half of the product in
>> the hi register until it tests the low half; I wish PowerPC would
>> similarly move the mulhdu *after* the loop,

> The MIPS code has the multiplication inside the loop as well, and even
> the mfhi I think: MIPS has delay slots.

Yes, it's in the delay slot, which is fine (the branch is unlikely,
after all).  But it does the compare (sltu) before accessing %hi, which
is good as %hi often has a longer latency than %lo.  (On out-of-order
cores, of course, none of this matters.)

> GCC treats the int128 as one register until it has expanded to RTL, and it
> does not do such loop optimisations after that, apparently.
> 
> File a PR please?  https://gcc.gnu.org/bugzilla/

Er...  about what?  The fact that the PowerPC code is not
>> PowerPC:
>> .L9:
>> 	bl get_random_u64
>> 	nop
>> 	mulld 9,3,31
>> 	cmpld 7,30,9
>> 	bgt 7,.L9
>> 	mulhdu 3,3,31

I'm not sure quite how to explain it in gcc-ese.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-30 11:28     ` George Spelvin
  0 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-30 11:28 UTC (permalink / raw)
  To: lkml, segher; +Cc: linux-s390, linuxppc-dev, linux-mips, linux-alpha

General update:

I've since found the reason for the GCC version check.
It's not *broken* support (apologies for impugning GCC),
but *inefficient* support via a muditi3 helper function.

The version check is the version which added in-line code generation.
For example, s390 support was added in March of 2017 and is part
of the 7.1 release.

I hvave to do some digging through gcc version history to
find when it was added to various architectures.
(And MIPS64v6 support is still lacking in gcc 9.)


On Fri, 29 Mar 2019 at 15:25:58 -0500, Segher Boessenkool wrote:
> On Fri, Mar 29, 2019 at 01:07:07PM +0000, George Spelvin wrote:
>> I don't have easy access to an Alpha cross-compiler to test, but
>> as it has UMULH, I suspect it would work, too.
> 
> https://mirrors.edge.kernel.org/pub/tools/crosstool/

Thanks for the pointer; I have a working Alpha cross-compiler now.
 
>> u64 get_random_u64(void);
>> u64 get_random_max64(u64 range, u64 lim)
>> {
>> 	unsigned __int128 prod;
>> 	do {
>> 		prod = (unsigned __int128)get_random_u64() * range;
>> 	} while (unlikely((u64)prod < lim));
>> 	return prod >> 64;
>> }
> 
>> MIPS:
>> .L7:
>> 	jal	get_random_u64
>> 	nop
>> 	dmultu $2,$17
>> 	mflo	$3
>> 	sltu	$4,$3,$16
>> 	bne	$4,$0,.L7
>> 	mfhi	$2
>> 
>> PowerPC:
>> .L9:
>> 	bl get_random_u64
>> 	nop
>> 	mulld 9,3,31
>> 	mulhdu 3,3,31
>> 	cmpld 7,30,9
>> 	bgt 7,.L9
>>
>> I like that the MIPS code leaves the high half of the product in
>> the hi register until it tests the low half; I wish PowerPC would
>> similarly move the mulhdu *after* the loop,

> The MIPS code has the multiplication inside the loop as well, and even
> the mfhi I think: MIPS has delay slots.

Yes, it's in the delay slot, which is fine (the branch is unlikely,
after all).  But it does the compare (sltu) before accessing %hi, which
is good as %hi often has a longer latency than %lo.  (On out-of-order
cores, of course, none of this matters.)

> GCC treats the int128 as one register until it has expanded to RTL, and it
> does not do such loop optimisations after that, apparently.
> 
> File a PR please?  https://gcc.gnu.org/bugzilla/

Er...  about what?  The fact that the PowerPC code is not
>> PowerPC:
>> .L9:
>> 	bl get_random_u64
>> 	nop
>> 	mulld 9,3,31
>> 	cmpld 7,30,9
>> 	bgt 7,.L9
>> 	mulhdu 3,3,31

I'm not sure quite how to explain it in gcc-ese.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-30 10:30     ` George Spelvin
@ 2019-03-30 13:00       ` George Spelvin
  -1 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-30 13:00 UTC (permalink / raw)
  To: heiko.carstens, lkml; +Cc: linux-alpha, linux-mips, linux-s390, linuxppc-dev

I've been tracking down when "umulditi3" support was added to various
gcc platforms.  So far, I've found:

ia64: 2005-07-28, gcc 4.1.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=102463

mips: 2008-06-09, gcc 4.4.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=136600 

alpha: 2013-02-01, gcc 4.8.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=195668

s390: 2011-10-07, gcc 4.7.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=179647

ppc64: 2013-02-01, gcc 4.8.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=195667


Here's a revised s390 patch.

From e8ea8c0a5d618385049248649b8c13717b598a42 Mon Sep 17 00:00:00 2001
From: George Spelvin <lkml@sdf.org>
Date: Sat, 30 Mar 2019 10:27:14 +0000
Subject: [PATCH v2] s390: Enable CONFIG_ARCH_SUPPORTS_INT128 on 64-bit builds

If a platform supports a 64x64->128-bit widening multiply,
that allows more efficient scaling of 64-bit values in various
parts of the kernel.  GCC advertises __int128 support with the
__INT128__ #define, but we care about efficient inline
support, so this is a separate flag.

For s390, that was added on 2011-10-07 by
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=179647
which is part of GCC 4.7.

Signed-off-by: George Spelvin <lkml@sdf.org>
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ed554b09eb3f..6ddaee6573f4 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -372,6 +372,7 @@ endchoice
 
 config 64BIT
 	def_bool y
+	select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 40700 || CC_IS_CLANG
 
 config COMPAT
 	def_bool y
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-30 13:00       ` George Spelvin
  0 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-30 13:00 UTC (permalink / raw)
  To: heiko.carstens, lkml; +Cc: linux-s390, linuxppc-dev, linux-mips, linux-alpha

I've been tracking down when "umulditi3" support was added to various
gcc platforms.  So far, I've found:

ia64: 2005-07-28, gcc 4.1.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=102463

mips: 2008-06-09, gcc 4.4.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=136600 

alpha: 2013-02-01, gcc 4.8.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=195668

s390: 2011-10-07, gcc 4.7.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=179647

ppc64: 2013-02-01, gcc 4.8.0
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=195667


Here's a revised s390 patch.

From e8ea8c0a5d618385049248649b8c13717b598a42 Mon Sep 17 00:00:00 2001
From: George Spelvin <lkml@sdf.org>
Date: Sat, 30 Mar 2019 10:27:14 +0000
Subject: [PATCH v2] s390: Enable CONFIG_ARCH_SUPPORTS_INT128 on 64-bit builds

If a platform supports a 64x64->128-bit widening multiply,
that allows more efficient scaling of 64-bit values in various
parts of the kernel.  GCC advertises __int128 support with the
__INT128__ #define, but we care about efficient inline
support, so this is a separate flag.

For s390, that was added on 2011-10-07 by
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=179647
which is part of GCC 4.7.

Signed-off-by: George Spelvin <lkml@sdf.org>
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ed554b09eb3f..6ddaee6573f4 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -372,6 +372,7 @@ endchoice
 
 config 64BIT
 	def_bool y
+	select ARCH_SUPPORTS_INT128 if GCC_VERSION >= 40700 || CC_IS_CLANG
 
 config COMPAT
 	def_bool y
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-29 20:00   ` Michael Cree
  (?)
@ 2019-03-30 23:14   ` Segher Boessenkool
  -1 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2019-03-30 23:14 UTC (permalink / raw)
  To: Michael Cree, George Spelvin, linux-alpha, linux-mips,
	linux-s390, linuxppc-dev

On Sat, Mar 30, 2019 at 09:00:15AM +1300, Michael Cree wrote:
> It does move the umulh inside the loop but that seems sensible since
> the use of unlikely() implies that the loop is unlikely to be taken
> so on average it would be a good bet to start the calculation of
> umulh earlier since it has a few cycles latency to get the result,
> and it is pipelined so it can be calculated in the shadow of the
> mulq instruction on the same execution unit.

That may make sense, but it is not what happens, sorry.  It _starts off_
as part of the loop, and it is never moved outside.

The only difference between a likely loop and an unlikely loop here I've
seen (on all targets I tried) is that with a likely loop the loop target
is aligned, while with an unlikely loop it isn't.

> On the older CPUs
> (before EV6 which are not out-of-order execution) having the umulh
> inside the loop may be a net gain.

Yes.  Similarly, on Power you can often calculate the high mul at the same
time as the low mul, for no extra cost.  This may be true on many archs.


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-30 11:28     ` George Spelvin
@ 2019-03-30 23:52       ` Segher Boessenkool
  -1 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2019-03-30 23:52 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-alpha, linux-mips, linux-s390, linuxppc-dev

On Sat, Mar 30, 2019 at 11:28:21AM +0000, George Spelvin wrote:
> >> I like that the MIPS code leaves the high half of the product in
> >> the hi register until it tests the low half; I wish PowerPC would
> >> similarly move the mulhdu *after* the loop,
> 
> > The MIPS code has the multiplication inside the loop as well, and even
> > the mfhi I think: MIPS has delay slots.
> 
> Yes, it's in the delay slot, which is fine (the branch is unlikely,
> after all).  But it does the compare (sltu) before accessing %hi, which
> is good as %hi often has a longer latency than %lo.  (On out-of-order
> cores, of course, none of this matters.)

Yes.  But it does the mfhi on every iteration, while it only needs it for
the last one (or after the last one).  This may not be more expensive for
the actual hardware, but it is for GCC's cost model

> > GCC treats the int128 as one register until it has expanded to RTL, and it
> > does not do such loop optimisations after that, apparently.
> > 
> > File a PR please?  https://gcc.gnu.org/bugzilla/
> 
> Er...  about what?  The fact that the PowerPC code is not
> >> PowerPC:
> >> .L9:
> >> 	bl get_random_u64
> >> 	nop
> >> 	mulld 9,3,31
> >> 	cmpld 7,30,9
> >> 	bgt 7,.L9
> >> 	mulhdu 3,3,31
> 
> I'm not sure quite how to explain it in gcc-ese.

Yeah, exactly, like that.  This transformation is called "loop sinking"
usually: if anything that is set within a loop is only used after the loop,
it can be set after the loop (provided you keep the set's sources alive).


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-30 23:52       ` Segher Boessenkool
  0 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2019-03-30 23:52 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-s390, linuxppc-dev, linux-mips, linux-alpha

On Sat, Mar 30, 2019 at 11:28:21AM +0000, George Spelvin wrote:
> >> I like that the MIPS code leaves the high half of the product in
> >> the hi register until it tests the low half; I wish PowerPC would
> >> similarly move the mulhdu *after* the loop,
> 
> > The MIPS code has the multiplication inside the loop as well, and even
> > the mfhi I think: MIPS has delay slots.
> 
> Yes, it's in the delay slot, which is fine (the branch is unlikely,
> after all).  But it does the compare (sltu) before accessing %hi, which
> is good as %hi often has a longer latency than %lo.  (On out-of-order
> cores, of course, none of this matters.)

Yes.  But it does the mfhi on every iteration, while it only needs it for
the last one (or after the last one).  This may not be more expensive for
the actual hardware, but it is for GCC's cost model

> > GCC treats the int128 as one register until it has expanded to RTL, and it
> > does not do such loop optimisations after that, apparently.
> > 
> > File a PR please?  https://gcc.gnu.org/bugzilla/
> 
> Er...  about what?  The fact that the PowerPC code is not
> >> PowerPC:
> >> .L9:
> >> 	bl get_random_u64
> >> 	nop
> >> 	mulld 9,3,31
> >> 	cmpld 7,30,9
> >> 	bgt 7,.L9
> >> 	mulhdu 3,3,31
> 
> I'm not sure quite how to explain it in gcc-ese.

Yeah, exactly, like that.  This transformation is called "loop sinking"
usually: if anything that is set within a loop is only used after the loop,
it can be set after the loop (provided you keep the set's sources alive).


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-30 10:30     ` George Spelvin
@ 2019-03-31  0:30       ` Segher Boessenkool
  -1 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2019-03-31  0:30 UTC (permalink / raw)
  To: George Spelvin
  Cc: heiko.carstens, linux-s390, linuxppc-dev, linux-mips, linux-alpha

On Sat, Mar 30, 2019 at 10:30:23AM +0000, George Spelvin wrote:
> For s390, that was added on 24 March 2017 by
> https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=246457
> which is part of GCC 7.
> 
> It also only applies to TARGET_ARCH12, which I am guessing
> corresponds to HAVE_MARCH_ZEC12_FEATURES.

zEC12 is arch10, while z14 is arch12.  See
https://sourceware.org/binutils/docs-2.32/as/s390-Options.html#s390-Options
for example; it lists the correspondences, and states "The processor names
starting with arch refer to the edition number in the Principle of
Operations manual.  They can be used as alternate processor names and have
been added for compatibility with the IBM XL compiler."

Newer GCC does not use the somewhat confusing TARGET_ARCH12 name anymore;
see https://gcc.gnu.org/r264796 .


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
@ 2019-03-31  0:30       ` Segher Boessenkool
  0 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2019-03-31  0:30 UTC (permalink / raw)
  To: George Spelvin
  Cc: linux-s390, linux-alpha, heiko.carstens, linuxppc-dev, linux-mips

On Sat, Mar 30, 2019 at 10:30:23AM +0000, George Spelvin wrote:
> For s390, that was added on 24 March 2017 by
> https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=246457
> which is part of GCC 7.
> 
> It also only applies to TARGET_ARCH12, which I am guessing
> corresponds to HAVE_MARCH_ZEC12_FEATURES.

zEC12 is arch10, while z14 is arch12.  See
https://sourceware.org/binutils/docs-2.32/as/s390-Options.html#s390-Options
for example; it lists the correspondences, and states "The processor names
starting with arch refer to the edition number in the Principle of
Operations manual.  They can be used as alternate processor names and have
been added for compatibility with the IBM XL compiler."

Newer GCC does not use the somewhat confusing TARGET_ARCH12 name anymore;
see https://gcc.gnu.org/r264796 .


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?
  2019-03-31  0:30       ` Segher Boessenkool
  (?)
@ 2019-03-31  9:23       ` George Spelvin
  -1 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2019-03-31  9:23 UTC (permalink / raw)
  To: linux-s390

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: multipart/mixed; boundary="--EVF5PPMfhYS0aIcm", Size: 868 bytes --]

The kbuild test robot sent me the following compile error report,
indicating that my version test needs further refinement.

With GCC 8.3, I get in-line code even for variable shifts:
mul_u64_u64_shr:
.LFB3:
	.cfi_startproc
	ldgr	%f0,%r11
	.cfi_register 11, 16
	mlgr	%r2,%r2
	ahik	%r11,%r4,-64
	srlg	%r0,%r2,0(%r11)
	sllg	%r1,%r2,1
	lhi	%r5,63
	sr	%r5,%r4
	sllg	%r1,%r1,0(%r5)
	srlg	%r4,%r3,0(%r4)
	ogr	%r4,%r1
	ltr	%r11,%r11
	lgr	%r2,%r4
	locgrhe	%r2,%r0
	lgdr	%r11,%f0
	.cfi_restore 11
	br	%r14
	.cfi_endproc

... but apparently gcc 7.2 needs helper functions.

Anyway, forwarded to
1) Invaidate my previous patch, and
2) Hand the ball to people who have a chance of
   doing anything about it.

I can try working solo to create an s390 kernel that compiles if
people really want, but I need assistance for boot-testing.

Is anyone interested in heping pursue this?

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-03-31  9:23 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-29 13:07 CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha? George Spelvin
2019-03-29 20:00 ` Michael Cree
2019-03-29 20:00   ` Michael Cree
2019-03-30 23:14   ` Segher Boessenkool
2019-03-29 20:25 ` Segher Boessenkool
2019-03-29 20:25   ` Segher Boessenkool
2019-03-30 11:28   ` George Spelvin
2019-03-30 11:28     ` George Spelvin
2019-03-30 23:52     ` Segher Boessenkool
2019-03-30 23:52       ` Segher Boessenkool
2019-03-30  8:43 ` Heiko Carstens
2019-03-30  8:43   ` Heiko Carstens
2019-03-30 10:30   ` George Spelvin
2019-03-30 10:30     ` George Spelvin
2019-03-30 13:00     ` George Spelvin
2019-03-30 13:00       ` George Spelvin
2019-03-31  0:30     ` Segher Boessenkool
2019-03-31  0:30       ` Segher Boessenkool
2019-03-31  9:23       ` George Spelvin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.