* [OpenRISC] GCC-optimizations/weirdness...
@ 2016-10-19 16:39 Jakob Viketoft
  2016-10-19 21:22 ` Richard Henderson
  0 siblings, 1 reply; 5+ messages in thread
From: Jakob Viketoft @ 2016-10-19 16:39 UTC (permalink / raw)
  To: openrisc

Hello any gcc-gurus out there!

I've hit a major performance obstacle with the 64-bit mul/div that is used by default in the RTEMS kernel. I believe these could be sped up considerably using some clever maths and the (possibly) defined 32-bit hardware instructions. However, I'm not quite sure how the translation in gcc is made. Looking into the or1k port of gcc, I have yet to see a way of defining 64-bit mul/div operations analogous to the 32-bit ones that are implemented. Please see my test code and a dump of its assembly below, which shows how these are implemented today.

I also consistently see unnecessary stack operations, which might be quite a performance bummer as well; please see the same code snippets below.

Any help or pointers are more than welcome!

     /Jakob

---------------------------8<--------------------------

uint64_t add64(uint64_t a, uint64_t b)
{
  return a + b;
}

uint64_t mul64(uint64_t a, uint64_t b)
{
  return a * b;
}

uint64_t div64(uint64_t a, uint64_t b)
{
  return a / b;
}

uint32_t add32(uint32_t a, uint32_t b)
{
  return a + b;
}

uint32_t mul32(uint32_t a, uint32_t b)
{
  return a * b;
}

uint32_t div32(uint32_t a, uint32_t b)
{
  return a / b;
}


00002df4 <add64>:
    2df4:       e1 84 30 00      l.add r12,r4,r6
    2df8:       d7 e1 0f fc      l.sw -4(r1),r1
    2dfc:       e4 8c 20 00      l.sfltu r12,r4
    2e00:       9c 21 ff fc      l.addi r1,r1,-4
    2e04:       10 00 00 03      l.bf 2e10 <add64+0x1c>
    2e08:       9c 80 00 01      l.addi r4,r0,1
    2e0c:       9c 80 00 00      l.addi r4,r0,0
    2e10:       e1 63 28 00      l.add r11,r3,r5
    2e14:       9c 21 00 04      l.addi r1,r1,4
    2e18:       e1 64 58 00      l.add r11,r4,r11
    2e1c:       44 00 48 00      l.jr r9
    2e20:       84 21 ff fc      l.lwz r1,-4(r1)

00002e24 <mul64>:
    2e24:       d7 e1 4f fc      l.sw -4(r1),r9
    2e28:       d7 e1 0f f8      l.sw -8(r1),r1
    2e2c:       04 00 03 96      l.jal 3c84 <__muldi3>
    2e30:       9c 21 ff f8      l.addi r1,r1,-8
    2e34:       9c 21 00 08      l.addi r1,r1,8
    2e38:       85 21 ff fc      l.lwz r9,-4(r1)
    2e3c:       44 00 48 00      l.jr r9
    2e40:       84 21 ff f8      l.lwz r1,-8(r1)

00002e44 <div64>:
    2e44:       d7 e1 4f fc      l.sw -4(r1),r9
    2e48:       d7 e1 0f f8      l.sw -8(r1),r1
    2e4c:       04 00 03 ad      l.jal 3d00 <__udivdi3>
    2e50:       9c 21 ff f8      l.addi r1,r1,-8
    2e54:       9c 21 00 08      l.addi r1,r1,8
    2e58:       85 21 ff fc      l.lwz r9,-4(r1)
    2e5c:       44 00 48 00      l.jr r9
    2e60:       84 21 ff f8      l.lwz r1,-8(r1)

00002e64 <add32>:
    2e64:       d7 e1 0f fc      l.sw -4(r1),r1
    2e68:       9c 21 ff fc      l.addi r1,r1,-4
    2e6c:       e1 63 20 00      l.add r11,r3,r4
    2e70:       9c 21 00 04      l.addi r1,r1,4
    2e74:       44 00 48 00      l.jr r9
    2e78:       84 21 ff fc      l.lwz r1,-4(r1)

00002e7c <mul32>:
    2e7c:       d7 e1 0f fc      l.sw -4(r1),r1
    2e80:       9c 21 ff fc      l.addi r1,r1,-4
    2e84:       e1 63 23 06      l.mul r11,r3,r4
    2e88:       9c 21 00 04      l.addi r1,r1,4
    2e8c:       44 00 48 00      l.jr r9
    2e90:       84 21 ff fc      l.lwz r1,-4(r1)

00002e94 <div32>:
    2e94:       d7 e1 0f fc      l.sw -4(r1),r1
    2e98:       9c 21 ff fc      l.addi r1,r1,-4
    2e9c:       e1 63 23 0a      l.divu r11,r3,r4
    2ea0:       9c 21 00 04      l.addi r1,r1,4
    2ea4:       44 00 48 00      l.jr r9
    2ea8:       84 21 ff fc      l.lwz r1,-4(r1)



Jakob Viketoft
Senior Engineer in RTL and embedded software

ÅAC Microtec AB
Dag Hammarskjölds väg 48
SE-751 83 Uppsala, Sweden

T: +46 702 80 95 97
http://www.aacmicrotec.com


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [OpenRISC] GCC-optimizations/weirdness...
  2016-10-19 16:39 [OpenRISC] GCC-optimizations/weirdness Jakob Viketoft
@ 2016-10-19 21:22 ` Richard Henderson
  2016-10-20  7:35   ` Jakob Viketoft
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Henderson @ 2016-10-19 21:22 UTC (permalink / raw)
  To: openrisc

On 10/19/2016 09:39 AM, Jakob Viketoft wrote:
> Hello any gcc-gurus out there!

Please have a look at

   git://github.com/rth7680/gcc.git or1k-6
   git://github.com/rth7680/binutils-gdb.git or1k-0

I did some work on cleaning up the or1k port a year or two ago.  The branch is 
based on (probably pre-release) gcc 6.

In addition to basic cleanups, it also contains support for my l.adrp proposal 
from 2014.  It's enabled by default for ease of testing, but easily disabled 
from the command line (-mno-adrp), or by reverting the appropriately named 
"hack: Enable ADDRP by default" patch.

> I've hit a major performance obstacle on 64-bit mul/div which is used by
> default in the RTEMS kernel. I feel this should be able to speed up
> considerably using some clever maths and the (possibly) defined hardware
> instructions for 32 bits.

It wouldn't be difficult to add support to gcc for the madd extension.  But 
without that, the clever arithmetic is all buried inside __muldi3 (and is too 
large to replicate inline).
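
For reference, the kind of decomposition a generic __muldi3 performs can be sketched in C. This is an illustrative sketch, not libgcc's actual source; it uses 32x32->64 host multiplies for clarity, whereas on or1k (where l.mul yields only 32 bits) the same split has to be applied once more at the 16-bit level:

```c
#include <stdint.h>

/* Illustrative sketch of a 64x64 -> 64 multiply built from narrower
   multiplies (not libgcc's actual __muldi3 source).  Split each
   operand into 32-bit halves; the high*high product only affects
   bits >= 64 and is dropped entirely. */
static uint64_t mul64_sketch(uint64_t a, uint64_t b)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);

    uint64_t lo = (uint64_t)al * bl;   /* full low product          */
    uint64_t c1 = (uint64_t)al * bh;   /* cross terms: only their   */
    uint64_t c2 = (uint64_t)ah * bl;   /* low 32 bits survive the   */
                                       /* shift into the high word  */
    return lo + ((c1 + c2) << 32);
}
```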

There is no proposed extension that would help with 64-bit division.  So that 
too is buried in __udivdi3.
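
Without a widening-divide instruction, such a helper has to fall back on something like bitwise restoring division. A minimal sketch (illustrative only, not libgcc's actual algorithm, and assuming a nonzero divisor):

```c
#include <stdint.h>

/* Bitwise restoring division: one quotient bit per iteration.
   Illustrative only; assumes d != 0. */
static uint64_t udiv64_sketch(uint64_t n, uint64_t d)
{
    uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);   /* bring down next dividend bit */
        if (r >= d) {
            r -= d;                      /* subtraction fits: quotient bit 1 */
            q |= 1ULL << i;
        }
    }
    return q;
}
```

This is why 64-bit division stays expensive regardless of inlining: the loop runs 64 iterations either way.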

> I also consistently see unnecessary stack operations which might be quite a
> bummer in terms of performance as well, please see the same code snippets
> below.

This is definitely cleaned up in my branch.
Output of my gcc at -O2 is intermixed below.

> 00002df4 <add64>:
>     2df4:       e1 84 30 00      l.add r12,r4,r6
>     2df8:       d7 e1 0f fc      l.sw -4(r1),r1
>     2dfc:       e4 8c 20 00      l.sfltu r12,r4
>     2e00:       9c 21 ff fc      l.addi r1,r1,-4
>     2e04:       10 00 00 03      l.bf 2e10 <add64+0x1c>
>     2e08:       9c 80 00 01      l.addi r4,r0,1
>     2e0c:       9c 80 00 00      l.addi r4,r0,0
>     2e10:       e1 63 28 00      l.add r11,r3,r5
>     2e14:       9c 21 00 04      l.addi r1,r1,4
>     2e18:       e1 64 58 00      l.add r11,r4,r11
>     2e1c:       44 00 48 00      l.jr r9
>     2e20:       84 21 ff fc      l.lwz r1,-4(r1)

00000000 <add64>:
    0:	e1 84 30 00 	l.add r12,r4,r6
    4:	44 00 48 00 	l.jr r9
    8:	e1 63 28 01 	l.addc r11,r3,r5
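
The l.add/l.addc pair is the whole trick: add the low words, then add the high words plus the carry out of the first add. In C terms (a sketch; the helper name is mine, not gcc's):

```c
#include <stdint.h>

/* 64-bit add from 32-bit halves, mirroring l.add + l.addc:
   the carry out of the low-word add feeds the high-word add. */
static void add64_parts(uint32_t *hi, uint32_t *lo,
                        uint32_t ah, uint32_t al,
                        uint32_t bh, uint32_t bl)
{
    uint32_t l = al + bl;               /* l.add                   */
    uint32_t h = ah + bh + (l < al);    /* l.addc: carry folded in */
    *hi = h;
    *lo = l;
}
```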

> 00002e24 <mul64>:
>     2e24:       d7 e1 4f fc      l.sw -4(r1),r9
>     2e28:       d7 e1 0f f8      l.sw -8(r1),r1
>     2e2c:       04 00 03 96      l.jal 3c84 <__muldi3>
>     2e30:       9c 21 ff f8      l.addi r1,r1,-8
>     2e34:       9c 21 00 08      l.addi r1,r1,8
>     2e38:       85 21 ff fc      l.lwz r9,-4(r1)
>     2e3c:       44 00 48 00      l.jr r9
>     2e40:       84 21 ff f8      l.lwz r1,-8(r1)

0000000c <mul64>:
    c:	d7 e1 4f fc 	l.sw -4(r1),r9
   10:	04 00 00 00 	l.jal 10 <mul64+0x4>
   14:	9c 21 ff fc 	l.addi r1,r1,-4
   18:	9c 21 00 04 	l.addi r1,r1,4
   1c:	85 21 ff fc 	l.lwz r9,-4(r1)
   20:	44 00 48 00 	l.jr r9
   24:	15 00 00 00 	l.nop 0x0

> 00002e44 <div64>:
>     2e44:       d7 e1 4f fc      l.sw -4(r1),r9
>     2e48:       d7 e1 0f f8      l.sw -8(r1),r1
>     2e4c:       04 00 03 ad      l.jal 3d00 <__udivdi3>
>     2e50:       9c 21 ff f8      l.addi r1,r1,-8
>     2e54:       9c 21 00 08      l.addi r1,r1,8
>     2e58:       85 21 ff fc      l.lwz r9,-4(r1)
>     2e5c:       44 00 48 00      l.jr r9
>     2e60:       84 21 ff f8      l.lwz r1,-8(r1)

00000028 <div64>:
   28:	d7 e1 4f fc 	l.sw -4(r1),r9
   2c:	04 00 00 00 	l.jal 2c <div64+0x4>
   30:	9c 21 ff fc 	l.addi r1,r1,-4
   34:	9c 21 00 04 	l.addi r1,r1,4
   38:	85 21 ff fc 	l.lwz r9,-4(r1)
   3c:	44 00 48 00 	l.jr r9
   40:	15 00 00 00 	l.nop 0x0

> 00002e64 <add32>:
>     2e64:       d7 e1 0f fc      l.sw -4(r1),r1
>     2e68:       9c 21 ff fc      l.addi r1,r1,-4
>     2e6c:       e1 63 20 00      l.add r11,r3,r4
>     2e70:       9c 21 00 04      l.addi r1,r1,4
>     2e74:       44 00 48 00      l.jr r9
>     2e78:       84 21 ff fc      l.lwz r1,-4(r1)

00000044 <add32>:
   44:	44 00 48 00 	l.jr r9
   48:	e1 63 20 00 	l.add r11,r3,r4

> 00002e7c <mul32>:
>     2e7c:       d7 e1 0f fc      l.sw -4(r1),r1
>     2e80:       9c 21 ff fc      l.addi r1,r1,-4
>     2e84:       e1 63 23 06      l.mul r11,r3,r4
>     2e88:       9c 21 00 04      l.addi r1,r1,4
>     2e8c:       44 00 48 00      l.jr r9
>     2e90:       84 21 ff fc      l.lwz r1,-4(r1)

0000004c <mul32>:
   4c:	44 00 48 00 	l.jr r9
   50:	e1 63 23 06 	l.mul r11,r3,r4

> 00002e94 <div32>:
>     2e94:       d7 e1 0f fc      l.sw -4(r1),r1
>     2e98:       9c 21 ff fc      l.addi r1,r1,-4
>     2e9c:       e1 63 23 0a      l.divu r11,r3,r4
>     2ea0:       9c 21 00 04      l.addi r1,r1,4
>     2ea4:       44 00 48 00      l.jr r9
>     2ea8:       84 21 ff fc      l.lwz r1,-4(r1)

00000054 <div32>:
   54:	44 00 48 00 	l.jr r9
   58:	e1 63 23 0a 	l.divu r11,r3,r4



r~




* [OpenRISC] GCC-optimizations/weirdness...
  2016-10-19 21:22 ` Richard Henderson
@ 2016-10-20  7:35   ` Jakob Viketoft
  2016-10-20 15:48     ` Richard Henderson
  0 siblings, 1 reply; 5+ messages in thread
From: Jakob Viketoft @ 2016-10-20  7:35 UTC (permalink / raw)
  To: openrisc

From: Richard Henderson [rth7680 at gmail.com] on behalf of Richard Henderson [rth at twiddle.net]
Sent: Wednesday, October 19, 2016 23:22
To: Jakob Viketoft; openrisc at lists.librecores.org
Subject: Re: [OpenRISC] GCC-optimizations/weirdness...

> Please have a look at

>   git://github.com/rth7680/gcc.git or1k-6
>   git://github.com/rth7680/binutils-gdb.git or1k-0

> I did some work on cleaning up the or1k port a year or two ago.  The branch is
> based on (probably pre-release) gcc 6.

Thank you, this sounds very interesting, I'll definitely take a look.

> > I've hit a major performance obstacle on 64-bit mul/div which is used by
> > default in the RTEMS kernel. I feel this should be able to speed up
> > considerably using some clever maths and the (possibly) defined hardware
> > instructions for 32 bits.

> It wouldn't be difficult to add support to gcc for the madd extension.  But
> without that, the clever arithmetic is all buried inside __muldi3 (and is too
> large to replicate inline).

> There is no proposed extension that would help with 64-bit division.  So that
> too is buried in __udivdi3.

What I meant was to write clever arithmetic that replaces __muldi3/__udivdi3, or at least improves them, using the available 32-bit hardware instructions; i.e., replacing a given operation with a given set of assembly instructions, just as add64 does, rather than adding more custom instructions. The __muldi3 is quite close, but no cigar in terms of optimality for this CPU. I don't necessarily intend to have it inline; it can still be optimized even as a separate call.

> > I also consistently see unnecessary stack operations which might be quite a
> > bummer in terms of performance as well, please see the same code snippets
> > below.

> This is definitely cleaned up in my branch.
> Output of my gc -O2 intermixed below.

That sounds great as well. Will definitely take a look into using this instead.

> 00000000 <add64>:
>    0:  e1 84 30 00     l.add r12,r4,r6
>    4:  44 00 48 00     l.jr r9
>    8:  e1 63 28 01     l.addc r11,r3,r5

This looks really good.

> 0000000c <mul64>:
>    c:  d7 e1 4f fc     l.sw -4(r1),r9
>   10:  04 00 00 00     l.jal 10 <mul64+0x4>
>   14:  9c 21 ff fc     l.addi r1,r1,-4
>   18:  9c 21 00 04     l.addi r1,r1,4
>   1c:  85 21 ff fc     l.lwz r9,-4(r1)
>   20:  44 00 48 00     l.jr r9
>   24:  15 00 00 00     l.nop 0x0

I assume it's a simple linker mistake setting the l.jal to mul64, right? I assume you still call __muldi3?

> 00000028 <div64>:
>   28:  d7 e1 4f fc     l.sw -4(r1),r9
>   2c:  04 00 00 00     l.jal 2c <div64+0x4>
>   30:  9c 21 ff fc     l.addi r1,r1,-4
>   34:  9c 21 00 04     l.addi r1,r1,4
>   38:  85 21 ff fc     l.lwz r9,-4(r1)
>   3c:  44 00 48 00     l.jr r9
>   40:  15 00 00 00     l.nop 0x0

Ditto with the linker, I assume?

> 00000044 <add32>:
>   44:  44 00 48 00     l.jr r9
>   48:  e1 63 20 00     l.add r11,r3,r4

> 0000004c <mul32>:
>   4c:  44 00 48 00     l.jr r9
>   50:  e1 63 23 06     l.mul r11,r3,r4

> 00000054 <div32>:
>   54:  44 00 48 00     l.jr r9
>   58:  e1 63 23 0a     l.divu r11,r3,r4

Yes, this is exactly what I wanted! Btw, any guess as to why it's emitting l.mul (and not l.mulu) on unsigneds?

      /Jakob


* [OpenRISC] GCC-optimizations/weirdness...
  2016-10-20  7:35   ` Jakob Viketoft
@ 2016-10-20 15:48     ` Richard Henderson
  2016-10-20 18:34       ` Richard Henderson
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Henderson @ 2016-10-20 15:48 UTC (permalink / raw)
  To: openrisc

On 10/20/2016 12:35 AM, Jakob Viketoft wrote:
>> There is no proposed extension that would help with 64-bit division.  So that
>> too is buried in __udivdi3.
>
> What I meant was to make clever arithmetic that replaces the
> __muldi3/__udivdi3 or at least improves it using the available 32-bit
> hardware instructions. I.e. how to replace a given operation with a given
> set of assembler instructions, just as the add64 does, not adding more
> custom instructions. The __muldi3 is quite close, but no cigar in terms of
> optimality for this CPU. I don't necessarily intend to have it inline, but
> it still can be optimized even if it's a separate call.

Having a closer look at __muldi3, I see that we could in fact use carry 
arithmetic to reduce its instruction count by 2 (if cmov is enabled).  I guess 
if cmov hadn't been enabled, the intermediate branch would make things much worse.

This gets me down to

00000000 <__muldi3>:
    0:   ba 64 00 50     l.srli r19,r4,0x10
    4:   b9 66 00 50     l.srli r11,r6,0x10
    8:   a5 84 ff ff     l.andi r12,r4,0xffff
    c:   a6 e6 ff ff     l.andi r23,r6,0xffff
   10:   e2 2c 5b 06     l.mul r17,r12,r11
   14:   e2 b3 bb 06     l.mul r21,r19,r23
   18:   e1 73 5b 06     l.mul r11,r19,r11
   1c:   e2 6c bb 06     l.mul r19,r12,r23
   20:   e0 84 2b 06     l.mul r4,r4,r5
   24:   e0 c6 1b 06     l.mul r6,r6,r3
   28:   19 80 ff ff     l.movhi r12,0xffff
   2c:   ba f1 00 50     l.srli r23,r17,0x10
   30:   e2 31 60 03     l.and r17,r17,r12
   34:   e1 95 60 03     l.and r12,r21,r12
   38:   b8 b5 00 50     l.srli r5,r21,0x10
   3c:   e1 8c 88 00     l.add r12,r12,r17
   40:   e2 37 28 01     l.addc r17,r23,r5
   44:   e1 8c 98 00     l.add r12,r12,r19
   48:   e2 71 58 01     l.addc r19,r17,r11
   4c:   e0 84 98 00     l.add r4,r4,r19
   50:   44 00 48 00     l.jr r9
   54:   e1 64 30 00     l.add r11,r4,r6

which is, I believe, optimal.

>> 0000000c <mul64>:
>>    c:  d7 e1 4f fc     l.sw -4(r1),r9
>>   10:  04 00 00 00     l.jal 10 <mul64+0x4>
>>   14:  9c 21 ff fc     l.addi r1,r1,-4
>>   18:  9c 21 00 04     l.addi r1,r1,4
>>   1c:  85 21 ff fc     l.lwz r9,-4(r1)
>>   20:  44 00 48 00     l.jr r9
>>   24:  15 00 00 00     l.nop 0x0
>
> I assume it's a simple linker mistake setting the l.jal to mul64, right? I assume you still call __muldi3?

This is objdump output of a .o file, which doesn't show the relocations.  So, 
yes, the final linked executable would call __muldi3.

> Btw, any guess to why it's making l.mul (and not l.mulu) on unsigneds?

Because l.mul and l.mulu are (when you don't care about the overflow/carry 
bits) indistinguishable.  GCC itself doesn't retain the signedness of the 
operation throughout optimization.
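
A quick way to convince yourself: in two's complement, the truncated low word of a product is the same whichever signedness you assume, so when only the low 32 bits are kept the two instructions are interchangeable. A small demonstration (helper names are mine):

```c
#include <stdint.h>

/* The low 32 bits of a product are signedness-independent in two's
   complement: same bit patterns in, same bit pattern out. */
static uint32_t low_mul_u(uint32_t a, uint32_t b)
{
    return a * b;                            /* l.mulu-style */
}

static uint32_t low_mul_s(int32_t a, int32_t b)
{
    /* Compute the full signed product, then truncate; the widening
       avoids signed-overflow UB in the C model. */
    return (uint32_t)((int64_t)a * (int64_t)b);  /* l.mul-style */
}
```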


r~



* [OpenRISC] GCC-optimizations/weirdness...
  2016-10-20 15:48     ` Richard Henderson
@ 2016-10-20 18:34       ` Richard Henderson
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Henderson @ 2016-10-20 18:34 UTC (permalink / raw)
  To: openrisc

On 10/20/2016 08:48 AM, Richard Henderson wrote:
> 00000000 <__muldi3>:
>    0:   ba 64 00 50     l.srli r19,r4,0x10
>    4:   b9 66 00 50     l.srli r11,r6,0x10
>    8:   a5 84 ff ff     l.andi r12,r4,0xffff
>    c:   a6 e6 ff ff     l.andi r23,r6,0xffff
>   10:   e2 2c 5b 06     l.mul r17,r12,r11
>   14:   e2 b3 bb 06     l.mul r21,r19,r23
>   18:   e1 73 5b 06     l.mul r11,r19,r11
>   1c:   e2 6c bb 06     l.mul r19,r12,r23
>   20:   e0 84 2b 06     l.mul r4,r4,r5
>   24:   e0 c6 1b 06     l.mul r6,r6,r3
>   28:   19 80 ff ff     l.movhi r12,0xffff
>   2c:   ba f1 00 50     l.srli r23,r17,0x10
>   30:   e2 31 60 03     l.and r17,r17,r12
>   34:   e1 95 60 03     l.and r12,r21,r12
>   38:   b8 b5 00 50     l.srli r5,r21,0x10
>   3c:   e1 8c 88 00     l.add r12,r12,r17
>   40:   e2 37 28 01     l.addc r17,r23,r5
>   44:   e1 8c 98 00     l.add r12,r12,r19
>   48:   e2 71 58 01     l.addc r19,r17,r11
>   4c:   e0 84 98 00     l.add r4,r4,r19
>   50:   44 00 48 00     l.jr r9
>   54:   e1 64 30 00     l.add r11,r4,r6

Bah.  Silly error on my part -- shifts not ands.  But anyway,


r~



diff --git a/include/longlong.h b/include/longlong.h
index 2841d0f..dabcb75 100644
--- a/include/longlong.h
+++ b/include/longlong.h
@@ -909,6 +909,33 @@ extern UDItype __umulsidi3 (USItype, USItype);
      UDItype __s = __a - __b;                                   \
      (sl) = (USItype)__s; (sh) = __s >> 32;                     \
    } while (0)
+/* Unlike the generic version below, make use of carry arithmetic
+   to fold the intermediate multiplies.  */
+#define umul_ppmm(w1, w0, u, v)                                        \
+  do {                                                                 \
+    UWtype __x0, __x1, __x2, __x3, __x1h, __x1l, __x2h, __x2l;         \
+    UHWtype __ul, __vl, __uh, __vh;                                    \
+                                                                       \
+    __ul = __ll_lowpart (u);                                           \
+    __uh = __ll_highpart (u);                                          \
+    __vl = __ll_lowpart (v);                                           \
+    __vh = __ll_highpart (v);                                          \
+                                                                       \
+    __x0 = (UWtype) __ul * __vl;                                       \
+    __x1 = (UWtype) __ul * __vh;                                       \
+    __x2 = (UWtype) __uh * __vl;                                       \
+    __x3 = (UWtype) __uh * __vh;                                       \
+                                                                       \
+    __x1l = __x1 << (W_TYPE_SIZE / 2);                                 \
+    __x2l = __x2 << (W_TYPE_SIZE / 2);                                 \
+    __x1h = __ll_highpart (__x1);                                      \
+    __x2h = __ll_highpart (__x2);                                      \
+                                                                       \
+    add_ssaaaa(__x3, __x0, __x3, __x0, __x1h, __x1l);                  \
+    add_ssaaaa(__x3, __x0, __x3, __x0, __x2h, __x2l);                  \
+    (w1) = __x3;                                                       \
+    (w0) = __x0;                                                       \
+  } while (0)
  #endif /* __OR1K__ */

  /* FIXME: We should test _IBMR2 here when we add assembly support for the
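
For readers without longlong.h handy, the patched macro above can be rendered as a plain C function; here with a 32-bit word and 16-bit halves as on or1k, and with the two add_ssaaaa double-word adds open-coded (the function name is mine, not libgcc's):

```c
#include <stdint.h>

/* C rendering of the carry-folding umul_ppmm from the patch:
   32x32 -> 64 multiply via four 16x16 products, folding each
   cross term into (w1, w0) with an add-with-carry. */
static void umul_ppmm32(uint32_t *w1, uint32_t *w0, uint32_t u, uint32_t v)
{
    uint32_t ul = u & 0xffff, uh = u >> 16;
    uint32_t vl = v & 0xffff, vh = v >> 16;

    uint32_t x0 = ul * vl;               /* low  x low   */
    uint32_t x1 = ul * vh;               /* cross terms  */
    uint32_t x2 = uh * vl;
    uint32_t x3 = uh * vh;               /* high x high  */

    uint32_t x1l = x1 << 16, x1h = x1 >> 16;
    uint32_t x2l = x2 << 16, x2h = x2 >> 16;

    /* Two add_ssaaaa equivalents: the carry out of the low word
       propagates into the high word. */
    x0 += x1l;  x3 += x1h + (x0 < x1l);
    x0 += x2l;  x3 += x2h + (x0 < x2l);

    *w1 = x3;
    *w0 = x0;
}
```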



