* z constraint in powerpc inline assembly ?
@ 2020-01-16  6:11 Christophe Leroy
  2020-01-16  8:06 ` Gabriel Paubert
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Christophe Leroy @ 2020-01-16  6:11 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: linuxppc-dev

Hi Segher,

I'm trying to see if we could enhance TCP checksum calculations by 
splitting inline assembly blocks to give GCC the opportunity to mix it 
with other stuff, but I'm getting difficulties with the carry.

As far as I can read in the documentation, the z constraint represents 
'‘XER[CA]’ carry bit (part of the XER register)'

I've tried the following, but I get errors. Can you help ?

unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long sum;
	unsigned long carry;

	asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
	asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
	asm("addze %0, %0" : "+r"(sum) : "z"(carry));

	return sum;
}



csum.c: In function 'cksum':
csum.c:6:2: error: inconsistent operand constraints in an 'asm'
   asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
   ^
csum.c:7:2: error: inconsistent operand constraints in an 'asm'
   asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
   ^
csum.c:8:2: error: inconsistent operand constraints in an 'asm'
   asm("addze %0, %0" : "+r"(sum) : "z"(carry));
   ^

Thanks
Christophe



* Re: z constraint in powerpc inline assembly ?
  2020-01-16  6:11 z constraint in powerpc inline assembly ? Christophe Leroy
@ 2020-01-16  8:06 ` Gabriel Paubert
  2020-01-16 13:57   ` Segher Boessenkool
  2020-01-16 14:01 ` Segher Boessenkool
  2020-01-16 15:54 ` David Laight
  2 siblings, 1 reply; 10+ messages in thread
From: Gabriel Paubert @ 2020-01-16  8:06 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linuxppc-dev

On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> Hi Segher,
> 
> I'm trying to see if we could enhance TCP checksum calculations by splitting
> inline assembly blocks to give GCC the opportunity to mix it with other
> stuff, but I'm getting difficulties with the carry.
> 
> As far as I can read in the documentation, the z constraint represents
> '‘XER[CA]’ carry bit (part of the XER register)'

Well, the documentation is very optimistic. From the GCC source code
(thanks for switching to git last weekend ;-)), it is clear that the
carry is not, for the time being, properly modeled.

Right now, in the machine description, all setters and users of the carry
are in the same block of generated instructions.

For a start, all single-instruction patterns that set the carry (and
do not use it) as a side effect should mention that they clobber the
carry, otherwise inserting one between a setter and a user of the carry
would break. This includes all arithmetic right shifts (sra[wd]{,i}),
subfic, addic{,\.}, and I may have forgotten some.

If you want to future-proof your code just in case, you should also add
an "xer" clobber to all instruction sequences that may modify the carry
bit. But any inline assembly that touches XER might break if GCC is
upgraded to properly model the carry bit, and a lot of code might need to
be audited.
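
A minimal sketch of the "xer" clobber advice, assuming the whole carry
chain is kept inside one asm statement so that nothing the compiler
generates can land between the instruction that sets CA and the ones that
read it. The name cksum3 and the portable fallback branch are illustrative
assumptions, not code from this thread.

```c
/* Hypothetical helper: ones' complement sum of three words. */
static inline unsigned long cksum3(unsigned long a, unsigned long b,
				   unsigned long c)
{
#if defined(__powerpc__)
	unsigned long sum;

	/*
	 * addc sets CA; adde and addze consume it.  All three stay in one
	 * asm statement.  "=&r" keeps sum out of the input registers,
	 * because sum is written before c is read.
	 */
	__asm__("addc %0, %1, %2\n\t"
		"adde %0, %0, %3\n\t"
		"addze %0, %0"
		: "=&r"(sum)
		: "r"(a), "r"(b), "r"(c)
		: "xer");
	return sum;
#else
	/* Portable fallback so the sketch also builds on other targets. */
	unsigned long sum = a + b;

	sum += (sum < a);	/* end-around carry of a + b */
	sum += c;
	sum += (sum < c);	/* end-around carry of the second add */
	return sum;
#endif
}
```

The carry never leaves the asm statement here, so no operand has to
describe it; the clobber only tells the compiler that XER is modified.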

	Gabriel

> 
> I've tried the following, but I get errors. Can you help ?
> 
> unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
> {
> 	unsigned long sum;
> 	unsigned long carry;
> 
> 	asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
> 	asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
> 	asm("addze %0, %0" : "+r"(sum) : "z"(carry));
> 
> 	return sum;
> }
> 
> 
> 
> csum.c: In function 'cksum':
> csum.c:6:2: error: inconsistent operand constraints in an 'asm'
>   asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
>   ^
> csum.c:7:2: error: inconsistent operand constraints in an 'asm'
>   asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
>   ^
> csum.c:8:2: error: inconsistent operand constraints in an 'asm'
>   asm("addze %0, %0" : "+r"(sum) : "z"(carry));
>   ^
> 
> Thanks
> Christophe
> 


* Re: z constraint in powerpc inline assembly ?
  2020-01-16  8:06 ` Gabriel Paubert
@ 2020-01-16 13:57   ` Segher Boessenkool
  2020-01-16 17:42     ` [PosibleSpam] " Gabriel Paubert
  0 siblings, 1 reply; 10+ messages in thread
From: Segher Boessenkool @ 2020-01-16 13:57 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linuxppc-dev

On Thu, Jan 16, 2020 at 09:06:08AM +0100, Gabriel Paubert wrote:
> On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> > Hi Segher,
> > 
> > I'm trying to see if we could enhance TCP checksum calculations by splitting
> > inline assembly blocks to give GCC the opportunity to mix it with other
> > stuff, but I'm getting difficulties with the carry.
> > 
> > As far as I can read in the documentation, the z constraint represents
> > '‘XER[CA]’ carry bit (part of the XER register)'
> 
> Well, the documentation is very optimistic. From the GCC source code
> (thanks for switching to git last weekend ;-)), it is clear that the
> carry is not, for the time being, properly modeled.

What?  It certainly *is*, I spent ages on that back in 2014 and before.
See gcc.gnu.org/PR64180 etc.

You can not put the carry as input or output to an asm, of course: no C
variable can be assigned to it.

We don't do the "flag outputs" thing, either, as it is largely useless
for Power (and using it would often make *worse* code).

If you want to access a carry, write C code that does that operation.
The compiler knows how to optimise it well.
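
As a hedged illustration of "write C code that does that operation", the
sketch below adds two words and folds the carry back in, using the
GCC/Clang builtin __builtin_add_overflow to obtain the carry out. The
helper name csum_add_c is an assumption made for this example, and which
instructions the compiler picks for it depends on the target and the
compiler version.

```c
#include <stdbool.h>

/* Hypothetical helper: ones' complement add of two words in plain C. */
static inline unsigned long csum_add_c(unsigned long csum,
				       unsigned long addend)
{
	unsigned long res;
	/* For unsigned operands, "overflow" is exactly the carry out. */
	bool carry = __builtin_add_overflow(csum, addend, &res);

	/* End-around carry: cannot wrap again, res <= ULONG_MAX - 1 here. */
	return res + carry;
}
```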

> Right now, in the machine description, all setters and users of the carry
> are in the same block of generated instructions.

No, they are not.  For over five years now.  (Since GCC 5).

> For a start, all single-instruction patterns that set the carry (and
> do not use it) as a side effect should mention that they clobber the
> carry, otherwise inserting one between a setter and a user of the carry
> would break.

And they do.

All asms that change the carry should mention that, too, but this is
automatically done for all inline asms, because there was a lot of code
in the wild that does not clobber it.

> This includes all arithmetic right shifts (sra[wd]{,i}),
> subfic, addic{,\.}, and I may have forgotten some.

{add,subf}{ic,c,e,ze,me} and sra[wd][i] and their dots.  Sure.  And
mcrxr and mcrxrx and mfxer and mtxer.  That's about it.

We don't model the second carry at all yet btw, in GCC.  Not too many
people know it exists even, so no big loss there.

(One nasty was that addi. does not exist, so we used addic. where it was
wanted before, so that had to change.)


Segher


* Re: z constraint in powerpc inline assembly ?
  2020-01-16  6:11 z constraint in powerpc inline assembly ? Christophe Leroy
  2020-01-16  8:06 ` Gabriel Paubert
@ 2020-01-16 14:01 ` Segher Boessenkool
  2020-01-16 15:54 ` David Laight
  2 siblings, 0 replies; 10+ messages in thread
From: Segher Boessenkool @ 2020-01-16 14:01 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linuxppc-dev

Hi!

On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> I'm trying to see if we could enhance TCP checksum calculations by 
> splitting inline assembly blocks to give GCC the opportunity to mix it 
> with other stuff, but I'm getting difficulties with the carry.
> 
> As far as I can read in the documentation, the z constraint represents 
> '‘XER[CA]’ carry bit (part of the XER register)'
> 
> I've tried the following, but I get errors. Can you help ?
> 
> unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
> {
> 	unsigned long sum;
> 	unsigned long carry;
> 
> 	asm("addc %0, %2, %3" : "=r"(sum), "=z"(carry) : "r"(a), "r"(b));
> 	asm("adde %0, %0, %2" : "+r"(sum), "+z"(carry) : "r"(c));
> 	asm("addze %0, %0" : "+r"(sum) : "z"(carry));
> 
> 	return sum;
> }

The only register allowed by "z" is a fixed register.  You cannot use "z"
in inline asm.

Just write this as C?  It should do a reasonable job of it.  If you want
*good* code, you need to write it in *actual* assembler code, anyway (hand
scheduled and everything).
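
For reference, a plain C sketch of the quoted cksum() that mirrors the
three instructions one by one, assuming nothing beyond standard unsigned
arithmetic. The temporaries tmp and carry are names introduced here; the
carry is recovered by comparing each result with one of its operands.

```c
unsigned long cksum(unsigned long a, unsigned long b, unsigned long c)
{
	unsigned long sum, tmp, carry;

	/* addc: sum = a + b, remember the carry out of the top bit */
	sum = a + b;
	carry = sum < a;

	/* adde: sum = sum + c + carry, with a new carry out */
	tmp = sum + c;
	sum = tmp + carry;
	/* At most one of the two partial additions can wrap. */
	carry = (tmp < c) | (sum < tmp);

	/* addze: fold the last carry back in */
	return sum + carry;
}
```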


Segher


* RE: z constraint in powerpc inline assembly ?
  2020-01-16  6:11 z constraint in powerpc inline assembly ? Christophe Leroy
  2020-01-16  8:06 ` Gabriel Paubert
  2020-01-16 14:01 ` Segher Boessenkool
@ 2020-01-16 15:54 ` David Laight
  2020-01-16 16:21   ` Segher Boessenkool
  2 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2020-01-16 15:54 UTC (permalink / raw)
  To: 'Christophe Leroy', Segher Boessenkool; +Cc: linuxppc-dev

From: Christophe Leroy
> Sent: 16 January 2020 06:12
> 
> I'm trying to see if we could enhance TCP checksum calculations by
> splitting inline assembly blocks to give GCC the opportunity to mix it
> with other stuff, but I'm getting difficulties with the carry.

If you are trying to 'loop carry' the 'carry flag' with 'add with carry'
instructions you'll almost certainly need to write the loop in asm.
Since the loop itself is simple, this probably doesn't matter.

However a loop of 'add with carry' instructions may not be the
fastest code by any means.
Because the carry flag is needed for every 'adc' you can't do more
than one adc per clock.
This limits you to 8 bytes/clock on a 64bit system - even one
that can schedule multiple memory reads and lots of instructions
every clock.

I don't know ppc, but on x86 you don't even get 1 adc per clock
until very recent (Haswell I think) cpus.
Sandy/Ivy bridge will do so if you add to alternate registers.

For earlier CPUs it is actually difficult to beat the 4 bytes/clock
you get by adding 32bit values to a 64bit register in C code.
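
A minimal sketch of that approach, assuming a buffer of whole 32-bit
words and fewer than 2^32 of them so the 64-bit accumulator cannot
overflow. The function name csum_fold32 and the final fold to 16 bits are
illustrative choices, not taken from any existing implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 16-bit ones' complement sum of n 32-bit words. */
uint16_t csum_fold32(const uint32_t *p, size_t n)
{
	uint64_t acc = 0;
	size_t i;

	/* Plain adds: the carries pile up in the upper 32 bits of acc. */
	for (i = 0; i < n; i++)
		acc += p[i];

	/* Fold 64 -> 32 bits; the second pass absorbs the fold's own carry. */
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);

	/* Fold 32 -> 16 bits the same way. */
	acc = (acc & 0xffffu) + (acc >> 16);
	acc = (acc & 0xffffu) + (acc >> 16);

	return (uint16_t)acc;
}
```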

One possibility is to do a normal add then shift the carry
into a separate register.
After 64 words use 'popcnt' to sum the carry bits.
With 2 accumulators (and carry shifts) you'd need to
break the loop every 1024 bytes.
This should beat 8 bytes/clock if you can execute more than
1 memory read, one add and one shift each clock.
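
The idea can be sketched in C as follows, assuming a single accumulator
(so the carry bits are flushed every 64 words, i.e. every 512 bytes) and
the GCC/Clang builtin __builtin_popcountll. The function name
csum64_popcnt is made up for this example, and the result is the 64-bit
ones' complement sum, not yet folded to 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit ones' complement sum with deferred carries. */
uint64_t csum64_popcnt(const uint64_t *p, size_t n)
{
	uint64_t sum = 0;	/* low 64 bits of the running sum */
	uint64_t ncarry = 0;	/* number of carries out of bit 63 so far */
	size_t i = 0;

	while (i < n) {
		uint64_t cbits = 0;	/* one carry bit per word, up to 64 */
		size_t end = (n - i > 64) ? i + 64 : n;

		for (; i < end; i++) {
			sum += p[i];
			/* Normal add, then shift the carry into cbits. */
			cbits = (cbits << 1) | (sum < p[i]);
		}
		/* After at most 64 words, count the collected carries. */
		ncarry += (uint64_t)__builtin_popcountll(cbits);
	}

	/* Add the carries back in, with one last end-around carry. */
	sum += ncarry;
	sum += (sum < ncarry);
	return sum;
}
```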

I've not tried this on an old x86 cpu - which would need a software
'popcnt'. It got close to 8 bytes/clock on Ivy bridge.
It almost certainly beats the 4 bytes/clock of the current x86-64
code on such systems.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: z constraint in powerpc inline assembly ?
  2020-01-16 15:54 ` David Laight
@ 2020-01-16 16:21   ` Segher Boessenkool
  2020-01-16 16:52     ` David Laight
  2020-01-16 17:10     ` Christophe Leroy
  0 siblings, 2 replies; 10+ messages in thread
From: Segher Boessenkool @ 2020-01-16 16:21 UTC (permalink / raw)
  To: David Laight; +Cc: linuxppc-dev

Hi!

On Thu, Jan 16, 2020 at 03:54:58PM +0000, David Laight wrote:
> If you are trying to 'loop carry' the 'carry flag' with 'add with carry'
> instructions you'll almost certainly need to write the loop in asm.
> Since the loop itself is simple, this probably doesn't matter.

Agreed.

> However a loop of 'add with carry' instructions may not be the
> fastest code by any means.
> Because the carry flag is needed for every 'adc' you can't do more
> than one adc per clock.
> This limits you to 8 bytes/clock on a 64bit system - even one
> that can schedule multiple memory reads and lots of instructions
> every clock.
> 
> I don't know ppc, but on x86 you don't even get 1 adc per clock
> until very recent (Haswell I think) cpus.
> Sandy/Ivy bridge will do so if you add to alternate registers.

The carry bit is renamed just fine on all modern Power cpus.  On Power9
there is an extra carry bit, precisely so you can do two interleaved
chains.  And you can run lots of these insns at once, every cycle.

On older cpus there were other limitations as well, but those have
essentially been solved.

> For earlier CPUs it is actually difficult to beat the 4 bytes/clock
> you get by adding 32bit values to a 64bit register in C code.

Christophe uses a very primitive 32-bit cpu, not even superscalar.  A
loop doing adde is pretty much optimal, probably wants some unrolling
though.

> One possibility is to do a normal add then shift the carry
> into a separate register.
> After 64 words use 'popcnt' to sum the carry bits.
> With 2 accumulators (and carry shifts) you'd need to
> break the loop every 1024 bytes.
> This should beat 8 bytes/clock if you can execute more than
> 1 memory read, one add and one shift each clock.

Do normal 64-bit adds, and in parallel also accumulate the values shifted
right by 32 bits.  You can add 4G of them this way, and restore the 96-bit
actual sum from these two accumulators, so that you can fold it to a proper
ones' complement sum after the loop.
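
A sketch of how the two accumulators could be combined, under the stated
limit of at most 2^32 words. The function name csum64_split is
hypothetical. With lo the sum of all words modulo 2^64 and hi the sum of
their upper halves, the number of carries out of bit 63 works out to
(hi >> 32) + (lo < (hi << 32)), because the sum of the lower halves stays
below 2^64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit ones' complement sum, valid for n <= 2^32. */
uint64_t csum64_split(const uint64_t *p, size_t n)
{
	uint64_t lo = 0;	/* sum of all words, modulo 2^64 */
	uint64_t hi = 0;	/* sum of the upper 32-bit halves, never wraps */
	uint64_t base_lo, carries;
	size_t i;

	for (i = 0; i < n; i++) {
		lo += p[i];		/* normal 64-bit add, carry ignored */
		hi += p[i] >> 32;	/* independent of the add above */
	}

	/*
	 * The true sum is (hi << 32) + (sum of the lower halves).  Its bits
	 * 64..95 are hi >> 32, plus one if adding the lower halves to the
	 * low 64 bits of (hi << 32) wrapped, which shows up as lo < base_lo.
	 */
	base_lo = hi << 32;
	carries = (hi >> 32) + (lo < base_lo);

	/* Fold the 96-bit sum back to 64 bits with an end-around carry. */
	lo += carries;
	lo += (lo < carries);
	return lo;
}
```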

But you can easily beat 8B/clock using vectors, or doing multiple addition
chains (interleaved) in parallel.  Not that it helps, your limiting factor
is the memory bandwidth anyway, if anything in the memory pipeline stalls
all your optimisations are for nothing.
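
The "multiple addition chains" variant might look like the sketch below,
assuming two independent accumulators that are only combined after the
loop. The function name csum64_two_chains is invented for illustration;
whether the two chains actually overlap in hardware depends on the cpu and
on the code the compiler generates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: two interleaved ones' complement addition chains. */
uint64_t csum64_two_chains(const uint64_t *p, size_t n)
{
	uint64_t a = 0, b = 0;
	size_t i;

	/* Even words go to chain a, odd words to chain b. */
	for (i = 0; i + 1 < n; i += 2) {
		a += p[i];
		a += (a < p[i]);	/* end-around carry, chain a */
		b += p[i + 1];
		b += (b < p[i + 1]);	/* end-around carry, chain b */
	}

	/* Leftover word when n is odd. */
	if (i < n) {
		a += p[i];
		a += (a < p[i]);
	}

	/* Merge the two chains. */
	a += b;
	a += (a < b);
	return a;
}
```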


Segher


* RE: z constraint in powerpc inline assembly ?
  2020-01-16 16:21   ` Segher Boessenkool
@ 2020-01-16 16:52     ` David Laight
  2020-01-16 17:10     ` Christophe Leroy
  1 sibling, 0 replies; 10+ messages in thread
From: David Laight @ 2020-01-16 16:52 UTC (permalink / raw)
  To: 'Segher Boessenkool'; +Cc: linuxppc-dev

From: Segher Boessenkool
> Sent: 16 January 2020 16:22
...
> > However a loop of 'add with carry' instructions may not be the
> > fastest code by any means.
> > Because the carry flag is needed for every 'adc' you can't do more
> > than one adc per clock.
> > This limits you to 8 bytes/clock on a 64bit system - even one
> > that can schedule multiple memory reads and lots of instructions
> > every clock.
> >
> > I don't know ppc, but on x86 you don't even get 1 adc per clock
> > until very recent (Haswell I think) cpus.
> > Sandy/Ivy bridge will do so if you add to alternate registers.
> 
> The carry bit is renamed just fine on all modern Power cpus.  On Power9
> there is an extra carry bit, precisely so you can do two interleaved
> chains.  And you can run lots of these insns at once, every cycle.

The limitation on old x86 was that each u-op could only have 2 inputs.
Since adc needs 3 it always took 2 clocks.
The first 'fix' still had an extra delay on the result register.

There is also a big problem of false dependencies against the flags.
PPC may not have this problem, but it makes it very difficult to
loop carry any of the flags.
Using 'dec' (which doesn't affect carry, but does set zero) is really slow.

Even though the latest x86 CPUs have ADOX and ADCX (which use the
overflow and carry flags), and these can run in parallel, the LOOP 'dec
jump non-zero' instruction is microcoded and serialising!
I have got 12 bytes/clock without too much unrolling, but it is hard
work and probably not worth the effort.

...
> Christophe uses a very primitive 32-bit cpu, not even superscalar.  A
> loop doing adde is pretty much optimal, probably wants some unrolling
> though.

Or interleaving so it does read_a, [read_b, adc_a, read_a, adc_b]* adc_a.
That might be enough to get the loop 'for free' if there are memory stalls.

> Do normal 64-bit adds, and in parallel also accumulate the values shifted
> right by 32 bits.  You can add 4G of them this way, and restore the 96-bit
> actual sum from these two accumulators, so that you can fold it to a proper
> ones' complement sum after the loop.

That is probably too many instructions per word - unless you are using
simd ones.

> But you can easily beat 8B/clock using vectors, or doing multiple addition
> chains (interleaved) in parallel.  Not that it helps, your limiting factor
> is the memory bandwidth anyway, if anything in the memory pipeline stalls
> all your optimisations are for nothing.

Yep, if the data isn't in the L1 cache anything complex is a waste of time.

Unrolling too much just makes the top/bottom code take too long and
then it dominates for a lot of 'real world' buffers.

	David


* Re: z constraint in powerpc inline assembly ?
  2020-01-16 16:21   ` Segher Boessenkool
  2020-01-16 16:52     ` David Laight
@ 2020-01-16 17:10     ` Christophe Leroy
  2020-01-16 17:20       ` David Laight
  1 sibling, 1 reply; 10+ messages in thread
From: Christophe Leroy @ 2020-01-16 17:10 UTC (permalink / raw)
  To: Segher Boessenkool, David Laight; +Cc: linuxppc-dev



On 16/01/2020 at 17:21, Segher Boessenkool wrote:
> Christophe uses a very primitive 32-bit cpu, not even superscalar.  A
> loop doing adde is pretty much optimal, probably wants some unrolling
> though.

You mean the mpc8xx, but I'm also using the mpc832x, which has an e300c2
core and is capable of executing 2 insns in parallel if they are not in
the same unit.

Christophe


* RE: z constraint in powerpc inline assembly ?
  2020-01-16 17:10     ` Christophe Leroy
@ 2020-01-16 17:20       ` David Laight
  0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2020-01-16 17:20 UTC (permalink / raw)
  To: 'Christophe Leroy', Segher Boessenkool; +Cc: linuxppc-dev

> You mean the mpc8xx, but I'm also using the mpc832x, which has an e300c2
> core and is capable of executing 2 insns in parallel if they are not in
> the same unit.

That should let you do a memory read and an add.
(I can't remember if the ppc has 'add from memory' but that is
likely to use both units anyway.)
An infinitely unrolled loop will then be 4 bytes/clock (for 32bit).
If you get to 3 for a real loop you are doing ok.

Remember, unroll too much and you displace other code from
the i-cache. Also the i-cache loads themselves kill you.
(A hot-cache benchmark won't see this...)

	David


* Re: [PosibleSpam] Re: z constraint in powerpc inline assembly ?
  2020-01-16 13:57   ` Segher Boessenkool
@ 2020-01-16 17:42     ` Gabriel Paubert
  0 siblings, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2020-01-16 17:42 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: linuxppc-dev

On Thu, Jan 16, 2020 at 07:57:29AM -0600, Segher Boessenkool wrote:
> On Thu, Jan 16, 2020 at 09:06:08AM +0100, Gabriel Paubert wrote:
> > On Thu, Jan 16, 2020 at 07:11:36AM +0100, Christophe Leroy wrote:
> > > Hi Segher,
> > > 
> > > I'm trying to see if we could enhance TCP checksum calculations by splitting
> > > inline assembly blocks to give GCC the opportunity to mix it with other
> > > stuff, but I'm getting difficulties with the carry.
> > > 
> > > As far as I can read in the documentation, the z constraint represents
> > > '‘XER[CA]’ carry bit (part of the XER register)'
> > 
> > Well, the documentation is very optimistic. From the GCC source code
> > (thanks for switching to git last weekend ;-)), it is clear that the
> > carry is not, for the time being, properly modeled.
> 
> What?  It certainly *is*, I spent ages on that back in 2014 and before.
> See gcc.gnu.org/PR64180 etc.
> 
> You can not put the carry as input or output to an asm, of course: no C
> variable can be assigned to it.
> 
> We don't do the "flag outputs" thing, either, as it is largely useless
> for Power (and using it would often make *worse* code).
> 
> If you want to access a carry, write C code that does that operation.
> The compiler knows how to optimise it well.
> 
> > Right now, in the machine description, all setters and users of the carry
> > are in the same block of generated instructions.
> 
> No, they are not.  For over five years now.  (Since GCC 5).
> 
> > For a start, all single-instruction patterns that set the carry (and
> > do not use it) as a side effect should mention that they clobber the
> > carry, otherwise inserting one between a setter and a user of the carry
> > would break.
> 
> And they do.
>

Apologies, I don't know how I could misread the .md files this badly.
Indeed I see everything now that you mention it.

I'm still a bit surprised that I have found zero "z" constraints in the
whole gcc/config/rs6000 directory. Everything seems to be CA_REGNO.

> All asms that change the carry should mention that, too, but this is
> automatically done for all inline asms, because there was a lot of code
> in the wild that does not clobber it.

I was not aware of this. In any case, I would always give my inline
assembly code clobbers that are as correct as possible.

> 
> > This includes all arithmetic right shifts (sra[wd]{,i}),
> > subfic, addic{,\.}, and I may have forgotten some.
> 
> {add,subf}{ic,c,e,ze,me} and sra[wd][i] and their dots.  Sure.  And
> mcrxr and mcrxrx and mfxer and mtxer.  That's about it.

Yes, but are the last ones (the moves) ever generated by the compiler?

Looking at the source (again) it seems that even lswi has disappeared.

> 
> We don't model the second carry at all yet btw, in GCC.  Not too many
> people know it exists even, so no big loss there.
> 

Anyway, I couldn't use it. I tried to buy a Talos II at work but
management made it too complex to negotiate. The problem was not the
money, but the paperwork :-(. Now my most powerful PPC machine is a 17" 
Powerbook G4.

> (One nasty was that addi. does not exist, so we used addic. where it was
> wanted before, so that had to change.)
> 
> 
> Segher


	Regards,
	Gabriel
