linux-kernel.vger.kernel.org archive mirror
* [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
@ 2021-08-23 15:29 Christophe Leroy
  2021-08-23 18:46 ` Segher Boessenkool
  2021-11-02 10:11 ` Michael Ellerman
  0 siblings, 2 replies; 8+ messages in thread
From: Christophe Leroy @ 2021-08-23 15:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev

Instructions lmw/stmw are interesting for functions that are rarely
used and not in the cache, because only one instruction is to be
copied into the instruction cache instead of 19. However, those
instructions are less performant than 19x raw lwz/stw as they require
synchronisation plus one additional cycle.

SAVE_NVGPRS / REST_NVGPRS are used in only a few places, mostly in
interrupt entries/exits and in task switch, so they are likely
already in the cache.

Using standard lwz/stw improves the null_syscall selftest by:
- 10 cycles on mpc832x.
- 2 cycles on mpc8xx.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/ppc_asm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index ffe712307e11..349fc0ec0dbb 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -28,8 +28,8 @@
 #else
 #define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
 #define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
-#define SAVE_NVGPRS(base)	stmw	13, GPR0+4*13(base)
-#define REST_NVGPRS(base)	lmw	13, GPR0+4*13(base)
+#define SAVE_NVGPRS(base)	SAVE_GPR(13, base); SAVE_8GPRS(14, base); SAVE_10GPRS(22, base)
+#define REST_NVGPRS(base)	REST_GPR(13, base); REST_8GPRS(14, base); REST_10GPRS(22, base)
 #endif
 
 #define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
-- 
2.25.0



* Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
  2021-08-23 15:29 [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs Christophe Leroy
@ 2021-08-23 18:46 ` Segher Boessenkool
  2021-08-24  5:54   ` Christophe Leroy
  2021-11-02 10:11 ` Michael Ellerman
  1 sibling, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2021-08-23 18:46 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel

On Mon, Aug 23, 2021 at 03:29:12PM +0000, Christophe Leroy wrote:
> Instructions lmw/stmw are interesting for functions that are rarely
> used and not in the cache, because only one instruction is to be
> copied into the instruction cache instead of 19. However those
> instruction are less performant than 19x raw lwz/stw as they require
> synchronisation plus one additional cycle.

lmw takes N+2 cycles for loading N words on 603/604/750/7400, and N+3 on
7450.  stmw takes N+1 cycles for storing N words on 603, N+2 on 604/750/
7400, and N+3 on 7450 (load latency is 3 instead of 2 on 7450).

There is no synchronisation needed, although there is some serialisation,
which of course doesn't mean much since there can be only 6 or 8 or so
insns executing at once anyway.

So, these insns are almost never slower, they can easily win cycles back
because of the smaller code, too.

What 32-bit core do you see where load/store multiple are more than a
fraction of a cycle (per memory access) slower?

> SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
> mostly in interrupts entries/exits and in task switch so they are
> likely already in the cache.

Nothing is likely in the cache on the older cores (except in
microbenchmarks), the caches are not big enough for that!

> Using standard lwz improves null_syscall selftest by:
> - 10 cycles on mpc832x.
> - 2 cycles on mpc8xx.

And in real benchmarks?

On mpccore both lmw and stmw are only N+1 btw.  But the serialization
might cost another cycle here?


Segher


* Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
  2021-08-23 18:46 ` Segher Boessenkool
@ 2021-08-24  5:54   ` Christophe Leroy
  2021-08-24 13:16     ` Segher Boessenkool
  0 siblings, 1 reply; 8+ messages in thread
From: Christophe Leroy @ 2021-08-24  5:54 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel



On 23/08/2021 at 20:46, Segher Boessenkool wrote:
> On Mon, Aug 23, 2021 at 03:29:12PM +0000, Christophe Leroy wrote:
>> Instructions lmw/stmw are interesting for functions that are rarely
>> used and not in the cache, because only one instruction is to be
>> copied into the instruction cache instead of 19. However those
>> instruction are less performant than 19x raw lwz/stw as they require
>> synchronisation plus one additional cycle.
> 
> lmw takes N+2 cycles for loading N words on 603/604/750/7400, and N+3 on
> 7450.  stmw takes N+1 cycles for storing N words on 603, N+2 on 604/750/
> 7400, and N+3 on 7450 (load latency is 3 instead of 2 on 7450).
> 
> There is no synchronisation needed, although there is some serialisation,
> which of course doesn't mean much since there can be only 6 or 8 or so
> insns executing at once anyway.

Yes, I meant serialisation. Isn't it the same as synchronisation?

> 
> So, these insns are almost never slower, they can easily win cycles back
> because of the smaller code, too.
> 
> What 32-bit core do you see where load/store multiple are more than a
> fraction of a cycle (per memory access) slower?
> 
>> SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
>> mostly in interrupts entries/exits and in task switch so they are
>> likely already in the cache.
> 
> Nothing is likely in the cache on the older cores (except in
> microbenchmarks), the caches are not big enough for that!

Even syscall entry/exit paths and/or the most frequent interrupt entries and exits?


> 
>> Using standard lwz improves null_syscall selftest by:
>> - 10 cycles on mpc832x.
>> - 2 cycles on mpc8xx.
> 
> And in real benchmarks?

I don't know. What benchmark should I use to evaluate syscall entry/exit if the 'null_syscall'
selftest is not relevant?

> 
> On mpccore both lmw and stmw are only N+1 btw.  But the serialization
> might cost another cycle here?
> 

That is consistent with the MPC8xx, where the difference is only 2 cycles.
But on the mpc832x, which has an e300c2 core, I see a 10-cycle difference. Is anything
wrong?

Christophe


* Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
  2021-08-24  5:54   ` Christophe Leroy
@ 2021-08-24 13:16     ` Segher Boessenkool
  2021-08-24 15:28       ` Segher Boessenkool
  2021-08-25  9:39       ` Christophe Leroy
  0 siblings, 2 replies; 8+ messages in thread
From: Segher Boessenkool @ 2021-08-24 13:16 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel

Hi!

On Tue, Aug 24, 2021 at 07:54:22AM +0200, Christophe Leroy wrote:
> On 23/08/2021 at 20:46, Segher Boessenkool wrote:
> >On Mon, Aug 23, 2021 at 03:29:12PM +0000, Christophe Leroy wrote:
> >>Instructions lmw/stmw are interesting for functions that are rarely
> >>used and not in the cache, because only one instruction is to be
> >>copied into the instruction cache instead of 19. However those
> >>instruction are less performant than 19x raw lwz/stw as they require
> >>synchronisation plus one additional cycle.
> >
> >lmw takes N+2 cycles for loading N words on 603/604/750/7400, and N+3 on
> >7450.  stmw takes N+1 cycles for storing N words on 603, N+2 on 604/750/
> >7400, and N+3 on 7450 (load latency is 3 instead of 2 on 7450).
> >
> >There is no synchronisation needed, although there is some serialisation,
> >which of course doesn't mean much since there can be only 6 or 8 or so
> >insns executing at once anyway.
> 
> Yes I meant serialisation, isn't it the same as synchronisation ?

Ha no, synchronisation means insns like sync and eieio :-)  Synchronisation
is architectural, serialisation is (mostly) not, it is a feature of the
specific core.

> >So, these insns are almost never slower, they can easily win cycles back
> >because of the smaller code, too.
> >
> >What 32-bit core do you see where load/store multiple are more than a
> >fraction of a cycle (per memory access) slower?
> >
> >>SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
> >>mostly in interrupts entries/exits and in task switch so they are
> >>likely already in the cache.
> >
> >Nothing is likely in the cache on the older cores (except in
> >microbenchmarks), the caches are not big enough for that!
> 
> Even syscall entries/exit pathes and/or most frequent interrupts entries 
> and interrupt exit ?

It has to be measured.  You are probably right for programs that use a
lot of system calls, and (unmeasurably :-) ) wrong for those that don't.

So that is a good argument: it speeds up some scenarios, and does not
make any real impact on anything else.

This also does not replace all {l,st}mw in the kernel, only those on
interrupt paths.  So it is not necessarily bad :-)

> >>Using standard lwz improves null_syscall selftest by:
> >>- 10 cycles on mpc832x.
> >>- 2 cycles on mpc8xx.
> >
> >And in real benchmarks?
> 
> Don't know, what benchmark should I use to evaluate syscall entry/exit if 
> 'null_syscall' selftest is not relevant ?

Some real workload (something that uses memory and computational insns a
lot, in addition to many syscalls).

> >On mpccore both lmw and stmw are only N+1 btw.  But the serialization
> >might cost another cycle here?
> 
> That coherent on MPC8xx, that's only 2 cycles.
> But on the mpc832x which has a e300c2 core, it looks like I have 10 cycles 
> difference. Is anything wrong ?

I don't know that core very well, I'll have a look.


Segher


* Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
  2021-08-24 13:16     ` Segher Boessenkool
@ 2021-08-24 15:28       ` Segher Boessenkool
  2021-08-25  8:42         ` David Laight
  2021-08-25  9:39       ` Christophe Leroy
  1 sibling, 1 reply; 8+ messages in thread
From: Segher Boessenkool @ 2021-08-24 15:28 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel

On Tue, Aug 24, 2021 at 08:16:00AM -0500, Segher Boessenkool wrote:
> On Tue, Aug 24, 2021 at 07:54:22AM +0200, Christophe Leroy wrote:
> > >On mpccore both lmw and stmw are only N+1 btw.  But the serialization
> > >might cost another cycle here?
> > 
> > That coherent on MPC8xx, that's only 2 cycles.
> > But on the mpc832x which has a e300c2 core, it looks like I have 10 cycles 
> > difference. Is anything wrong ?
> 
> I don't know that core very well, I'll have a look.

So, I don't see any difference between e300c2 and e300c1 (which is 603
basically, for this) that is significant here.  The e300c2 has two
integer units instead of just one, but it still has only one load/store
unit, and I don't see anything else that could matter either.  Huh.


Segher


* RE: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
  2021-08-24 15:28       ` Segher Boessenkool
@ 2021-08-25  8:42         ` David Laight
  0 siblings, 0 replies; 8+ messages in thread
From: David Laight @ 2021-08-25  8:42 UTC (permalink / raw)
  To: 'Segher Boessenkool', Christophe Leroy
  Cc: linuxppc-dev, Paul Mackerras, linux-kernel

From: Segher Boessenkool
> Sent: 24 August 2021 16:28

> 
> On Tue, Aug 24, 2021 at 08:16:00AM -0500, Segher Boessenkool wrote:
> > On Tue, Aug 24, 2021 at 07:54:22AM +0200, Christophe Leroy wrote:
> > > >On mpccore both lmw and stmw are only N+1 btw.  But the serialization
> > > >might cost another cycle here?
> > >
> > > That coherent on MPC8xx, that's only 2 cycles.
> > > But on the mpc832x which has a e300c2 core, it looks like I have 10 cycles
> > > difference. Is anything wrong ?
> >
> > I don't know that core very well, I'll have a look.
> 
> So, I don't see any difference between e300c2 and e300c1 (which is 603
> basically, for this) that is significant here.  The e300c2 has two
> integer units instead of just one, but it still has only one load/store
> unit, and I don't see anything else that could matter either.  Huh.

Is the cpu as brain-damaged as the (old) strongarm (SA1100 etc)
where ldm/stm always took 1 clock to check each register bit
regardless of the number of registers to copy?
(IIRC it also took the same length of time when conditionally not
executed.)

If x86 had ever had ldm/stm then it would end up being a microcoded
instruction and take forever to decode.
Intel never managed to optimise 'loop' (dec %cx and jump nz).

	David




* Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
  2021-08-24 13:16     ` Segher Boessenkool
  2021-08-24 15:28       ` Segher Boessenkool
@ 2021-08-25  9:39       ` Christophe Leroy
  1 sibling, 0 replies; 8+ messages in thread
From: Christophe Leroy @ 2021-08-25  9:39 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel



On 24/08/2021 at 15:16, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Aug 24, 2021 at 07:54:22AM +0200, Christophe Leroy wrote:
>> On 23/08/2021 at 20:46, Segher Boessenkool wrote:
>>> On Mon, Aug 23, 2021 at 03:29:12PM +0000, Christophe Leroy wrote:
>>>> Instructions lmw/stmw are interesting for functions that are rarely
>>>> used and not in the cache, because only one instruction is to be
>>>> copied into the instruction cache instead of 19. However those
>>>> instruction are less performant than 19x raw lwz/stw as they require
>>>> synchronisation plus one additional cycle.
>>>
>>> lmw takes N+2 cycles for loading N words on 603/604/750/7400, and N+3 on
>>> 7450.  stmw takes N+1 cycles for storing N words on 603, N+2 on 604/750/
>>> 7400, and N+3 on 7450 (load latency is 3 instead of 2 on 7450).
>>>
>>> There is no synchronisation needed, although there is some serialisation,
>>> which of course doesn't mean much since there can be only 6 or 8 or so
>>> insns executing at once anyway.
>>
>> Yes I meant serialisation, isn't it the same as synchronisation ?
> 
> Ha no, synchronisation are insns like sync and eieio :-)  Synchronisation
> is architectural, serialisation is (mostly) not, it is a feature of the
> specific core.
> 
>>> So, these insns are almost never slower, they can easily win cycles back
>>> because of the smaller code, too.
>>>
>>> What 32-bit core do you see where load/store multiple are more than a
>>> fraction of a cycle (per memory access) slower?
>>>
>>>> SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
>>>> mostly in interrupts entries/exits and in task switch so they are
>>>> likely already in the cache.
>>>
>>> Nothing is likely in the cache on the older cores (except in
>>> microbenchmarks), the caches are not big enough for that!
>>
>> Even syscall entries/exit pathes and/or most frequent interrupts entries
>> and interrupt exit ?
> 
> It has to be measured.  You are probably right for programs that use a
> lot of system calls, and (unmeasurably :-) ) wrong for those that don't.
> 
> So that is a good argument: it speeds up some scenarios, and does not
> make any real impact on anything else.
> 
> This also does not replace all {l,st}mw in the kernel, only those on
> interrupt paths.  So it is not necessarily bad :-)

Yes, exactly. I wanted to focus on interrupt paths, which are the bottleneck.

So I take it that, in the end, you don't disagree with the change.

By the way, note that later versions of GCC make less and less use of lmw/stmw. See for
example show_user_instructions():

c0007114 <show_user_instructions>:
c0007114:	94 21 ff 50 	stwu    r1,-176(r1)
c0007118:	7d 80 00 26 	mfcr    r12
c000711c:	7c 08 02 a6 	mflr    r0
c0007120:	93 01 00 90 	stw     r24,144(r1)
c0007124:	93 21 00 94 	stw     r25,148(r1)
c0007128:	93 41 00 98 	stw     r26,152(r1)
c000712c:	93 61 00 9c 	stw     r27,156(r1)
c0007130:	93 81 00 a0 	stw     r28,160(r1)
c0007134:	93 c1 00 a8 	stw     r30,168(r1)
c0007138:	91 81 00 8c 	stw     r12,140(r1)
c000713c:	90 01 00 b4 	stw     r0,180(r1)
c0007140:	93 a1 00 a4 	stw     r29,164(r1)
c0007144:	93 e1 00 ac 	stw     r31,172(r1)
...
c0007244:	80 01 00 b4 	lwz     r0,180(r1)
c0007248:	81 81 00 8c 	lwz     r12,140(r1)
c000724c:	83 01 00 90 	lwz     r24,144(r1)
c0007250:	83 21 00 94 	lwz     r25,148(r1)
c0007254:	83 41 00 98 	lwz     r26,152(r1)
c0007258:	83 61 00 9c 	lwz     r27,156(r1)
c000725c:	83 81 00 a0 	lwz     r28,160(r1)
c0007260:	83 a1 00 a4 	lwz     r29,164(r1)
c0007264:	83 c1 00 a8 	lwz     r30,168(r1)
c0007268:	83 e1 00 ac 	lwz     r31,172(r1)
c000726c:	7c 08 03 a6 	mtlr    r0
c0007270:	7d 80 81 20 	mtcrf   8,r12
c0007274:	38 21 00 b0 	addi    r1,r1,176
c0007278:	4e 80 00 20 	blr


With an older version (GCC 5.5 here) it used to be:

00000408 <show_user_instructions>:
  408:	7c 08 02 a6 	mflr    r0
  40c:	94 21 ff 40 	stwu    r1,-192(r1)
  410:	7d 80 00 26 	mfcr    r12
  414:	be a1 00 94 	stmw    r21,148(r1)
  418:	91 81 00 90 	stw     r12,144(r1)
  41c:	90 01 00 c4 	stw     r0,196(r1)
...
  504:	80 01 00 c4 	lwz     r0,196(r1)
  508:	81 81 00 90 	lwz     r12,144(r1)
  50c:	7c 08 03 a6 	mtlr    r0
  510:	ba a1 00 94 	lmw     r21,148(r1)
  514:	7d 80 81 20 	mtcrf   8,r12
  518:	38 21 00 c0 	addi    r1,r1,192
  51c:	4e 80 00 20 	blr

Christophe


* Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
  2021-08-23 15:29 [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs Christophe Leroy
  2021-08-23 18:46 ` Segher Boessenkool
@ 2021-11-02 10:11 ` Michael Ellerman
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Ellerman @ 2021-11-02 10:11 UTC (permalink / raw)
  To: Michael Ellerman, Paul Mackerras, Christophe Leroy,
	Benjamin Herrenschmidt
  Cc: linux-kernel, linuxppc-dev

On Mon, 23 Aug 2021 15:29:12 +0000 (UTC), Christophe Leroy wrote:
> Instructions lmw/stmw are interesting for functions that are rarely
> used and not in the cache, because only one instruction is to be
> copied into the instruction cache instead of 19. However those
> instruction are less performant than 19x raw lwz/stw as they require
> synchronisation plus one additional cycle.
> 
> SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
> mostly in interrupts entries/exits and in task switch so they are
> likely already in the cache.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
      https://git.kernel.org/powerpc/c/a85c728cb5e12216c19ae5878980c2cbbbf8616d

cheers

