* Optimised memset64/memset32 for powerpc
@ 2017-03-20 21:14 Matthew Wilcox
  2017-03-20 21:23 ` Benjamin Herrenschmidt
  2017-03-21 12:23 ` Christophe LEROY
  0 siblings, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2017-03-20 21:14 UTC (permalink / raw)
  To: paulus, benh, mpe; +Cc: linuxppc-dev

I recently introduced memset32() / memset64().  I've done implementations
for x86 & ARM; akpm has agreed to take the patchset through his tree.
Do you fancy doing a powerpc version?  Minchan Kim got a 7% performance
increase with zram from switching to the optimised version on x86.

Here's the development git tree:
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/memfill
(most recent 7 commits)

ARM probably offers the best model for you to work from; it's basically
just a case of jumping into the middle of your existing memset loop.
It was only three instructions to add support to ARM, but I don't know
PowerPC well enough to understand how your existing memset works.
I'd start with something like this ... note that you don't have to
implement memset64 on 32-bit; I only did it on ARM because it was free.
It doesn't look free for you as you only store one register each time
around the loop in the 32-bit memset implementation:

1:      stwu    r4,4(r6)
        bdnz    1b

(wouldn't you get better performance on 32-bit powerpc by unrolling that
loop like you do on 64-bit?)

diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
index da3cdffca440..c02392fced98 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -6,6 +6,7 @@
 #define __HAVE_ARCH_STRNCPY
 #define __HAVE_ARCH_STRNCMP
 #define __HAVE_ARCH_MEMSET
+#define __HAVE_ARCH_MEMSET_PLUS
 #define __HAVE_ARCH_MEMCPY
 #define __HAVE_ARCH_MEMMOVE
 #define __HAVE_ARCH_MEMCMP
@@ -23,6 +24,18 @@ extern void * memmove(void *,const void *,__kernel_size_t);
 extern int memcmp(const void *,const void *,__kernel_size_t);
 extern void * memchr(const void *,int,__kernel_size_t);
 
+extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
+static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
+{
+	return __memset32(p, v, n * 4);
+}
+
+extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
+static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
+{
+	return __memset64(p, v, n * 8);
+}
+
 #endif /* __KERNEL__ */
 
 #endif	/* _ASM_POWERPC_STRING_H */

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-20 21:14 Optimised memset64/memset32 for powerpc Matthew Wilcox
@ 2017-03-20 21:23 ` Benjamin Herrenschmidt
  2017-03-21 12:23 ` Christophe LEROY
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2017-03-20 21:23 UTC (permalink / raw)
  To: Matthew Wilcox, paulus, mpe, Anton Blanchard; +Cc: linuxppc-dev

On Mon, 2017-03-20 at 14:14 -0700, Matthew Wilcox wrote:
> I recently introduced memset32() / memset64().  I've done implementations
> for x86 & ARM; akpm has agreed to take the patchset through his tree.
> Do you fancy doing a powerpc version?  Minchan Kim got a 7% performance
> increase with zram from switching to the optimised version on x86.

+Anton

Thanks Matthew!

> Here's the development git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/memfill
> (most recent 7 commits)
> 
> ARM probably offers the best model for you to work from; it's basically
> just a case of jumping into the middle of your existing memset loop.
> It was only three instructions to add support to ARM, but I don't know
> PowerPC well enough to understand how your existing memset works.
> I'd start with something like this ... note that you don't have to
> implement memset64 on 32-bit; I only did it on ARM because it was free.
> It doesn't look free for you as you only store one register each time
> around the loop in the 32-bit memset implementation:
> 
> 1:      stwu    r4,4(r6)
>         bdnz    1b
> 
> (wouldn't you get better performance on 32-bit powerpc by unrolling that
> loop like you do on 64-bit?)
> 
> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
> index da3cdffca440..c02392fced98 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -6,6 +6,7 @@
>  #define __HAVE_ARCH_STRNCPY
>  #define __HAVE_ARCH_STRNCMP
>  #define __HAVE_ARCH_MEMSET
> +#define __HAVE_ARCH_MEMSET_PLUS
>  #define __HAVE_ARCH_MEMCPY
>  #define __HAVE_ARCH_MEMMOVE
>  #define __HAVE_ARCH_MEMCMP
> @@ -23,6 +24,18 @@ extern void * memmove(void *,const void *,__kernel_size_t);
>  extern int memcmp(const void *,const void *,__kernel_size_t);
>  extern void * memchr(const void *,int,__kernel_size_t);
>  
> +extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
> +static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
> +{
> +	return __memset32(p, v, n * 4);
> +}
> +
> +extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
> +static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
> +{
> +	return __memset64(p, v, n * 8);
> +}
> +
>  #endif /* __KERNEL__ */
>  
>  #endif	/* _ASM_POWERPC_STRING_H */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-20 21:14 Optimised memset64/memset32 for powerpc Matthew Wilcox
  2017-03-20 21:23 ` Benjamin Herrenschmidt
@ 2017-03-21 12:23 ` Christophe LEROY
  2017-03-21 13:29   ` Matthew Wilcox
  1 sibling, 1 reply; 20+ messages in thread
From: Christophe LEROY @ 2017-03-21 12:23 UTC (permalink / raw)
  To: Matthew Wilcox, paulus, benh, mpe; +Cc: linuxppc-dev

Hi Matthew

On 20/03/2017 at 22:14, Matthew Wilcox wrote:
> I recently introduced memset32() / memset64().  I've done implementations
> for x86 & ARM; akpm has agreed to take the patchset through his tree.
> Do you fancy doing a powerpc version?  Minchan Kim got a 7% performance
> increase with zram from switching to the optimised version on x86.
>
> Here's the development git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/memfill
> (most recent 7 commits)
>
> ARM probably offers the best model for you to work from; it's basically
> just a case of jumping into the middle of your existing memset loop.
> It was only three instructions to add support to ARM, but I don't know
> PowerPC well enough to understand how your existing memset works.
> I'd start with something like this ... note that you don't have to
> implement memset64 on 32-bit; I only did it on ARM because it was free.
> It doesn't look free for you as you only store one register each time
> around the loop in the 32-bit memset implementation:
>
> 1:      stwu    r4,4(r6)
>         bdnz    1b
>
> (wouldn't you get better performance on 32-bit powerpc by unrolling that
> loop like you do on 64-bit?)

In arch/powerpc/lib/copy_32.S, the implementation of memset() is 
optimised when the value to be set is zero. It makes use of the 'dcbz' 
instruction, which zeroes a complete cache line.

Not much effort has been put into optimising non-zero memset() because 
such calls are almost nonexistent.

Unrolling the loop could help a bit on old powerpc32s that don't have 
branch units, but on those processors the dominant cost is the time 
spent doing the actual writes to memory, and the extra instructions 
needed to unroll the loop are not worth the single cycle the branch 
adds.

On more modern powerpc32s, the branch unit means that branches have 
zero cost.

A simple static inline C function would probably do the job, based on 
what I get below:

void memset32(int *p, int v, unsigned int c)
{
	int i;

	for (i = 0; i < c; i++)
		*p++ = v;
}

void memset64(long long *p, long long v, unsigned int c)
{
	int i;

	for (i = 0; i < c; i++)
		*p++ = v;
}

test.o:     file format elf32-powerpc


Disassembly of section .text:

00000000 <memset32>:
    0:	2c 05 00 00 	cmpwi   r5,0
    4:	38 63 ff fc 	addi    r3,r3,-4
    8:	4d 82 00 20 	beqlr
    c:	7c a9 03 a6 	mtctr   r5
   10:	94 83 00 04 	stwu    r4,4(r3)
   14:	42 00 ff fc 	bdnz    10 <memset32+0x10>
   18:	4e 80 00 20 	blr

0000001c <memset64>:
   1c:	2c 07 00 00 	cmpwi   r7,0
   20:	7c cb 33 78 	mr      r11,r6
   24:	7c aa 2b 78 	mr      r10,r5
   28:	38 63 ff f8 	addi    r3,r3,-8
   2c:	4d 82 00 20 	beqlr
   30:	7c e9 03 a6 	mtctr   r7
   34:	95 43 00 08 	stwu    r10,8(r3)
   38:	91 63 00 04 	stw     r11,4(r3)
   3c:	42 00 ff f8 	bdnz    34 <memset64+0x18>
   40:	4e 80 00 20 	blr



Christophe

>
> diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
> index da3cdffca440..c02392fced98 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -6,6 +6,7 @@
>  #define __HAVE_ARCH_STRNCPY
>  #define __HAVE_ARCH_STRNCMP
>  #define __HAVE_ARCH_MEMSET
> +#define __HAVE_ARCH_MEMSET_PLUS
>  #define __HAVE_ARCH_MEMCPY
>  #define __HAVE_ARCH_MEMMOVE
>  #define __HAVE_ARCH_MEMCMP
> @@ -23,6 +24,18 @@ extern void * memmove(void *,const void *,__kernel_size_t);
>  extern int memcmp(const void *,const void *,__kernel_size_t);
>  extern void * memchr(const void *,int,__kernel_size_t);
>
> +extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
> +static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
> +{
> +	return __memset32(p, v, n * 4);
> +}
> +
> +extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
> +static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
> +{
> +	return __memset64(p, v, n * 8);
> +}
> +
>  #endif /* __KERNEL__ */
>
>  #endif	/* _ASM_POWERPC_STRING_H */
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-21 12:23 ` Christophe LEROY
@ 2017-03-21 13:29   ` Matthew Wilcox
  2017-03-21 16:45     ` Segher Boessenkool
  2017-03-21 21:26     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2017-03-21 13:29 UTC (permalink / raw)
  To: Christophe LEROY; +Cc: paulus, benh, mpe, linuxppc-dev

On Tue, Mar 21, 2017 at 01:23:36PM +0100, Christophe LEROY wrote:
> > It doesn't look free for you as you only store one register each time
> > around the loop in the 32-bit memset implementation:
> > 
> > 1:      stwu    r4,4(r6)
> >         bdnz    1b
> > 
> > (wouldn't you get better performance on 32-bit powerpc by unrolling that
> > loop like you do on 64-bit?)
> 
> In arch/powerpc/lib/copy_32.S, the implementation of memset() is optimised
> when the value to be set is zero. It makes use of the 'dcbz' instruction,
> which zeroes a complete cache line.
> 
> Not much effort has been put into optimising non-zero memset() because
> such calls are almost nonexistent.

Yes, bzero() is much more common than setting an 8-bit pattern.
And setting an 8-bit pattern is almost certainly more common than setting
a 32- or 64-bit pattern.

> Unrolling the loop could help a bit on old powerpc32s that don't have
> branch units, but on those processors the dominant cost is the time spent
> doing the actual writes to memory, and the extra instructions needed to
> unroll the loop are not worth the single cycle the branch adds.
> 
> On more modern powerpc32s, the branch unit means that branches have zero
> cost.

Fair enough.  I'm just surprised it was worth unrolling the loop on
powerpc64 and not on powerpc32 -- see mem_64.S.

> A simple static inline C function would probably do the job, based on what I
> get below:
> 
> void memset32(int *p, int v, unsigned int c)
> {
> 	int i;
> 
> 	for (i = 0; i < c; i++)
> 		*p++ = v;
> }
> 
> void memset64(long long *p, long long v, unsigned int c)
> {
> 	int i;
> 
> 	for (i = 0; i < c; i++)
> 		*p++ = v;
> }

Well, those are the generic versions in the first patch:

http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b9776ac925199969bd5af4e994da776d461e7

so if those are good enough for you guys, there's no need for you to
do anything.

Thanks for your time!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-21 13:29   ` Matthew Wilcox
@ 2017-03-21 16:45     ` Segher Boessenkool
  2017-03-21 21:26     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: Segher Boessenkool @ 2017-03-21 16:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Christophe LEROY, paulus, linuxppc-dev

On Tue, Mar 21, 2017 at 06:29:10AM -0700, Matthew Wilcox wrote:
> > Unrolling the loop could help a bit on old powerpc32s that don't have
> > branch units, but on those processors the dominant cost is the time spent
> > doing the actual writes to memory, and the extra instructions needed to
> > unroll the loop are not worth the single cycle the branch adds.
> > 
> > On more modern powerpc32s, the branch unit means that branches have zero
> > cost.
> 
> Fair enough.  I'm just surprised it was worth unrolling the loop on
> powerpc64 and not on powerpc32 -- see mem_64.S.

We can do at most one loop iteration per cycle, but we can do multiple
stores per cycle, on modern, bigger CPUs.  Many old or small CPUs have
only one load/store unit on the other hand.  There are other issues,
but that is the biggest difference.
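
As a simplified sketch (not taken from any of the patches in this
thread): on a core that can retire two stores per cycle, the unrolled
body below can approach twice the throughput of the rolled one, because
the stores no longer serialise on the loop bookkeeping.

#include <stddef.h>

/* rolled: one store per iteration; the loop itself becomes the limit */
void fill_rolled(unsigned long *p, unsigned long v, size_t n)
{
	while (n--)
		*p++ = v;
}

/* 4x unrolled; assumes n is a multiple of 4 to keep the sketch short */
void fill_unrolled(unsigned long *p, unsigned long v, size_t n)
{
	for (; n; n -= 4, p += 4) {
		p[0] = v;	/* these four stores are independent ... */
		p[1] = v;	/* ... so a wide core can issue several */
		p[2] = v;	/* of them per cycle */
		p[3] = v;
	}
}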


Segher

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-21 13:29   ` Matthew Wilcox
  2017-03-21 16:45     ` Segher Boessenkool
@ 2017-03-21 21:26     ` Benjamin Herrenschmidt
  2017-03-22 13:18       ` Matthew Wilcox
  1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2017-03-21 21:26 UTC (permalink / raw)
  To: Matthew Wilcox, Christophe LEROY; +Cc: paulus, mpe, linuxppc-dev

On Tue, 2017-03-21 at 06:29 -0700, Matthew Wilcox wrote:
> 
> Well, those are the generic versions in the first patch:
> 
> http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b977
> 6ac925199969bd5af4e994da776d461e7
> 
> so if those are good enough for you guys, there's no need for you to
> do anything.
> 
> Thanks for your time!

I suspect on ppc64 we can do much better, if nothing else by moving 64
bits at a time. Matthew, what are the main use cases of these?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-21 21:26     ` Benjamin Herrenschmidt
@ 2017-03-22 13:18       ` Matthew Wilcox
  2017-03-22 19:30         ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2017-03-22 13:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Christophe LEROY, paulus, mpe, linuxppc-dev

On Wed, Mar 22, 2017 at 08:26:12AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2017-03-21 at 06:29 -0700, Matthew Wilcox wrote:
> > 
> > Well, those are the generic versions in the first patch:
> > 
> > http://git.infradead.org/users/willy/linux-dax.git/commitdiff/538b977
> > 6ac925199969bd5af4e994da776d461e7
> > 
> > so if those are good enough for you guys, there's no need for you to
> > do anything.
> > 
> > Thanks for your time!
> 
> I suspect on ppc64 we can do much better, if nothing else by moving 64
> bits at a time. Matthew, what are the main use cases of these?

I've only converted two users so far -- zram was the initial inspiration
for this.  It notices when a page has a pattern in it which is
representable as a repetition of an 'unsigned long' (this seems to be
a relatively common thing for userspace to do -- not as common as an
entirely zero page, but common enough to be worth optimising for).  So it
may be doing an entire page worth of this to handle a page fault, or if
there's an I/O to such a page, it will be doing a multiple of 512 bytes.
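
The check zram does is roughly the following (a sketch of the idea, not
the exact code in drivers/block/zram):

/* returns true if the page is a repetition of one unsigned long */
static bool page_same_filled(unsigned long *page, unsigned long *element)
{
	size_t i;
	unsigned long val = page[0];

	for (i = 1; i < PAGE_SIZE / sizeof(*page); i++)
		if (page[i] != val)
			return false;
	*element = val;	/* the value memset64()/memset32() later re-expands */
	return true;
}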

The other user is sym53c8xx_2; it's an initialisation path thing, and
it saves a few bytes in the driver to call the optimised routine rather
than have its own loop to initialise the array.

I suspect we have additional places in the kernel that could use
memset32/memset64 -- look for loops which store a value which is not
dependent on the loop counter.  They're probably not on performance-critical
paths, though; I'd focus on zram as the case to optimise for.
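
In other words, the pattern to look for is roughly this (hypothetical
example):

	/* before: open-coded fill with a loop-invariant value */
	for (i = 0; i < n; i++)
		buf[i] = 0xdeadbeef;

	/* after */
	memset32(buf, 0xdeadbeef, n);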

There's one other set of potential users I've been wondering about: the
various console drivers.  They use 'memsetw' to blank the entire console
or lines of the console when scrolling, but the only architecture which
ever bothered implementing an optimised version of it was Alpha.

Might be worth it on powerpc actually ... better than a loop calling
cpu_to_le16() on each iteration.  That'd complete the set with a
memset16().

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-22 13:18       ` Matthew Wilcox
@ 2017-03-22 19:30         ` Matthew Wilcox
  2017-03-27 19:37           ` Naveen N. Rao
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2017-03-22 19:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Christophe LEROY, paulus, mpe, linuxppc-dev

On Wed, Mar 22, 2017 at 06:18:05AM -0700, Matthew Wilcox wrote:
> There's one other set of potential users I've been wondering about: the
> various console drivers.  They use 'memsetw' to blank the entire console
> or lines of the console when scrolling, but the only architecture which
> ever bothered implementing an optimised version of it was Alpha.
> 
> Might be worth it on powerpc actually ... better than a loop calling
> cpu_to_le16() on each iteration.  That'd complete the set with a
> memset16().

All hail plane rides ... This would need to be resplit and merged properly,
but I think it makes life a little saner.

I make no claims that the ARM assembly in here is correct.  The single
x86 instruction that I wrote^W copied and pasted appears to be correct by
my understanding of the instruction set.


diff --git a/arch/alpha/include/asm/string.h b/arch/alpha/include/asm/string.h
index c2911f591704..74c0a693b76b 100644
--- a/arch/alpha/include/asm/string.h
+++ b/arch/alpha/include/asm/string.h
@@ -65,13 +65,14 @@ extern void * memchr(const void *, int, size_t);
    aligned values.  The DEST and COUNT parameters must be even for 
    correct operation.  */
 
-#define __HAVE_ARCH_MEMSETW
-extern void * __memsetw(void *dest, unsigned short, size_t count);
-
-#define memsetw(s, c, n)						 \
-(__builtin_constant_p(c)						 \
- ? __constant_c_memset((s),0x0001000100010001UL*(unsigned short)(c),(n)) \
- : __memsetw((s),(c),(n)))
+#define __HAVE_ARCH_MEMSET16
+extern void * __memset16(void *dest, unsigned short, size_t count);
+static inline void *memset16(uint16_t *p, uint16_t v, size_t n)
+{
+	if (__builtin_constant_p(v))
+		return __constant_c_memset(p, 0x0001000100010001UL * v, n * 2);
+	return __memset16(p, v, n * 2);
+}
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/alpha/include/asm/vga.h b/arch/alpha/include/asm/vga.h
index c00106bac521..3c1c2b6128e7 100644
--- a/arch/alpha/include/asm/vga.h
+++ b/arch/alpha/include/asm/vga.h
@@ -34,7 +34,7 @@ static inline void scr_memsetw(u16 *s, u16 c, unsigned int count)
 	if (__is_ioaddr(s))
 		memsetw_io((u16 __iomem *) s, c, count);
 	else
-		memsetw(s, c, count);
+		memset16(s, c, count / 2);
 }
 
 /* Do not trust that the usage will be correct; analyze the arguments.  */
diff --git a/arch/alpha/lib/memset.S b/arch/alpha/lib/memset.S
index 89a26f5e89de..f824969e9e77 100644
--- a/arch/alpha/lib/memset.S
+++ b/arch/alpha/lib/memset.S
@@ -20,7 +20,7 @@
 	.globl memset
 	.globl __memset
 	.globl ___memset
-	.globl __memsetw
+	.globl __memset16
 	.globl __constant_c_memset
 
 	.ent ___memset
@@ -110,8 +110,8 @@ EXPORT_SYMBOL(___memset)
 EXPORT_SYMBOL(__constant_c_memset)
 
 	.align 5
-	.ent __memsetw
-__memsetw:
+	.ent __memset16
+__memset16:
 	.prologue 0
 
 	inswl $17,0,$1		/* E0 */
@@ -123,8 +123,8 @@ __memsetw:
 	or $1,$4,$17		/* E0 */
 	br __constant_c_memset	/* .. E1 */
 
-	.end __memsetw
-EXPORT_SYMBOL(__memsetw)
+	.end __memset16
+EXPORT_SYMBOL(__memset16)
 
 memset = ___memset
 __memset = ___memset
diff --git a/arch/arm/include/asm/string.h b/arch/arm/include/asm/string.h
index da88299f758b..bc7a1be7a76a 100644
--- a/arch/arm/include/asm/string.h
+++ b/arch/arm/include/asm/string.h
@@ -24,15 +24,22 @@ extern void * memchr(const void *, int, __kernel_size_t);
 #define __HAVE_ARCH_MEMSET
 extern void * memset(void *, int, __kernel_size_t);
 
-#define __HAVE_ARCH_MEMSET_PLUS
-extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
-extern void *__memset64(uint64_t *, uint32_t low, __kernel_size_t, uint32_t hi);
+#define __HAVE_ARCH_MEMSET16
+extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
+static inline void *memset16(uint16_t *p, uint16_t v, __kernel_size_t n)
+{
+	return __memset16(p, v, n * 2);
+}
 
+#define __HAVE_ARCH_MEMSET32
+extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
 static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
 {
 	return __memset32(p, v, n * 4);
 }
 
+#define __HAVE_ARCH_MEMSET64
+extern void *__memset64(uint64_t *, uint32_t low, __kernel_size_t, uint32_t hi);
 static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
 {
 	return __memset64(p, v, n * 8, v >> 32);
diff --git a/arch/arm/lib/memset.S b/arch/arm/lib/memset.S
index a835ff9ed30c..0b6cbaa25b33 100644
--- a/arch/arm/lib/memset.S
+++ b/arch/arm/lib/memset.S
@@ -21,12 +21,12 @@ ENTRY(memset)
 UNWIND( .fnstart         )
 	ands	r3, r0, #3		@ 1 unaligned?
 	mov	ip, r0			@ preserve r0 as return value
+	orr	r1, r1, r1, lsl #8
 	bne	6f			@ 1
 /*
  * we know that the pointer in ip is aligned to a word boundary.
  */
-1:	orr	r1, r1, r1, lsl #8
-	orr	r1, r1, r1, lsl #16
+1:	orr	r1, r1, r1, lsl #16
 	mov	r3, r1
 7:	cmp	r2, #16
 	blt	4f
@@ -114,12 +114,13 @@ UNWIND( .fnstart            )
 	tst	r2, #4
 	strne	r1, [ip], #4
 /*
- * When we get here, we've got less than 4 bytes to zero.  We
+ * When we get here, we've got less than 4 bytes to set.  We
  * may have an unaligned pointer as well.
  */
 5:	tst	r2, #2
+	movne	r3, r1, lsr #8		@ the top half of a 16-bit pattern
 	strneb	r1, [ip], #1
-	strneb	r1, [ip], #1
+	strneb	r3, [ip], #1
 	tst	r2, #1
 	strneb	r1, [ip], #1
 	ret	lr
@@ -136,6 +137,17 @@ UNWIND( .fnend   )
 ENDPROC(memset)
 ENDPROC(mmioset)
 
+ENTRY(__memset16)
+UNWIND( .fnstart         )
+	tst	r0, #2			@ pointer unaligned?
+	mov	ip, r0			@ preserve r0 as return value
+	movne	r3, r1, lsr #8		@ r3 = r1 >> 8
+	strneb	r1, [ip], #1
+	strneb	r3, [ip], #1
+	subne	r2, r2, #2
+	b	1b			@ jump into the middle of memset
+UNWIND( .fnend   )
+ENDPROC(__memset16)
 ENTRY(__memset32)
 UNWIND( .fnstart         )
 	mov	r3, r1			@ copy r1 to r3 and fall into memset64
diff --git a/arch/powerpc/include/asm/vga.h b/arch/powerpc/include/asm/vga.h
index ab3acd2f2786..1fcda81d0fac 100644
--- a/arch/powerpc/include/asm/vga.h
+++ b/arch/powerpc/include/asm/vga.h
@@ -33,6 +33,12 @@ static inline u16 scr_readw(volatile const u16 *addr)
 	return le16_to_cpu(*addr);
 }
 
+#define VT_BUF_HAVE_MEMSETW
+static inline void scr_memsetw(u16 *s, u16 v, unsigned int n)
+{
+	memset16(s, cpu_to_le16(v), n / 2);
+}
+
 #define VT_BUF_HAVE_MEMCPYW
 #define scr_memcpyw	memcpy
 
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 55614ccabb5c..84da91fe13ac 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -331,7 +331,19 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
 	 : __memset((s), (c), (count)))
 #endif
 
-#define __HAVE_ARCH_MEMSET_PLUS
+#define __HAVE_ARCH_MEMSET16
+static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
+{
+	int d0, d1;
+	asm volatile("rep\n\t"
+		     "stosw"
+		     : "=&c" (d0), "=&D" (d1)
+		     : "a" (v), "1" (s), "0" (n)
+		     : "memory");
+	return s;
+}
+
+#define __HAVE_ARCH_MEMSET32
 static inline void *memset32(uint32_t *s, uint32_t v, size_t n)
 {
 	int d0, d1;
@@ -343,8 +355,6 @@ static inline void *memset32(uint32_t *s, uint32_t v, size_t n)
 	return s;
 }
 
-extern void *memset64(uint64_t *s, uint64_t v, size_t n);
-
 /*
  * find the first occurrence of byte 'c', or 1 past the area if none
  */
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 43210320ea05..71c5e860c7da 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -56,10 +56,22 @@ extern void *__memcpy(void *to, const void *from, size_t len);
 void *memset(void *s, int c, size_t n);
 void *__memset(void *s, int c, size_t n);
 
-#define __HAVE_ARCH_MEMSET_PLUS
+#define __HAVE_ARCH_MEMSET16
+static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
+{
+	long d0, d1;
+	asm volatile("rep\n\t"
+		     "stosw"
+		     : "=&c" (d0), "=&D" (d1)
+		     : "a" (v), "1" (s), "0" (n)
+		     : "memory");
+	return s;
+}
+
+#define __HAVE_ARCH_MEMSET32
 static inline void *memset32(uint32_t *s, uint32_t v, size_t n)
 {
-	int d0, d1;
+	long d0, d1;
 	asm volatile("rep\n\t"
 		     "stosl"
 		     : "=&c" (d0), "=&D" (d1)
@@ -68,9 +80,10 @@ static inline void *memset32(uint32_t *s, uint32_t v, size_t n)
 	return s;
 }
 
+#define __HAVE_ARCH_MEMSET64
 static inline void *memset64(uint64_t *s, uint64_t v, size_t n)
 {
-	int d0, d1;
+	long d0, d1;
 	asm volatile("rep\n\t"
 		     "stosq"
 		     : "=&c" (d0), "=&D" (d1)
diff --git a/include/linux/string.h b/include/linux/string.h
index 087d4d7bafd4..148b88b6ea00 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -99,8 +99,16 @@ extern __kernel_size_t strcspn(const char *,const char *);
 #ifndef __HAVE_ARCH_MEMSET
 extern void * memset(void *,int,__kernel_size_t);
 #endif
-#ifndef __HAVE_ARCH_MEMSET_PLUS
+
+#ifndef __HAVE_ARCH_MEMSET16
+extern void *memset16(uint16_t *, uint16_t, __kernel_size_t);
+#endif
+
+#ifndef __HAVE_ARCH_MEMSET32
 extern void *memset32(uint32_t *, uint32_t, __kernel_size_t);
+#endif
+
+#ifndef __HAVE_ARCH_MEMSET64
 extern void *memset64(uint64_t *, uint64_t, __kernel_size_t);
 #endif
 
diff --git a/include/linux/vt_buffer.h b/include/linux/vt_buffer.h
index f38c10ba3ff5..fddb010be886 100644
--- a/include/linux/vt_buffer.h
+++ b/include/linux/vt_buffer.h
@@ -26,9 +26,13 @@
 #ifndef VT_BUF_HAVE_MEMSETW
 static inline void scr_memsetw(u16 *s, u16 c, unsigned int count)
 {
+#ifdef VT_BUF_HAVE_RW
 	count /= 2;
 	while (count--)
 		scr_writew(c, s++);
+#else
+	memset16(s, c, count / 2);
+#endif
 }
 #endif
 
diff --git a/lib/string.c b/lib/string.c
index d22711e6490a..1e74a89e0af5 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -697,7 +697,29 @@ void memzero_explicit(void *s, size_t count)
 }
 EXPORT_SYMBOL(memzero_explicit);
 
-#ifndef __HAVE_ARCH_MEMSET_PLUS
+#ifndef __HAVE_ARCH_MEMSET16
+/**
+ * memset16() - Fill a memory area with a uint16_t
+ * @s: Pointer to the start of the area.
+ * @v: The value to fill the area with
+ * @count: The number of values to store
+ *
+ * Differs from memset() in that it fills with a uint16_t instead
+ * of a byte.  Remember that @count is the number of uint16_ts to
+ * store, not the number of bytes.
+ */
+void *memset16(uint16_t *s, uint16_t v, size_t count)
+{
+	uint16_t *xs = s;
+
+	while (count--)
+		*xs++ = v;
+	return s;
+}
+EXPORT_SYMBOL(memset16);
+#endif
+
+#ifndef __HAVE_ARCH_MEMSET32
 /**
  * memset32() - Fill a memory area with a uint32_t
  * @s: Pointer to the start of the area.
@@ -717,7 +739,9 @@ void *memset32(uint32_t *s, uint32_t v, size_t count)
 	return s;
 }
 EXPORT_SYMBOL(memset32);
+#endif
 
+#ifndef __HAVE_ARCH_MEMSET64
 #if BITS_PER_LONG > 32
 /**
  * memset64() - Fill a memory area with a uint64_t

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Optimised memset64/memset32 for powerpc
  2017-03-22 19:30         ` Matthew Wilcox
@ 2017-03-27 19:37           ` Naveen N. Rao
  2017-03-27 19:37             ` [PATCH 1/2] powerpc: string: implement optimized memset variants Naveen N. Rao
  2017-03-27 19:37             ` [PATCH 2/2] powerpc: bpf: use memset32() to pre-fill traps in BPF page(s) Naveen N. Rao
  0 siblings, 2 replies; 20+ messages in thread
From: Naveen N. Rao @ 2017-03-27 19:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	Michael Ellerman, Anton Blanchard

On 2017/03/22 12:30PM, Matthew Wilcox wrote:
> On Wed, Mar 22, 2017 at 06:18:05AM -0700, Matthew Wilcox wrote:
> > There's one other set of potential users I've been wondering about: the
> > various console drivers.  They use 'memsetw' to blank the entire console
> > or lines of the console when scrolling, but the only architecture which
> > ever bothered implementing an optimised version of it was Alpha.
> > 
> > Might be worth it on powerpc actually ... better than a loop calling
> > cpu_to_le16() on each iteration.  That'd complete the set with a
> > memset16().
> 
> All hail plane rides ... This would need to be resplit and merged properly,
> but I think it makes life a little saner.

... not to forget train rides :)

Here's a straightforward implementation for powerpc64, along with one 
other user in bpf. It is obviously non-critical, but given that we have
64K pages on powerpc64, it does help to speed up the BPF JIT.

- Naveen

Naveen N. Rao (2):
  powerpc: string: implement optimized memset variants
  powerpc: bpf: use memset32() to pre-fill traps in BPF page(s)

 arch/powerpc/include/asm/string.h | 24 ++++++++++++++++++++++++
 arch/powerpc/lib/mem_64.S         | 19 ++++++++++++++++++-
 arch/powerpc/net/bpf_jit_comp64.c |  6 +-----
 3 files changed, 43 insertions(+), 6 deletions(-)

-- 
2.11.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-03-27 19:37           ` Naveen N. Rao
@ 2017-03-27 19:37             ` Naveen N. Rao
  2017-03-28  0:44               ` Michael Ellerman
  2017-08-18 12:50               ` [1/2] " Michael Ellerman
  2017-03-27 19:37             ` [PATCH 2/2] powerpc: bpf: use memset32() to pre-fill traps in BPF page(s) Naveen N. Rao
  1 sibling, 2 replies; 20+ messages in thread
From: Naveen N. Rao @ 2017-03-27 19:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	Michael Ellerman, Anton Blanchard

Based on Matthew Wilcox's patches for other architectures.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/string.h | 24 ++++++++++++++++++++++++
 arch/powerpc/lib/mem_64.S         | 19 ++++++++++++++++++-
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/string.h b/arch/powerpc/include/asm/string.h
index da3cdffca440..b0e82466d4e8 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -23,6 +23,30 @@ extern void * memmove(void *,const void *,__kernel_size_t);
 extern int memcmp(const void *,const void *,__kernel_size_t);
 extern void * memchr(const void *,int,__kernel_size_t);
 
+#ifdef CONFIG_PPC64
+#define __HAVE_ARCH_MEMSET16
+#define __HAVE_ARCH_MEMSET32
+#define __HAVE_ARCH_MEMSET64
+
+extern void *__memset16(uint16_t *, uint16_t v, __kernel_size_t);
+extern void *__memset32(uint32_t *, uint32_t v, __kernel_size_t);
+extern void *__memset64(uint64_t *, uint64_t v, __kernel_size_t);
+
+static inline void *memset16(uint16_t *p, uint16_t v, __kernel_size_t n)
+{
+	return __memset16(p, v, n * 2);
+}
+
+static inline void *memset32(uint32_t *p, uint32_t v, __kernel_size_t n)
+{
+	return __memset32(p, v, n * 4);
+}
+
+static inline void *memset64(uint64_t *p, uint64_t v, __kernel_size_t n)
+{
+	return __memset64(p, v, n * 8);
+}
+#endif
 #endif /* __KERNEL__ */
 
 #endif	/* _ASM_POWERPC_STRING_H */
diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
index 85fa9869aec5..ec531de99996 100644
--- a/arch/powerpc/lib/mem_64.S
+++ b/arch/powerpc/lib/mem_64.S
@@ -13,6 +13,23 @@
 #include <asm/ppc_asm.h>
 #include <asm/export.h>
 
+_GLOBAL(__memset16)
+	rlwimi	r4,r4,16,0,15
+	/* fall through */
+
+_GLOBAL(__memset32)
+	rldimi	r4,r4,32,0
+	/* fall through */
+
+_GLOBAL(__memset64)
+	neg	r0,r3
+	andi.	r0,r0,7
+	cmplw	cr1,r5,r0
+	b	.Lms
+EXPORT_SYMBOL(__memset16)
+EXPORT_SYMBOL(__memset32)
+EXPORT_SYMBOL(__memset64)
+
 _GLOBAL(memset)
 	neg	r0,r3
 	rlwimi	r4,r4,8,16,23
@@ -20,7 +37,7 @@ _GLOBAL(memset)
 	rlwimi	r4,r4,16,0,15
 	cmplw	cr1,r5,r0		/* do we get that far? */
 	rldimi	r4,r4,32,0
-	PPC_MTOCRF(1,r0)
+.Lms:	PPC_MTOCRF(1,r0)
 	mr	r6,r3
 	blt	cr1,8f
 	beq+	3f			/* if already 8-byte aligned */
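
For those who don't read PowerPC assembly: the rlwimi/rldimi pair at the
new entry points just replicates the 16- or 32-bit value across the whole
64-bit register before falling into the common fill path at .Lms. In C
terms (a sketch, not part of the patch):

#include <stdint.h>

static inline uint64_t replicate16(uint16_t v)
{
	uint32_t w = ((uint32_t)v << 16) | v;	/* rlwimi r4,r4,16,0,15 */
	return ((uint64_t)w << 32) | w;		/* rldimi r4,r4,32,0 */
}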
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] powerpc: bpf: use memset32() to pre-fill traps in BPF page(s)
  2017-03-27 19:37           ` Naveen N. Rao
  2017-03-27 19:37             ` [PATCH 1/2] powerpc: string: implement optimized memset variants Naveen N. Rao
@ 2017-03-27 19:37             ` Naveen N. Rao
  1 sibling, 0 replies; 20+ messages in thread
From: Naveen N. Rao @ 2017-03-27 19:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
	Michael Ellerman, Anton Blanchard

Use the newly introduced memset32() to pre-fill BPF page(s) with trap
instructions.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
 arch/powerpc/net/bpf_jit_comp64.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index aee2bb817ac6..1a92ab6c245f 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -25,11 +25,7 @@ int bpf_jit_enable __read_mostly;
 
 static void bpf_jit_fill_ill_insns(void *area, unsigned int size)
 {
-	int *p = area;
-
-	/* Fill whole space with trap instructions */
-	while (p < (int *)((char *)area + size))
-		*p++ = BREAKPOINT_INSTRUCTION;
+	memset32(area, BREAKPOINT_INSTRUCTION, size/4);
 }
 
 static inline void bpf_flush_icache(void *start, void *end)
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-03-27 19:37             ` [PATCH 1/2] powerpc: string: implement optimized memset variants Naveen N. Rao
@ 2017-03-28  0:44               ` Michael Ellerman
  2017-03-28 10:21                 ` Naveen N. Rao
  2017-08-18 12:50               ` [1/2] " Michael Ellerman
  1 sibling, 1 reply; 20+ messages in thread
From: Michael Ellerman @ 2017-03-28  0:44 UTC (permalink / raw)
  To: Naveen N. Rao, Matthew Wilcox
  Cc: Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev, Anton Blanchard

"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:

> diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
> index 85fa9869aec5..ec531de99996 100644
> --- a/arch/powerpc/lib/mem_64.S
> +++ b/arch/powerpc/lib/mem_64.S
> @@ -13,6 +13,23 @@
>  #include <asm/ppc_asm.h>
>  #include <asm/export.h>
>  
> +_GLOBAL(__memset16)
> +	rlwimi	r4,r4,16,0,15
> +	/* fall through */
> +
> +_GLOBAL(__memset32)
> +	rldimi	r4,r4,32,0
> +	/* fall through */
> +
> +_GLOBAL(__memset64)
> +	neg	r0,r3
> +	andi.	r0,r0,7
> +	cmplw	cr1,r5,r0
> +	b	.Lms
> +EXPORT_SYMBOL(__memset16)
> +EXPORT_SYMBOL(__memset32)
> +EXPORT_SYMBOL(__memset64)

You'll have to convince me that's better than what GCC produces.

cheers

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-03-28  0:44               ` Michael Ellerman
@ 2017-03-28 10:21                 ` Naveen N. Rao
  2017-03-29 11:36                   ` Michael Ellerman
  0 siblings, 1 reply; 20+ messages in thread
From: Naveen N. Rao @ 2017-03-28 10:21 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Matthew Wilcox, Paul Mackerras, linuxppc-dev, Anton Blanchard

On 2017/03/28 11:44AM, Michael Ellerman wrote:
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> 
> > diff --git a/arch/powerpc/lib/mem_64.S b/arch/powerpc/lib/mem_64.S
> > index 85fa9869aec5..ec531de99996 100644
> > --- a/arch/powerpc/lib/mem_64.S
> > +++ b/arch/powerpc/lib/mem_64.S
> > @@ -13,6 +13,23 @@
> >  #include <asm/ppc_asm.h>
> >  #include <asm/export.h>
> >  
> > +_GLOBAL(__memset16)
> > +	rlwimi	r4,r4,16,0,15
> > +	/* fall through */
> > +
> > +_GLOBAL(__memset32)
> > +	rldimi	r4,r4,32,0
> > +	/* fall through */
> > +
> > +_GLOBAL(__memset64)
> > +	neg	r0,r3
> > +	andi.	r0,r0,7
> > +	cmplw	cr1,r5,r0
> > +	b	.Lms
> > +EXPORT_SYMBOL(__memset16)
> > +EXPORT_SYMBOL(__memset32)
> > +EXPORT_SYMBOL(__memset64)
> 
> You'll have to convince me that's better than what GCC produces.

Sure :) I got lazy last night and didn't post the test results...

I hadn't tested zram yesterday; I had only done tests with a naive test 
module that memsets a large 1GB buffer with integers. With that test, I 
saw:

without patch:       0.389253910 seconds time elapsed    ( +-  1.49% )
with patch:          0.173269267 seconds time elapsed    ( +-  1.55% )

... which is better than 2x.

I also tested zram today with the command shared by Wilcox:

without patch:       1.493782568 seconds time elapsed    ( +-  0.08% )
with patch:          1.408457577 seconds time elapsed    ( +-  0.15% )

... which also shows an improvement along the same lines as x86, as 
reported by Minchan Kim.


- Naveen

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-03-28 10:21                 ` Naveen N. Rao
@ 2017-03-29 11:36                   ` Michael Ellerman
  2017-03-30  7:16                     ` Naveen N. Rao
  0 siblings, 1 reply; 20+ messages in thread
From: Michael Ellerman @ 2017-03-29 11:36 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Matthew Wilcox, Paul Mackerras, linuxppc-dev, Anton Blanchard

"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> I also tested zram today with the command shared by Wilcox:
>
> without patch:       1.493782568 seconds time elapsed    ( +-  0.08% )
> with patch:          1.408457577 seconds time elapsed    ( +-  0.15% )
>
> ... which also shows an improvement along the same lines as x86, as 
> reported by Minchan Kim.

I got:

  1.344847397 seconds time elapsed                                          ( +-  0.13% )

Using the C versions. Can you also benchmark those on your setup so we
can compare? So basically apply Matt's series but not your 2.

cheers

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-03-29 11:36                   ` Michael Ellerman
@ 2017-03-30  7:16                     ` Naveen N. Rao
  2017-04-04 12:00                       ` Michael Ellerman
  2017-04-05  5:51                       ` PrasannaKumar Muralidharan
  0 siblings, 2 replies; 20+ messages in thread
From: Naveen N. Rao @ 2017-03-30  7:16 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul Mackerras, linuxppc-dev, Anton Blanchard, Matthew Wilcox

On 2017/03/29 10:36PM, Michael Ellerman wrote:
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> > I also tested zram today with the command shared by Wilcox:
> >
> > without patch:       1.493782568 seconds time elapsed    ( +-  0.08% )
> > with patch:          1.408457577 seconds time elapsed    ( +-  0.15% )
> >
> > ... which also shows an improvement along the same lines as x86, as 
> > reported by Minchan Kim.
> 
> I got:
> 
>   1.344847397 seconds time elapsed                                          ( +-  0.13% )
> 
> Using the C versions. Can you also benchmark those on your setup so we
> can compare? So basically apply Matt's series but not your 2.

Ok, with a more comprehensive test:
	$ sudo modprobe zram
	$ sudo zramctl -f -s 1G
	# ~/tmp/1g has repeated 8 byte patterns
	$ sudo bash -c "cat ~/tmp/1g > /dev/zram0"

Here are the results I got on a P8 vm with:
	$ sudo ./perf stat -r 10 taskset -c 16-23 dd if=/dev/zram0 of=/dev/null

vanilla:	1.770592578 seconds time elapsed	( +-  0.07% )
generic:	1.728865141 seconds time elapsed	( +-  0.06% )
optimized:	1.695363255 seconds time elapsed        ( +-  0.10% )

(generic) is with Matt's arch-independent patches applied. Profiling 
indicates that most of the overhead is actually with the lzo 
decompression...

Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here 
are the results:
generic:	0.245315533 seconds time elapsed	( +-  1.83% )
optimized:	0.169282701 seconds time elapsed	( +-  1.96% )
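
(The test module is nothing special; a sketch of it, with hypothetical 
names, looks like this:)

#include <linux/module.h>
#include <linux/sizes.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

static int __init memset64_test_init(void)
{
	u64 *buf = vmalloc(SZ_1G);

	if (!buf)
		return -ENOMEM;
	memset64(buf, 0x0123456789abcdefULL, SZ_1G / sizeof(u64));
	vfree(buf);
	return 0;
}
module_init(memset64_test_init);
MODULE_LICENSE("GPL");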


- Naveen

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-03-30  7:16                     ` Naveen N. Rao
@ 2017-04-04 12:00                       ` Michael Ellerman
  2017-04-18  6:45                         ` Michael Ellerman
  2017-04-05  5:51                       ` PrasannaKumar Muralidharan
  1 sibling, 1 reply; 20+ messages in thread
From: Michael Ellerman @ 2017-04-04 12:00 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Paul Mackerras, linuxppc-dev, Anton Blanchard, Matthew Wilcox

"Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
> (generic) is with Matt's arch-independent patches applied. Profiling 
> indicates that most of the overhead is actually with the lzo 
> decompression...
>
> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here 
> are the results:
> generic:	0.245315533 seconds time elapsed	( +-  1.83% )
> optimized:	0.169282701 seconds time elapsed	( +-  1.96% )

Great, that's pretty conclusive.

I'm pretty sure I can take these 2 patches independently of Matt's
series, they just won't be used by much until his series goes in, so
I'll do that unless someone yells.

cheers

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-03-30  7:16                     ` Naveen N. Rao
  2017-04-04 12:00                       ` Michael Ellerman
@ 2017-04-05  5:51                       ` PrasannaKumar Muralidharan
  2017-04-12 15:05                         ` Naveen N. Rao
  1 sibling, 1 reply; 20+ messages in thread
From: PrasannaKumar Muralidharan @ 2017-04-05  5:51 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Michael Ellerman, Paul Mackerras, linuxppc-dev, Anton Blanchard,
	Matthew Wilcox

On 30 March 2017 at 12:46, Naveen N. Rao
<naveen.n.rao@linux.vnet.ibm.com> wrote:
> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
> are the results:
> generic:        0.245315533 seconds time elapsed        ( +-  1.83% )
> optimized:      0.169282701 seconds time elapsed        ( +-  1.96% )

Wondering what makes gcc not produce efficient assembly code. Can
you please post the disassembly of the C implementation of memset64? Just
for info purposes.

Thanks,
Prasanna

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-04-05  5:51                       ` PrasannaKumar Muralidharan
@ 2017-04-12 15:05                         ` Naveen N. Rao
  0 siblings, 0 replies; 20+ messages in thread
From: Naveen N. Rao @ 2017-04-12 15:05 UTC (permalink / raw)
  To: PrasannaKumar Muralidharan
  Cc: Anton Blanchard, linuxppc-dev, Michael Ellerman, Paul Mackerras,
	Matthew Wilcox

Excerpts from PrasannaKumar Muralidharan's message of April 5, 2017 11:21:
> On 30 March 2017 at 12:46, Naveen N. Rao
> <naveen.n.rao@linux.vnet.ibm.com> wrote:
>> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here
>> are the results:
>> generic:        0.245315533 seconds time elapsed        ( +-  1.83% )
>> optimized:      0.169282701 seconds time elapsed        ( +-  1.96% )
> 
> Wondering what makes gcc not produce efficient assembly code. Can
> you please post the disassembly of the C implementation of memset64? Just
> for info purposes.

It's largely the same as what Christophe posted for powerpc32.

Others will have better insights, but afaics, gcc only seems to be 
unrolling the loop with -funroll-loops (which we don't use).

As an aside, it looks like gcc recently picked up an optimization in v7 
that can also help (from https://gcc.gnu.org/gcc-7/changes.html):
"A new store merging pass has been added. It merges constant stores to 
adjacent memory locations into fewer, wider, stores. It is enabled by 
the -fstore-merging option and at the -O2 optimization level or higher 
(and -Os)."


- Naveen


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] powerpc: string: implement optimized memset variants
  2017-04-04 12:00                       ` Michael Ellerman
@ 2017-04-18  6:45                         ` Michael Ellerman
  0 siblings, 0 replies; 20+ messages in thread
From: Michael Ellerman @ 2017-04-18  6:45 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: Paul Mackerras, linuxppc-dev, Anton Blanchard, Matthew Wilcox

Michael Ellerman <mpe@ellerman.id.au> writes:

> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> writes:
>> (generic) is with Matt's arch-independent patches applied. Profiling 
>> indicates that most of the overhead is actually with the lzo 
>> decompression...
>>
>> Also, with a simple module to memset64() a 1GB vmalloc'ed buffer, here 
>> are the results:
>> generic:	0.245315533 seconds time elapsed	( +-  1.83% )
>> optimized:	0.169282701 seconds time elapsed	( +-  1.96% )
>
> Great, that's pretty conclusive.
>
> I'm pretty sure I can take these 2 patches independently of Matt's
> series, they just won't be used by much until his series goes in, so
> I'll do that unless someone yells.

Hmm, just went to merge these, but I don't see Matt's series in
linux-next, so I'll hold off for now.

cheers

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [1/2] powerpc: string: implement optimized memset variants
  2017-03-27 19:37             ` [PATCH 1/2] powerpc: string: implement optimized memset variants Naveen N. Rao
  2017-03-28  0:44               ` Michael Ellerman
@ 2017-08-18 12:50               ` Michael Ellerman
  1 sibling, 0 replies; 20+ messages in thread
From: Michael Ellerman @ 2017-08-18 12:50 UTC (permalink / raw)
  To: Naveen N. Rao, Matthew Wilcox
  Cc: Paul Mackerras, linuxppc-dev, Anton Blanchard

On Mon, 2017-03-27 at 19:37:40 UTC, "Naveen N. Rao" wrote:
> Based on Matthew Wilcox's patches for other architectures.
> 
> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/694fc88ce271fd48f7939c032c1247

cheers

^ permalink raw reply	[flat|nested] 20+ messages in thread
