* [PATCH v2] powerpc: Speed up clear_page by unrolling it
@ 2014-10-02 5:44 Anton Blanchard
2014-10-02 14:17 ` Segher Boessenkool
0 siblings, 1 reply; 2+ messages in thread
From: Anton Blanchard @ 2014-10-02 5:44 UTC (permalink / raw)
To: benh, paulus, mpe; +Cc: linuxppc-dev
Unroll clear_page 8 times. A simple microbenchmark which
allocates and frees a zeroed page:

	for (i = 0; i < iterations; i++) {
		unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
		free_page(p);
	}

improves 20% on POWER8.
This assumes cacheline sizes won't grow beyond 512 bytes or
page sizes won't drop below 1kB, which is unlikely, but we could
add a runtime check during early init if it makes people nervous.
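
For illustration only, the runtime check mentioned above could look roughly like the sketch below. The helper name clear_page_unroll_ok() and its parameters are hypothetical and not part of the patch; the sketch only encodes the arithmetic condition that the number of cache lines per page is a non-zero multiple of 8, so that dlines_per_page / 8 is an exact, non-zero loop count.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: returns true when an
 * 8x unrolled clear loop would cover exactly one page.
 */
static bool clear_page_unroll_ok(size_t page_size, size_t line_size)
{
	size_t lines;

	/* Reject a zero line size and pages that are not a whole number of lines. */
	if (line_size == 0 || page_size % line_size != 0)
		return false;

	lines = page_size / line_size;

	/* iterations = lines / 8 must be non-zero and leave no remainder. */
	return lines >= 8 && lines % 8 == 0;
}
```

With 128 byte lines this accepts 1kB, 4kB and 64kB pages; with 512 byte lines a 4kB page is the smallest that passes, which matches the limits stated above.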
Michael found that some versions of gcc produce quite bad code
(all multiplies), so we give gcc a hand by using shifts and adds.
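
As a small standalone sketch of the shift-and-add idea, the function below builds the eight per-line offsets and the stride from one line size without any multiply. The name unroll_offsets() and the array layout are assumptions made for this example; the patch itself passes these values straight to the asm block as operands.

```c
#include <stddef.h>

/*
 * Hypothetical helper: fills off[0..7] with 0, 1, ..., 7 times the
 * line size using only shifts, adds and one subtract, and returns
 * the stride (8 times the line size).
 */
static size_t unroll_offsets(size_t line, size_t off[8])
{
	size_t onex = line;
	size_t twox = onex << 1;
	size_t fourx = onex << 2;
	size_t eightx = onex << 3;

	off[0] = 0;
	off[1] = onex;
	off[2] = twox;
	off[3] = twox + onex;
	off[4] = fourx;
	off[5] = fourx + onex;
	off[6] = twox + fourx;
	off[7] = eightx - onex;

	return eightx;
}
```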
Signed-off-by: Anton Blanchard <anton@samba.org>
---
arch/powerpc/include/asm/page_64.h | 42 ++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 11 deletions(-)
diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h
index d0d6afb..d908a46 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -42,20 +42,40 @@
typedef unsigned long pte_basic_t;
-static __inline__ void clear_page(void *addr)
+static inline void clear_page(void *addr)
{
- unsigned long lines, line_size;
-
- line_size = ppc64_caches.dline_size;
- lines = ppc64_caches.dlines_per_page;
-
- __asm__ __volatile__(
+ unsigned long iterations;
+ unsigned long onex, twox, fourx, eightx;
+
+ iterations = ppc64_caches.dlines_per_page / 8;
+
+ /*
+ * Some versions of gcc use multiply instructions to
+ * calculate the offsets so let's give it a hand to
+ * do better.
+ */
+ onex = ppc64_caches.dline_size;
+ twox = onex << 1;
+ fourx = onex << 2;
+ eightx = onex << 3;
+
+ asm volatile(
"mtctr %1 # clear_page\n\
-1: dcbz 0,%0\n\
- add %0,%0,%3\n\
+ .balign 16\n\
+1: dcbz 0,%0\n\
+ dcbz %3,%0\n\
+ dcbz %4,%0\n\
+ dcbz %5,%0\n\
+ dcbz %6,%0\n\
+ dcbz %7,%0\n\
+ dcbz %8,%0\n\
+ dcbz %9,%0\n\
+ add %0,%0,%10\n\
bdnz+ 1b"
- : "=r" (addr)
- : "r" (lines), "0" (addr), "r" (line_size)
+ : "=&r" (addr)
+ : "r" (iterations), "0" (addr), "b" (onex), "b" (twox),
+ "b" (twox+onex), "b" (fourx), "b" (fourx+onex),
+ "b" (twox+fourx), "b" (eightx-onex), "r" (eightx)
: "ctr", "memory");
}
--
1.9.1
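
The control flow of the unrolled loop in the patch above can be modelled in portable user-space C, purely as an aid to reading the asm. This is a sketch under stated assumptions: clear_page_model() is a hypothetical name, and memset() over one line stands in for a single dcbz, so it says nothing about performance.

```c
#include <stddef.h>
#include <string.h>

/*
 * User-space model of the 8x unrolled loop. Each memset() of one
 * line stands in for one dcbz; the pointer advances by eight lines
 * per iteration, as the final add in the asm does.
 */
static void clear_page_model(void *addr, size_t line_size,
			     size_t lines_per_page)
{
	unsigned char *p = addr;
	size_t iterations = lines_per_page / 8;
	size_t stride = line_size << 3;

	/* Unlike the asm, this model simply does nothing for zero iterations. */
	while (iterations--) {
		size_t off = 0;
		int k;

		for (k = 0; k < 8; k++) {
			memset(p + off, 0, line_size);
			off += line_size;
		}
		p += stride;
	}
}
```

Calling clear_page_model(buf, 128, 32) clears exactly 4096 bytes starting at buf and leaves anything after that untouched.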
* Re: [PATCH v2] powerpc: Speed up clear_page by unrolling it
From: Segher Boessenkool @ 2014-10-02 14:17 UTC (permalink / raw)
To: Anton Blanchard; +Cc: paulus, linuxppc-dev
On Thu, Oct 02, 2014 at 03:44:21PM +1000, Anton Blanchard wrote:
> This assumes cacheline sizes won't grow beyond 512 bytes or
> page sizes won't drop below 1kB,
Or a combination of those.
> Michael found that some versions of gcc produce quite bad code
> (all multiplies), so we give gcc a hand by using shifts and adds.
You can make the code a lot less cluttered as well as making the
generated code independent of compiler version by writing the setup
of twox..eightx in the asm block itself.
Segher