* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
@ 2015-12-22 23:32 Andrew Pinski
2016-01-06 16:31 ` Will Deacon
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Pinski @ 2015-12-22 23:32 UTC (permalink / raw)
To: linux-arm-kernel, Arnd Bergmann; +Cc: Will Deacon, Andrew Pinski, pinsia, LKML
On Tue, Dec 21, 2015 at 5:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Monday 21 December 2015, Will Deacon wrote:
>> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
>> > Adding a check for the cache line size is not much overhead.
>> > Special case 128 byte cache line size.
>> > This improves copy_page by 85% on ThunderX compared to the original
>> > implementation.
>>
>> So this patch seems to:
>>
>> - Align the loop
>> - Increase the prefetch size
>> - Unroll the loop once
>>
>> Do you know where your 85% boost comes from between these? I'd really
>> like to avoid having multiple versions of copy_page, if possible, but
>> maybe we could end up with something that works well enough regardless
>> of cacheline size. Understanding what your bottleneck is would help to
>> lead us in the right direction.
I think it is the prefetching. ThunderX T88 pass 1 and pass 2 do
not have a hardware prefetcher, so prefetching half a cacheline
ahead does not help at all.
>>
>> Also, how are you measuring the improvement? If you can share your
>> test somewhere, I can see how it affects the other systems I have
>> access to.
You can find my benchmark at
https://github.com/apinski-cavium/copy_page_benchmark .
copy_page is my previous patch.
copy_page128 is just the unrolled version with only 128-byte prefetching.
copy_page64 is the original code.
copy_page64unroll is the new patch, which I will be sending out soon.
>
> A related question would be how other CPU cores are affected by the change.
> The test for the cache line size is going to take a few cycles, possibly a lot on certain implementations, e.g. if we ever get one where 'mrs' is microcoded or trapped by a hypervisor.
>
> Are there any possible downsides to using the ThunderX version on other microarchitectures too and skip the check?
Yes, that is a good idea. I will send out a new patch shortly
which just unrolls the loop while keeping the two prefetch
instructions.
Thanks,
Andrew Pinski
>
> Arnd
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
2015-12-22 23:32 [PATCH] ARM64: Improve copy_page for 128 cache line sizes Andrew Pinski
@ 2016-01-06 16:31 ` Will Deacon
0 siblings, 0 replies; 9+ messages in thread
From: Will Deacon @ 2016-01-06 16:31 UTC (permalink / raw)
To: Andrew Pinski
Cc: linux-arm-kernel, Arnd Bergmann, Andrew Pinski, pinsia, LKML
Hi Andrew,
On Tue, Dec 22, 2015 at 03:32:19PM -0800, Andrew Pinski wrote:
> On Tue, Dec 21, 2015 at 5:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday 21 December 2015, Will Deacon wrote:
> >> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> I think it is the prefetching. ThunderX T88 pass 1 and pass 2 do
> not have a hardware prefetcher, so prefetching half a cacheline
> ahead does not help at all.
>
> >>
> >> Also, how are you measuring the improvement? If you can share your
> >> test somewhere, I can see how it affects the other systems I have
> >> access to.
>
> You can find my benchmark at
> https://github.com/apinski-cavium/copy_page_benchmark .
> copy_page is my previous patch.
> copy_page128 is just the unrolled version with only 128-byte prefetching.
> copy_page64 is the original code.
> copy_page64unroll is the new patch, which I will be sending out soon.
Thanks, this was really helpful to evaluate the different versions on
the Cortex-A* cores I've got on my desk. Doing so showed that, in fact,
having explicit prfm instructions tends to be *harmful* for us -- the
hardware prefetcher is actually doing a much better job on its own.
Now, I still maintain that we don't want lots of different copy_page
implementations, but I'm not averse to patching a nop with a prfm on
cores that benefit from software-driven prefetching. We could hang it
off the alternatives framework that we have already.
> > Are there any possible downsides to using the ThunderX version on other
> > microarchitectures too and skip the check?
>
> Yes, that is a good idea. I will send out a new patch shortly
> which just unrolls the loop while keeping the two prefetch
> instructions.
copy_page64unroll didn't perform well on all of my systems. The code
below was the best all-rounder I could come up with. Do you reckon you
could try taking it and adding prefetches to see if you can make it fly
on ThunderX?
Cheers,
Will
--->8
ENTRY(copy_page)
ldp x2, x3, [x1]
ldp x4, x5, [x1, #16]
ldp x6, x7, [x1, #32]
ldp x8, x9, [x1, #48]
ldp x10, x11, [x1, #64]
ldp x12, x13, [x1, #80]
ldp x14, x15, [x1, #96]
ldp x16, x17, [x1, #112]
mov x18, #(PAGE_SIZE - 128)
add x1, x1, #128
1:
subs x18, x18, #128
stnp x2, x3, [x0]
ldp x2, x3, [x1]
stnp x4, x5, [x0, #16]
ldp x4, x5, [x1, #16]
stnp x6, x7, [x0, #32]
ldp x6, x7, [x1, #32]
stnp x8, x9, [x0, #48]
ldp x8, x9, [x1, #48]
stnp x10, x11, [x0, #64]
ldp x10, x11, [x1, #64]
stnp x12, x13, [x0, #80]
ldp x12, x13, [x1, #80]
stnp x14, x15, [x0, #96]
ldp x14, x15, [x1, #96]
stnp x16, x17, [x0, #112]
ldp x16, x17, [x1, #112]
add x0, x0, #128
add x1, x1, #128
b.gt 1b
stnp x2, x3, [x0]
stnp x4, x5, [x0, #16]
stnp x6, x7, [x0, #32]
stnp x8, x9, [x0, #48]
stnp x10, x11, [x0, #64]
stnp x12, x13, [x0, #80]
stnp x14, x15, [x0, #96]
stnp x16, x17, [x0, #112]
ret
ENDPROC(copy_page)
* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
2015-12-21 12:46 ` Will Deacon
@ 2015-12-21 13:42 ` Arnd Bergmann
-1 siblings, 0 replies; 9+ messages in thread
From: Arnd Bergmann @ 2015-12-21 13:42 UTC (permalink / raw)
To: linux-arm-kernel; +Cc: Will Deacon, Andrew Pinski, pinsia, linux-kernel
On Monday 21 December 2015, Will Deacon wrote:
> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> > Adding a check for the cache line size is not much overhead.
> > Special case 128 byte cache line size.
> > This improves copy_page by 85% on ThunderX compared to the
> > original implementation.
>
> So this patch seems to:
>
> - Align the loop
> - Increase the prefetch size
> - Unroll the loop once
>
> Do you know where your 85% boost comes from between these? I'd really
> like to avoid having multiple versions of copy_page, if possible, but
> maybe we could end up with something that works well enough regardless
> of cacheline size. Understanding what your bottleneck is would help to
> lead us in the right direction.
>
> Also, how are you measuring the improvement? If you can share your
> test somewhere, I can see how it affects the other systems I have access
> to.
A related question would be how other CPU cores are affected by the change.
The test for the cache line size is going to take a few cycles, possibly
a lot on certain implementations, e.g. if we ever get one where 'mrs' is
microcoded or trapped by a hypervisor.
Are there any possible downsides to using the ThunderX version on other
microarchitectures too and skip the check?
Arnd
* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
2015-12-20 0:11 ` Andrew Pinski
@ 2015-12-21 12:46 ` Will Deacon
-1 siblings, 0 replies; 9+ messages in thread
From: Will Deacon @ 2015-12-21 12:46 UTC (permalink / raw)
To: Andrew Pinski; +Cc: pinsia, linux-arm-kernel, linux-kernel
On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> Adding a check for the cache line size is not much overhead.
> Special case 128 byte cache line size.
> This improves copy_page by 85% on ThunderX compared to the
> original implementation.
So this patch seems to:
- Align the loop
- Increase the prefetch size
- Unroll the loop once
Do you know where your 85% boost comes from between these? I'd really
like to avoid having multiple versions of copy_page, if possible, but
maybe we could end up with something that works well enough regardless
of cacheline size. Understanding what your bottleneck is would help to
lead us in the right direction.
Also, how are you measuring the improvement? If you can share your
test somewhere, I can see how it affects the other systems I have access
to.
Will
* [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
@ 2015-12-20 0:11 ` Andrew Pinski
0 siblings, 0 replies; 9+ messages in thread
From: Andrew Pinski @ 2015-12-20 0:11 UTC (permalink / raw)
To: pinsia, linux-arm-kernel, linux-kernel; +Cc: Andrew Pinski
Adding a check for the cache line size is not much overhead.
Special-case the 128-byte cache line size.
This improves copy_page by 85% on ThunderX compared to the
original implementation.
For LMBench, it improves performance by 4-10%.
Signed-off-by: Andrew Pinski <apinski@cavium.com>
---
arch/arm64/lib/copy_page.S | 39 +++++++++++++++++++++++++++++++++++++++
1 files changed, 39 insertions(+), 0 deletions(-)
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 512b9a7..4c28789 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -18,6 +18,7 @@
#include <linux/const.h>
#include <asm/assembler.h>
#include <asm/page.h>
+#include <asm/cachetype.h>
/*
* Copy a page from src to dest (both are page aligned)
@@ -27,8 +28,17 @@
* x1 - src
*/
ENTRY(copy_page)
+ /* Special case 128 byte or more cache lines */
+ mrs x2, ctr_el0
+ lsr x2, x2, CTR_CWG_SHIFT
+ and w2, w2, CTR_CWG_MASK
+ cmp w2, 5
+ b.ge 2f
+
/* Assume cache line size is 64 bytes. */
prfm pldl1strm, [x1, #64]
+ /* Align the loop so it fits in one cache line. */
+ .balign 64
1: ldp x2, x3, [x1]
ldp x4, x5, [x1, #16]
ldp x6, x7, [x1, #32]
@@ -43,4 +53,33 @@ ENTRY(copy_page)
tst x1, #(PAGE_SIZE - 1)
b.ne 1b
ret
+
+2:
+ /* The cache line size is at least 128 bytes. */
+ prfm pldl1strm, [x1, #128]
+ /* Align the loop so it fits in one cache line */
+ .balign 128
+1: prfm pldl1strm, [x1, #256]
+ ldp x2, x3, [x1]
+ ldp x4, x5, [x1, #16]
+ ldp x6, x7, [x1, #32]
+ ldp x8, x9, [x1, #48]
+ stnp x2, x3, [x0]
+ stnp x4, x5, [x0, #16]
+ stnp x6, x7, [x0, #32]
+ stnp x8, x9, [x0, #48]
+
+ ldp x2, x3, [x1, #64]
+ ldp x4, x5, [x1, #80]
+ ldp x6, x7, [x1, #96]
+ ldp x8, x9, [x1, #112]
+ add x1, x1, #128
+ stnp x2, x3, [x0, #64]
+ stnp x4, x5, [x0, #80]
+ stnp x6, x7, [x0, #96]
+ stnp x8, x9, [x0, #112]
+ add x0, x0, #128
+ tst x1, #(PAGE_SIZE - 1)
+ b.ne 1b
+ ret
ENDPROC(copy_page)
--
1.7.2.5