* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
@ 2015-12-22 23:32 Andrew Pinski
  2016-01-06 16:31   ` Will Deacon
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Pinski @ 2015-12-22 23:32 UTC (permalink / raw)
  To: linux-arm-kernel, Arnd Bergmann; +Cc: Will Deacon, Andrew Pinski, pinsia, LKML

On Tue, Dec 21, 2015 at 5:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Monday 21 December 2015, Will Deacon wrote:
>> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
>> > Adding a check for the cache line size is not much overhead.
>> > Special case 128 byte cache line size.
>> > This improves copy_page by 85% on ThunderX compared to the original
>> > implementation.
>>
>> So this patch seems to:
>>
>>   - Align the loop
>>   - Increase the prefetch size
>>   - Unroll the loop once
>>
>> Do you know where your 85% boost comes from between these? I'd really
>> like to avoid having multiple versions of copy_page, if possible, but
>> maybe we could end up with something that works well enough regardless
>> of cacheline size. Understanding what your bottleneck is would help to
>> lead us in the right direction.

I think it is the prefetching.  ThunderX T88 pass 1 and pass 2 do
not have a hardware prefetcher, so prefetching half a cache line
ahead does not help at all.
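
For illustration only (the actual distances are in the patch quoted at the
bottom of the thread), on a 128-byte line the software prefetch needs to
run at least a full line or two ahead of the loads:

	prfm	pldl1strm, [x1, #128]	// next 128-byte line
	prfm	pldl1strm, [x1, #256]	// and the line after that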

>>
>> Also, how are you measuring the improvement? If you can share your
>> test somewhere, I can see how it affects the other systems I have
>> access to.

You can find my benchmark at
https://github.com/apinski-cavium/copy_page_benchmark .
copy_page is my previous patch.
copy_page128 is just the unrolled loop with only the 128-byte prefetching.
copy_page64 is the original code.
copy_page64unroll is the new patch which I will be sending out soon.

>
> A related question would be how other CPU cores are affected by the change.
> The test for the cache line size is going to take a few cycles, possibly a
> lot on certain implementations, e.g. if we ever get one where 'mrs' is
> microcoded or trapped by a hypervisor.
>
> Are there any possible downsides to using the ThunderX version on other
> microarchitectures too and skipping the check?

Yes, that is a good idea.  I will send out a new patch in a little bit
which just unrolls the loop while keeping the two prefetch
instructions in there.

Thanks,
Andrew Pinski

>
>         Arnd


* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
  2015-12-22 23:32 [PATCH] ARM64: Improve copy_page for 128 cache line sizes Andrew Pinski
@ 2016-01-06 16:31   ` Will Deacon
  0 siblings, 0 replies; 9+ messages in thread
From: Will Deacon @ 2016-01-06 16:31 UTC (permalink / raw)
  To: Andrew Pinski
  Cc: linux-arm-kernel, Arnd Bergmann, Andrew Pinski, pinsia, LKML

Hi Andrew,

On Tue, Dec 22, 2015 at 03:32:19PM -0800, Andrew Pinski wrote:
> On Tue, Dec 21, 2015 at 5:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday 21 December 2015, Will Deacon wrote:
> >> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> I think it is the prefetching.  ThunderX T88 pass 1 and pass 2 do
> not have a hardware prefetcher, so prefetching half a cache line
> ahead does not help at all.
> 
> >>
> >> Also, how are you measuring the improvement? If you can share your
> >> test somewhere, I can see how it affects the other systems I have
> >> access to.
> 
> You can find my benchmark at
> https://github.com/apinski-cavium/copy_page_benchmark .
> copy_page is my previous patch.
> copy_page128 is just the unrolled loop with only the 128-byte prefetching.
> copy_page64 is the original code.
> copy_page64unroll is the new patch which I will be sending out soon.

Thanks, this was really helpful to evaluate the different versions on
the Cortex-A* cores I've got on my desk. Doing so showed that, in fact,
having explicit prfm instructions tends to be *harmful* for us -- the
hardware prefetcher is actually doing a much better job on its own.

Now, I still maintain that we don't want lots of different copy_page
implementations, but I'm not averse to patching a nop with a prfm on
cores that benefit from software-driven prefetching. We could hang it
off the alternatives framework that we have already.
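
Roughly something like this, as a sketch only -- ARM64_HAS_NO_HW_PREFETCH
is a made-up capability bit we would have to define, and the prefetch
distance is just a placeholder:

alternative_if_not ARM64_HAS_NO_HW_PREFETCH
	nop
alternative_else
	prfm	pldl1strm, [x1, #384]	// software prefetch only where it helps
alternative_endif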

> > Are there any possible downsides to using the ThunderX version on other
> > microarchitectures too and skipping the check?
> 
> Yes, that is a good idea.  I will send out a new patch in a little bit
> which just unrolls the loop while keeping the two prefetch
> instructions in there.

copy_page64unroll didn't perform well on all of my systems. The code
below was the best all-rounder I could come up with. Do you reckon you
could try taking it and adding prefetches to see if you can make it fly
on ThunderX?

Cheers,

Will

--->8

ENTRY(copy_page)
	ldp	x2, x3, [x1]
	ldp	x4, x5, [x1, #16]
	ldp	x6, x7, [x1, #32]
	ldp	x8, x9, [x1, #48]
	ldp	x10, x11, [x1, #64]
	ldp	x12, x13, [x1, #80]
	ldp	x14, x15, [x1, #96]
	ldp	x16, x17, [x1, #112]

	mov	x18, #(PAGE_SIZE - 128)
	add	x1, x1, #128
1:
	subs	x18, x18, #128

	stnp	x2, x3, [x0]
	ldp	x2, x3, [x1]
	stnp	x4, x5, [x0, #16]
	ldp	x4, x5, [x1, #16]
	stnp	x6, x7, [x0, #32]
	ldp	x6, x7, [x1, #32]
	stnp	x8, x9, [x0, #48]
	ldp	x8, x9, [x1, #48]
	stnp	x10, x11, [x0, #64]
	ldp	x10, x11, [x1, #64]
	stnp	x12, x13, [x0, #80]
	ldp	x12, x13, [x1, #80]
	stnp	x14, x15, [x0, #96]
	ldp	x14, x15, [x1, #96]
	stnp	x16, x17, [x0, #112]
	ldp	x16, x17, [x1, #112]

	add	x0, x0, #128
	add	x1, x1, #128

	b.gt	1b

	stnp	x2, x3, [x0]
	stnp	x4, x5, [x0, #16]
	stnp	x6, x7, [x0, #32]
	stnp	x8, x9, [x0, #48]
	stnp	x10, x11, [x0, #64]
	stnp	x12, x13, [x0, #80]
	stnp	x14, x15, [x0, #96]
	stnp	x16, x17, [x0, #112]

	ret
ENDPROC(copy_page)
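
For ThunderX, an untested sketch would be to drop a couple of prefetches
in at the top of the loop (the distances are guesses and would need
tuning):

1:
	subs	x18, x18, #128
	prfm	pldl1strm, [x1, #128]	// one 128-byte line ahead of the loads
	prfm	pldl1strm, [x1, #256]	// and one line further out

	stnp	x2, x3, [x0]
	ldp	x2, x3, [x1]
	/* ... remainder of the loop body as above ... */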


* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
  2015-12-21 12:46   ` Will Deacon
@ 2015-12-21 13:42     ` Arnd Bergmann
  -1 siblings, 0 replies; 9+ messages in thread
From: Arnd Bergmann @ 2015-12-21 13:42 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Will Deacon, Andrew Pinski, pinsia, linux-kernel

On Monday 21 December 2015, Will Deacon wrote:
> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> > Adding a check for the cache line size is not much overhead.
> > Special case 128 byte cache line size.
> > This improves copy_page by 85% on ThunderX compared to the
> > original implementation.
> 
> So this patch seems to:
> 
>   - Align the loop
>   - Increase the prefetch size
>   - Unroll the loop once
> 
> Do you know where your 85% boost comes from between these? I'd really
> like to avoid having multiple versions of copy_page, if possible, but
> maybe we could end up with something that works well enough regardless
> of cacheline size. Understanding what your bottleneck is would help to
> lead us in the right direction.
> 
> Also, how are you measuring the improvement? If you can share your
> test somewhere, I can see how it affects the other systems I have access
> to.

A related question would be how other CPU cores are affected by the change.
The test for the cache line size is going to take a few cycles, possibly
a lot on certain implementations, e.g. if we ever get one where 'mrs' is
microcoded or trapped by a hypervisor.

Are there any possible downsides to using the ThunderX version on other
microarchitectures too and skipping the check?

	Arnd


* Re: [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
  2015-12-20  0:11 ` Andrew Pinski
@ 2015-12-21 12:46   ` Will Deacon
  -1 siblings, 0 replies; 9+ messages in thread
From: Will Deacon @ 2015-12-21 12:46 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: pinsia, linux-arm-kernel, linux-kernel

On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> Adding a check for the cache line size is not much overhead.
> Special case 128 byte cache line size.
> This improves copy_page by 85% on ThunderX compared to the
> original implementation.

So this patch seems to:

  - Align the loop
  - Increase the prefetch size
  - Unroll the loop once

Do you know where your 85% boost comes from between these? I'd really
like to avoid having multiple versions of copy_page, if possible, but
maybe we could end up with something that works well enough regardless
of cacheline size. Understanding what your bottleneck is would help to
lead us in the right direction.

Also, how are you measuring the improvement? If you can share your
test somewhere, I can see how it affects the other systems I have access
to.

Will


* [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
@ 2015-12-20  0:11 ` Andrew Pinski
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Pinski @ 2015-12-20  0:11 UTC (permalink / raw)
  To: pinsia, linux-arm-kernel, linux-kernel; +Cc: Andrew Pinski

Adding a check for the cache line size is not much overhead.
Special case 128 byte cache line size.
This improves copy_page by 85% on ThunderX compared to the
original implementation.

For LMBench, it improves performance by 4-10%.

Signed-off-by: Andrew Pinski <apinski@cavium.com>
---
 arch/arm64/lib/copy_page.S |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+), 0 deletions(-)
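
For reference, the CTR_EL0 check added below boils down to the following
(sketch only, using the architectural encoding where CWG is the log2 of
the cache writeback granule in words):

	mrs	x2, ctr_el0		// Cache Type Register
	ubfx	x2, x2, #24, #4		// x2 = CWG field, bits [27:24]
	mov	x3, #4
	lsl	x3, x3, x2		// granule = (4 << CWG) bytes
					// CWG == 4 -> 64 bytes, CWG == 5 -> 128 bytes,
					// hence the 'cmp w2, 5' / 'b.ge 2f' below.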

diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 512b9a7..4c28789 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -18,6 +18,7 @@
 #include <linux/const.h>
 #include <asm/assembler.h>
 #include <asm/page.h>
+#include <asm/cachetype.h>
 
 /*
  * Copy a page from src to dest (both are page aligned)
@@ -27,8 +28,17 @@
  *	x1 - src
  */
 ENTRY(copy_page)
+	/* Special case 128 byte or more cache lines */
+	mrs	x2, ctr_el0
+	lsr	x2, x2, CTR_CWG_SHIFT
+	and	w2, w2, CTR_CWG_MASK
+	cmp	w2, 5
+	b.ge    2f
+
 	/* Assume cache line size is 64 bytes. */
 	prfm	pldl1strm, [x1, #64]
+	/* Align the loop so it fits in one cache line. */
+	.balign 64
 1:	ldp	x2, x3, [x1]
 	ldp	x4, x5, [x1, #16]
 	ldp	x6, x7, [x1, #32]
@@ -43,4 +53,33 @@ ENTRY(copy_page)
 	tst	x1, #(PAGE_SIZE - 1)
 	b.ne	1b
 	ret
+
+2:
+	/* The cache line size is at least 128 bytes. */
+	prfm	pldl1strm, [x1, #128]
+	/* Align the loop so it fits in one cache line. */
+	.balign 128
+1:	prfm	pldl1strm, [x1, #256]
+	ldp	x2, x3, [x1]
+	ldp	x4, x5, [x1, #16]
+	ldp	x6, x7, [x1, #32]
+	ldp	x8, x9, [x1, #48]
+	stnp	x2, x3, [x0]
+	stnp	x4, x5, [x0, #16]
+	stnp	x6, x7, [x0, #32]
+	stnp	x8, x9, [x0, #48]
+
+	ldp	x2, x3, [x1, #64]
+	ldp	x4, x5, [x1, #80]
+	ldp	x6, x7, [x1, #96]
+	ldp	x8, x9, [x1, #112]
+	add	x1, x1, #128
+	stnp	x2, x3, [x0, #64]
+	stnp	x4, x5, [x0, #80]
+	stnp	x6, x7, [x0, #96]
+	stnp	x8, x9, [x0, #112]
+	add	x0, x0, #128
+	tst	x1, #(PAGE_SIZE - 1)
+	b.ne	1b
+	ret
 ENDPROC(copy_page)
-- 
1.7.2.5


