linux-arm-kernel.lists.infradead.org archive mirror
* Built a neon version copy_page/clear_page is correct?
@ 2018-12-12  6:01 JackieLiu
  2018-12-12  7:18 ` Ard Biesheuvel
  0 siblings, 1 reply; 6+ messages in thread
From: JackieLiu @ 2018-12-12  6:01 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Andrew Pinski, linux-arm-kernel

Hello, Maintainer.

I want to use NEON intrinsics to build NEON versions of the
copy_page/clear_page functions, but I don't know why other platforms
haven't done this before, so I am sending this email to ask whether
this is a correct approach.

BTW, I found a similar implementation in arch/x86/lib/mmx_32.c. Any ideas
are welcome.

BR. 
Jackie


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: Built a neon version copy_page/clear_page is correct?
  2018-12-12  6:01 Built a neon version copy_page/clear_page is correct? JackieLiu
@ 2018-12-12  7:18 ` Ard Biesheuvel
  2018-12-12  7:32   ` JackieLiu
  0 siblings, 1 reply; 6+ messages in thread
From: Ard Biesheuvel @ 2018-12-12  7:18 UTC (permalink / raw)
  To: liuyun01; +Cc: Catalin Marinas, Will Deacon, Andrew Pinski, linux-arm-kernel

On Wed, 12 Dec 2018 at 07:01, JackieLiu <liuyun01@kylinos.cn> wrote:
>
> Hello, Maintainer.
>
> I want to use NEON intrinsics to build NEON versions of the
> copy_page/clear_page functions, but I don't know why other platforms
> haven't done this before, so I am sending this email to ask whether
> this is a correct approach.
>

clear_page() uses DC ZVA instructions, so I doubt you'd be able to
improve on that with NEON stores.
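
For readers unfamiliar with DC ZVA: the instruction zeroes a whole block of memory at once, and the block size is advertised in the DCZID_EL0 register. The sketch below only decodes a DCZID_EL0 value that has already been read; it does not execute DC ZVA or read the register itself. The helper names are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DCZID_EL0 field layout, as described in the Arm Architecture Reference
 * Manual:
 *   bits [3:0] (BS)  = log2 of the DC ZVA block size, counted in 4-byte words
 *   bit  [4]   (DZP) = 1 when use of DC ZVA is prohibited */
#define DCZID_BS_MASK 0xfu
#define DCZID_DZP     (1u << 4)

/* Hypothetical helper: DC ZVA block size in bytes, or 0 if prohibited. */
static size_t dczid_block_bytes(uint64_t dczid)
{
    if (dczid & DCZID_DZP)
        return 0;
    return (size_t)4 << (dczid & DCZID_BS_MASK);
}

/* Hypothetical helper: how many DC ZVA operations clear one page,
 * or 0 if the instruction cannot be used. */
static size_t zva_ops_per_page(uint64_t dczid, size_t page_size)
{
    size_t block = dczid_block_bytes(dczid);

    if (block == 0)
        return 0;
    assert(page_size % block == 0);
    return page_size / block;
}
```

With BS = 4 the block is 64 bytes, so a 4k page takes 64 operations and a 64k page takes 1024, each of which zeroes a full block without loading any data into registers.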

As for copy_page(), please describe a use case where it is a
bottleneck, and reason about how much you could improve performance in
that case by improving the speed of copy_page() itself. Otherwise,
we're just adding NEON routines for the sake of it, which is a bad
idea imo.


* Re: Built a neon version copy_page/clear_page is correct?
  2018-12-12  7:18 ` Ard Biesheuvel
@ 2018-12-12  7:32   ` JackieLiu
  2018-12-12 11:45     ` Robin Murphy
  2018-12-12 12:21     ` Russell King - ARM Linux
  0 siblings, 2 replies; 6+ messages in thread
From: JackieLiu @ 2018-12-12  7:32 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Andrew Pinski, linux-arm-kernel

Yes, I have a bottleneck. Maybe it is not in copy_page, but
during debugging this function showed very high
CPU utilization.

The test program is UnixBench’s src/spawn.c, which forks processes
in a while loop. The only variable in the test is PAGE_SIZE: one run
uses a 4k PAGE_SIZE, the other a 64k PAGE_SIZE.

Result from "perf top":
4k  |  13% CPU  copy_page
64k |  48% CPU  copy_page

This is why I want to optimize this function. Or maybe the bottleneck
is not here?
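
One way to put a bound on the possible gain is Amdahl's law. The sketch below is only an illustration: it assumes the "perf top" share can be read as the fraction of run time spent in copy_page, and the speedup factor passed in is hypothetical, not a measurement.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the run time is made
 * s times faster and the remaining (1 - p) is unchanged. */
static double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

Under that assumption, a hypothetical copy_page that is twice as fast gives about 1.07x overall with 4k pages (p = 0.13) and about 1.32x with 64k pages (p = 0.48). Even an infinitely fast copy would be capped at roughly 1.15x and 1.92x respectively.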

> On 12 Dec 2018, at 15:18, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> 
> As for copy_page(), please describe a use case where it is a
> bottleneck, and reason about how much you could improve performance in
> that case by improving the speed of copy_page() itself. Otherwise,
> we're just adding NEON routines for the sake of it, which is a bad
> idea imo.






* Re: Built a neon version copy_page/clear_page is correct?
  2018-12-12  7:32   ` JackieLiu
@ 2018-12-12 11:45     ` Robin Murphy
  2018-12-12 11:58       ` JackieLiu
  2018-12-12 12:21     ` Russell King - ARM Linux
  1 sibling, 1 reply; 6+ messages in thread
From: Robin Murphy @ 2018-12-12 11:45 UTC (permalink / raw)
  To: JackieLiu, Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Andrew Pinski, linux-arm-kernel

On 12/12/2018 07:32, JackieLiu wrote:
> Yes, I have a bottleneck. Maybe it is not in copy_page, but
> during debugging this function showed very high
> CPU utilization.
> 
> The test program is UnixBench’s src/spawn.c, which forks processes
> in a while loop. The only variable in the test is PAGE_SIZE: one run
> uses a 4k PAGE_SIZE, the other a 64k PAGE_SIZE.

AFAICS all that does is call fork() in a loop as fast as it possibly 
can. Forking involves copying pages, either during the call or via 
copy-on-write triggering in one or both processes after the syscall 
returns. So your 'problem' is that some benchmark code spends a fair 
amount of time doing the major part of the operation it's benchmarking... :/
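
The shape of such a fork loop can be sketched as follows. This is not the UnixBench source, only a minimal stand-in under the assumption that each child exits immediately and the parent reaps it; the function name is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork `iterations` children one after another. Each child exits at once
 * and the parent waits for it. Returns the number of children that exited
 * cleanly, or -1 if fork() or waitpid() failed. */
static int spawn_loop(int iterations)
{
    int ok = 0;

    assert(iterations >= 0);

    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);               /* child: nothing else to do */

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

A loop like this does almost nothing except create and tear down address spaces, so whatever the kernel does per fork() dominates the profile.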

> Result from "perf top":
> 4k  |  13% CPU  copy_page
> 64k |  48% CPU  copy_page
> 
> This is why I want to optimize this function. Or maybe the bottleneck
> is not here?

Also bear in mind that AFAIK most current cores can happily saturate 
their load/store unit with just LDP/STP - this isn't like Armv7 where 
VLD* was the only way to generate a single 128-bit wide access. If 
you're not doing any actual calculation or LDn/STn interleaving 
trickery, using NEON purely to move data is unlikely to be worthwhile in 
general. On many cores it may well end up being slower.
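
To illustrate the point in C rather than assembly, here is a sketch of a page copy that moves two 64-bit general-purpose values per step, which is the access pattern LDP/STP provides. It is an assumption-laden illustration, not the kernel's copy_page: the page size is fixed at 4096 for the example, the names are hypothetical, and which instructions the compiler actually emits depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size for this sketch */
#define SKETCH_PAGE_WORDS (SKETCH_PAGE_SIZE / sizeof(uint64_t))

/* Copy one page using pairs of 64-bit loads and stores, no vector types. */
static void copy_page_gpr(uint64_t *dst, const uint64_t *src)
{
    assert(((uintptr_t)dst % sizeof(uint64_t)) == 0);
    assert(((uintptr_t)src % sizeof(uint64_t)) == 0);

    for (size_t i = 0; i < SKETCH_PAGE_WORDS; i += 2) {
        uint64_t a = src[i];        /* first register of the pair */
        uint64_t b = src[i + 1];    /* second register of the pair */

        dst[i] = a;
        dst[i + 1] = b;
    }
}

/* Small self-check: fill a page with a pattern, copy it, compare.
 * Returns 1 if the copy matches the source. */
static int copy_page_gpr_selftest(void)
{
    static uint64_t src[SKETCH_PAGE_WORDS];
    static uint64_t dst[SKETCH_PAGE_WORDS];

    for (size_t i = 0; i < SKETCH_PAGE_WORDS; i++)
        src[i] = i * 0x9e3779b97f4a7c15ull;

    copy_page_gpr(dst, src);
    return memcmp(dst, src, SKETCH_PAGE_SIZE) == 0;
}
```

Each iteration moves 16 bytes, the same amount a single 128-bit NEON load and store would move, which is why a NEON version has little room to win on data movement alone.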

Robin.

> 
>> On 12 Dec 2018, at 15:18, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>>
>> As for copy_page(), please describe a use case where it is a
>> bottleneck, and reason about how much you could improve performance in
>> that case by improving the speed of copy_page() itself. Otherwise,
>> we're just adding NEON routines for the sake of it, which is a bad
>> idea imo.

* Re: Built a neon version copy_page/clear_page is correct?
  2018-12-12 11:45     ` Robin Murphy
@ 2018-12-12 11:58       ` JackieLiu
  0 siblings, 0 replies; 6+ messages in thread
From: JackieLiu @ 2018-12-12 11:58 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Catalin Marinas, Will Deacon, Andrew Pinski, linux-arm-kernel,
	Ard Biesheuvel

Thanks for your suggestion. Maybe the problem is not in copy_page;
I need to re-read the fork code and look for the real bottleneck.

BR.
Jackie






* Re: Built a neon version copy_page/clear_page is correct?
  2018-12-12  7:32   ` JackieLiu
  2018-12-12 11:45     ` Robin Murphy
@ 2018-12-12 12:21     ` Russell King - ARM Linux
  1 sibling, 0 replies; 6+ messages in thread
From: Russell King - ARM Linux @ 2018-12-12 12:21 UTC (permalink / raw)
  To: JackieLiu
  Cc: Catalin Marinas, Will Deacon, Andrew Pinski, linux-arm-kernel,
	Ard Biesheuvel

On Wed, Dec 12, 2018 at 03:32:17PM +0800, JackieLiu wrote:
> Yes, I have a bottleneck. Maybe it is not in copy_page, but
> during debugging this function showed very high
> CPU utilization.
> 
> The test program is UnixBench’s src/spawn.c, which forks processes
> in a while loop. The only variable in the test is PAGE_SIZE: one run
> uses a 4k PAGE_SIZE, the other a 64k PAGE_SIZE.
> 
> Result from "perf top":
> 4k  |  13% CPU  copy_page
> 64k |  48% CPU  copy_page
> 
> This is why I want to optimize this function. Or maybe the bottleneck
> is not here?

I don't see anything out of the ordinary or unexpected here.  In
comparison to the rest of the work being done at and after fork(),
the most expensive bit _will_ be copying data around, so of course
this comes out high in the statistics.

However, UnixBench's spawn program is a system benchmark that allows
you to compare specific details of one implementation with another -
it is not supposed to be used to optimise a system.  Why?  It's not
a realistic workload.

Most programs either create threads (which are clones of another
thread, and share pages) or they fork() and then shortly later execve()
another program.  In both cases, there is very little page copying.
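
The fork()-then-execve() pattern can be sketched as below. The child replaces its address space almost immediately, so few copy-on-write faults have a chance to occur. The function name is hypothetical, and the use of /bin/sh as the program to execute is an assumption made only so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Fork, then have the child execve() a shell that runs `cmd`.
 * Returns the child's exit status, or -1 on failure. */
static int run_shell_command(const char *cmd)
{
    assert(cmd != NULL);

    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: swap in a new program right away. */
        char *const argv[] = { "sh", "-c", (char *)cmd, NULL };

        execve("/bin/sh", argv, environ);
        _exit(127);                 /* only reached if execve() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_shell_command("exit 7") is expected to return 7 on a system where /bin/sh exists.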


