linux-kernel.vger.kernel.org archive mirror
* [RFC] Adding A64FX hardware prefetch sysfs interface
@ 2021-06-07  1:39 tarumizu.kohei
  2021-06-07  8:11 ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: tarumizu.kohei @ 2021-06-07  1:39 UTC (permalink / raw)
  To: 'hpa@zytor.com', 'tglx@linutronix.de',
	'mingo@redhat.com', 'x86@kernel.org',
	'linux-kernel@vger.kernel.org'
  Cc: tarumizu.kohei

Hello

I'm Kohei Tarumizu from Fujitsu Limited. 

The Fujitsu A64FX processor implements a vendor-specific feature set, the HPC extensions[1], and provides several registers for them.
We would like to use the register IMP_PF_STREAM_DETECT_CTRL_EL0 to tune the hardware prefetch, but it is not accessible from userspace.
We are considering implementing a common kernel interface via sysfs as a way to control IMP_PF_STREAM_DETECT_CTRL_EL0 from userspace.
FYI, A64FX also has registers (e.g. IMP_PF_INJECTION_*) that let software control the behavior of the hardware prefetch using "HPC tag address override", but we are not considering them at this time.

[1]https://github.com/fujitsu/A64FX/tree/master/doc/
   A64FX_Specification_HPC_Extension_v1_EN.pdf

This register is similar to the MSR 0x1A4 (MSR_MISC_FEATURE_CONTROL)[2]; the details are described in [Similarity of each register].
From the discussion about the MSR driver, I understand that accessing registers directly from userspace is not a good idea, and that such accesses should be moved to a proper interface.
We think it would be better to have a common interface that can control these registers in the future.
Therefore, we would like to design a new sysfs interface. Could you give us some advice?

[2]https://software.intel.com/content/www/us/en/develop/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors.html

[Similarity of each register]
* Settings for Hardware Prefetch
  These registers enable or disable hardware prefetching for L1/L2 cache.
  The A64FX register also has "Prefetch Distance (bit: [27:24], [19:16])" and "Reliableness attribute for prefetch access (bit: [55], [54])".
* Not accessible from userspace
  In the expected use case (e.g. a user wants to disable hardware prefetch), the register must be accessible from userspace.
* Share settings on a per-CPU basis
  A64FX's register is used in HPC applications and assumes that the process is bound to one core.

The path name has not been decided yet, but we are considering the following structure, similar to cpufreq (/sys/devices/system/cpu/[CPUNUM]/cpufreq).

/sys/devices/system/cpu/[CPUNUM]/prefetcher/
    l1_enable   : This sets or displays whether hardware prefetch is enabled for L1 cache.
    l2_enable   : This sets or displays whether hardware prefetch is enabled for L2 cache.
    l1_dist     : This sets or displays the hardware prefetch distance for L1 cache.
    l2_dist     : This sets or displays the hardware prefetch distance for L2 cache.
    l1_reliable : This sets or displays the reliableness attribute for prefetch access for L1 cache.
    l2_reliable : This sets or displays the reliableness attribute for prefetch access for L2 cache.

If the A64FX-specific parameters ("dist" and "reliable") are not accepted, we would like to implement only the enablement interface.

Best regards
Kohei Tarumizu

* Re: [RFC] Adding A64FX hardware prefetch sysfs interface
  2021-06-07  1:39 [RFC] Adding A64FX hardware prefetch sysfs interface tarumizu.kohei
@ 2021-06-07  8:11 ` Borislav Petkov
  2021-06-09  9:40   ` tarumizu.kohei
  2021-06-11 18:03   ` James Morse
  0 siblings, 2 replies; 6+ messages in thread
From: Borislav Petkov @ 2021-06-07  8:11 UTC (permalink / raw)
  To: tarumizu.kohei, linux-arm-kernel
  Cc: 'hpa@zytor.com', 'tglx@linutronix.de',
	'mingo@redhat.com', 'x86@kernel.org',
	'linux-kernel@vger.kernel.org'

Hi,

(not trimming the mail so that ARM folks can see the whole thing)

On Mon, Jun 07, 2021 at 01:39:21AM +0000, tarumizu.kohei@fujitsu.com wrote:
> Hello
> 
> I'm Kohei Tarumizu from Fujitsu Limited. 
> 
> Fujitsu A64FX processor implements a vendor specific function, the HPC extensions[1].
> A64FX has some registers for HPC extensions.
> We would like to use the register IMP_PF_STREAM_DETECT_CTRL_EL0 for tuning the hardware prefetch, but it's not accessible from userspace.
> We are considering to implement a kernel common interface via sysfs as a way to control IMP_PF_STREAM_DETECT_CTRL_EL0 from userspace.
> FYI, A64FX also has registers (e.g. IMP_PF_INJECTION_*) to control the behavior of the hardware prefetch from the software using "HPC tag address override", but this time we don't considered.
> 
> [1]https://github.com/fujitsu/A64FX/tree/master/doc/
>    A64FX_Specification_HPC_Extension_v1_EN.pdf
> 
> This register is similar to the MSR registers 0x1A4(MSR_MISC_FEATURE_CONTROL)[2] and its details are described in [Similarity of each register].
> From the discussion about the MSR driver, I understood it is not good idea to access registers directly from userspace, and that we want to move it to the proper interface.
> 

That's very nice of you that you're asking upfront, thanks!

> We think it would be better to have the common interface which can control these registers in the future.
> Therefore, we would like to design new sysfs interface, could you give me some advice?
> 
> [2]https://software.intel.com/content/www/us/en/develop/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors.html
> 
> [Similarity of each register]
> * Settings for Hardware Prefetch
>   These registers enable or disable hardware prefetching for L1/L2 cache.
>   The A64FX's register also have "Prefetch Distance (bit: [27:24], [19:16])" and "Reliableness attribute for prefetch access (bit: [55], [54])".
> * Not accessible from userspace
>   In the expected usage scene (e.g. User wants to disable hardware prefetch), it is necessary to be able to access from the userspace.
> * Share settings on a per-CPU basis
>   A64FX's register is used in HPC applications and assumes that the process is bound to one core.
> 
> Currently, the path name has not been decided yet, but we consider of the following structure like cpufreq(/sys/devices/system/cpu/[CPUNUM]/cpufreq).
> 
> /sys/devices/system/cpu/[CPUNUM]/prefetcher/

For that we already have a hierarchy:

tree /sys/devices/system/cpu/cpu0/cache/
/sys/devices/system/cpu/cpu0/cache/
├── index0
│   ├── coherency_line_size
│   ├── id
│   ├── level
│   ├── number_of_sets
│   ├── physical_line_partition
│   ├── shared_cpu_list
│   ├── shared_cpu_map
│   ├── size
│   ├── type
│   ├── uevent
│   └── ways_of_associativity
├── index1
│   ├── coherency_line_size
│   ├── id
│   ├── level
│   ├── number_of_sets
...

that's cpu<NUM>/cache/ and I believe ARM shares some of that code too.

>     l1_enable   : This sets or displays whether hardware prefetch is enabled for L1 cache.
>     l2_enable   : This sets or displays whether hardware prefetch is enabled for L2 cache.
>     l1_dist     : This sets or displays whether hardware prefetch distance for L1 cache.
>     l2_dist     : This sets or displays whether hardware prefetch distance for L2 cache.
>     l1_reliable : This sets or displays whether reliableness attribute for prefetch access for L1 cache.
>     l2_reliable : This sets or displays whether reliableness attribute for prefetch access for L2 cache.

Right, that I'd design differently:

	.../cache/prefetcher/l1/
		            /l1/enable
			    /l1/dist
		            /l1/reliable
	...		    /l2/
	...		    /l3/

so that you have a directory per cache level and in that directory you
have each file.
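
As a rough illustration of the per-level layout suggested above, a userspace tool could build attribute paths along these lines. This is only a sketch: the prefetcher directory is a proposal in this thread and not an existing kernel interface, and the helper name prefetcher_attr_path and its parameters are hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the sysfs path of one prefetcher attribute
 * in the proposed layout
 *   /sys/devices/system/cpu/cpu<NUM>/cache/prefetcher/l<LEVEL>/<attr>
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int prefetcher_attr_path(char *buf, size_t len, unsigned int cpu,
                                unsigned int level, const char *attr)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/cpu/cpu%u/cache/prefetcher/l%u/%s",
                     cpu, level, attr);

    /* snprintf returns the length it wanted to write; detect truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With one directory per cache level, adding a level (such as l3) or an attribute does not change the names of the existing files, which is the main difference from the flat l1_enable/l2_enable naming.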

But let's loop in ARM folks as this is an ARM CPU after all and they'd
care for that code.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* RE: [RFC] Adding A64FX hardware prefetch sysfs interface
  2021-06-07  8:11 ` Borislav Petkov
@ 2021-06-09  9:40   ` tarumizu.kohei
  2021-06-11 18:03   ` James Morse
  1 sibling, 0 replies; 6+ messages in thread
From: tarumizu.kohei @ 2021-06-09  9:40 UTC (permalink / raw)
  To: 'Borislav Petkov', linux-arm-kernel
  Cc: 'hpa@zytor.com', 'tglx@linutronix.de',
	'mingo@redhat.com', 'x86@kernel.org',
	'linux-kernel@vger.kernel.org'

Hi, Borislav and ARM folks.

> For that we already have a hierarchy:

Thank you for the information.
We would first like to look at how cpu<NUM>/cache is implemented on x86, since we are not familiar with its design.

> Right, that I'd design differently:
> 
> 	.../cache/prefetcher/l1/
> 		            /l1/enable
> 			    /l1/dist
> 		            /l1/reliable
> 	...		    /l2/
> 	...		    /l3/
> 
> so that you have a directory per cache level and in that directory you have each
> file.

We agree that it is better to place hardware prefetch files under the cpu<num>/cache directory.

> But let's loop in ARM folks as this is an ARM CPU after all and they'd care for
> that code.

To the ARM folks:
Could you tell us about the current state of the cpu<num>/cache implementation on ARM and any future plans?
If it doesn't yet exist as a feature, we would like to contribute to the work to enable it.

* Re: [RFC] Adding A64FX hardware prefetch sysfs interface
  2021-06-07  8:11 ` Borislav Petkov
  2021-06-09  9:40   ` tarumizu.kohei
@ 2021-06-11 18:03   ` James Morse
  2021-06-18  1:32     ` tarumizu.kohei
  1 sibling, 1 reply; 6+ messages in thread
From: James Morse @ 2021-06-11 18:03 UTC (permalink / raw)
  To: tarumizu.kohei, linux-arm-kernel
  Cc: 'hpa@zytor.com', 'tglx@linutronix.de',
	'mingo@redhat.com', 'x86@kernel.org',
	'linux-kernel@vger.kernel.org',
	Will@kernel.org, Catalin Marinas, Borislav Petkov

Hello!

(CC: +Catalin and Will)

On 07/06/2021 09:11, Borislav Petkov wrote:
> (not trimming the mail so that ARM folks can see the whole thing)
> 
> On Mon, Jun 07, 2021 at 01:39:21AM +0000, tarumizu.kohei@fujitsu.com wrote:
>> Hello
>>
>> I'm Kohei Tarumizu from Fujitsu Limited. 
>>
>> Fujitsu A64FX processor implements a vendor specific function, the HPC extensions[1].
>> A64FX has some registers for HPC extensions.
>> We would like to use the register IMP_PF_STREAM_DETECT_CTRL_EL0 for tuning the hardware prefetch, but it's not accessible from userspace.
>> We are considering to implement a kernel common interface via sysfs as a way to control IMP_PF_STREAM_DETECT_CTRL_EL0 from userspace.


>> FYI, A64FX also has registers (e.g. IMP_PF_INJECTION_*) to control the behavior of the hardware prefetch from the software using "HPC tag address override", but this time we don't considered.
>>
>> [1]https://github.com/fujitsu/A64FX/tree/master/doc/
>>    A64FX_Specification_HPC_Extension_v1_EN.pdf

While this is initially about sysfs, don't you need the 'HPC tag address override' to be
enabled for this to be useful? I don't think that feature can be managed by a driver:

'HPC tag address override' changes the top byte of all user-space pointers from being
ignored (as they have been since day-1 on arm64) to having implications for the hardware.
If I've read the document correctly this affects the prefetch mode and where in the L1/L2
such accesses will be allocated.

This would impact user-space that is using the top-byte for their own purposes.
For example hwasan uses this field as a tag it allocates itself:
https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
Enabling 'HPC tag address override' for all user-space is going to have weird performance
effects.

To make this work, I think you'd need a per-process opt-in, and __switch_to() would need
to toggle your IMP_FJ_TAG_ADDRESS_CTRL_EL1.TBOx bits. Because it's an
implementation-defined feature, but the controls can't be confined to a driver, I don't
think enabling 'HPC tag address override' is viable.

Is the sysfs information useful without it?


Thanks,

James

* RE: [RFC] Adding A64FX hardware prefetch sysfs interface
  2021-06-11 18:03   ` James Morse
@ 2021-06-18  1:32     ` tarumizu.kohei
  2021-07-08  1:59       ` tarumizu.kohei
  0 siblings, 1 reply; 6+ messages in thread
From: tarumizu.kohei @ 2021-06-18  1:32 UTC (permalink / raw)
  To: 'James Morse', linux-arm-kernel
  Cc: 'hpa@zytor.com', 'tglx@linutronix.de',
	'mingo@redhat.com', 'x86@kernel.org',
	'linux-kernel@vger.kernel.org',
	Will@kernel.org, Catalin Marinas, Borislav Petkov

Hi, James.

Thank you for your comment.

> While this is initially about sysfs, don't you need the 'HPC tag address override'
> to be enabled for this to be useful? I don't think that feature can be managed by
> a driver:

It is certainly useful to enable 'HPC tag address override' for more control.
However, enabling "HPC tag address override" has some challenges as you commented.
We have also verified that the performance can be improved via IMP_PF_STREAM_DETECT_CTRL_EL0 without using 'HPC tag address override'.
Therefore, as a first step, we would like to implement a sysfs interface that controls only IMP_PF_STREAM_DETECT_CTRL_EL0.

At this time, we don't intend to enable "HPC tag address override", but if necessary, we would like to consider it.

> 'HPC tag address override' changes the top byte of all user-space pointers from
> being ignored (as they have been since day-1 on arm64) to having implications
> for the hardware.
> If I've read the document correctly this affects the prefetch mode and where in
> the L1/L2 such accesses will be allocated.

Your understanding of 'HPC tag address override' is correct.
If it is enabled, tuning according to the characteristics of each load/store instruction is possible.
On the other hand, without it we can still change the system-wide settings 'Prefetch Enablement (bit: [59], [58])', 'Prefetch Distance (bit: [27:24], [19:16])', and 'Prefetch Reliableness (bit: [55], [54])' via IMP_PF_STREAM_DETECT_CTRL_EL0.
The latter does not allow per-instruction tuning, but it does allow per-application tuning.
At this point, we assume that one application is bound to one core.
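
To make the field positions quoted above concrete, here is a minimal sketch of how a 64-bit register value could be packed and unpacked. It uses only the bit positions given in this thread. It assumes that in each pair the first entry refers to L1 and the second to L2, which the thread does not state explicitly; the polarity of the enablement bits is also not stated here, so the helpers only read and write raw bits. All macro and function names are hypothetical.

```c
#include <stdint.h>

/* Bit positions as quoted in the thread (L1/L2 ordering is an assumption). */
#define PF_L1_ENABLEMENT_BIT 59u
#define PF_L2_ENABLEMENT_BIT 58u
#define PF_L1_RELIABLE_BIT   55u
#define PF_L2_RELIABLE_BIT   54u
#define PF_L1_DIST_SHIFT     24u   /* field [27:24] */
#define PF_L2_DIST_SHIFT     16u   /* field [19:16] */
#define PF_DIST_MASK         0xFull

/* Read a single-bit field from a register value. */
static inline uint64_t pf_get_bit(uint64_t reg, unsigned int bit)
{
    return (reg >> bit) & 1ull;
}

/* Return reg with the single-bit field set to the low bit of val. */
static inline uint64_t pf_set_bit(uint64_t reg, unsigned int bit,
                                  unsigned int val)
{
    return (reg & ~(1ull << bit)) | ((uint64_t)(val & 1u) << bit);
}

/* Read a 4-bit prefetch distance field. */
static inline uint64_t pf_get_dist(uint64_t reg, unsigned int shift)
{
    return (reg >> shift) & PF_DIST_MASK;
}

/* Return reg with the 4-bit distance field replaced by dist. */
static inline uint64_t pf_set_dist(uint64_t reg, unsigned int shift,
                                   uint64_t dist)
{
    return (reg & ~(PF_DIST_MASK << shift)) | ((dist & PF_DIST_MASK) << shift);
}
```

These helpers only manipulate a value in memory. Reading or writing the actual register would have to happen in the kernel, which is what the proposed sysfs interface is for.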

> This would impact user-space that is using the top-byte for their own purposes.
> For example hwasan uses this field as a tag it allocates itself:
> https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
> Enabling 'HPC tag address override' for all user-space is going to have weird
> performance effects.
> 
> To make this work, I think you'd need a per-process opt-in, and __switch_to()
> would need to toggle your IMP_FJ_TAG_ADDRESS_CTRL_EL1.TBOx bits.
> Because its an implementation-defined feature, but the controls can't be
> confined to a driver, I don't think enabling 'HPC tag address override' is viable.

We understand that enabling 'HPC tag address override' raises these challenges.
However, if we don't enable 'HPC tag address override', these considerations are probably unnecessary, because settings made via IMP_PF_STREAM_DETECT_CTRL_EL0 are treated as system-wide settings.

> Is the sysfs information useful without it?

We think that in most cases it is enough to tune the system-wide settings 'Prefetch Enablement', 'Prefetch Distance', and 'Prefetch Reliableness' via IMP_PF_STREAM_DETECT_CTRL_EL0.
Therefore, we think it is useful to implement a sysfs interface that operates only on IMP_PF_STREAM_DETECT_CTRL_EL0.

Best regards.

* RE: [RFC] Adding A64FX hardware prefetch sysfs interface
  2021-06-18  1:32     ` tarumizu.kohei
@ 2021-07-08  1:59       ` tarumizu.kohei
  0 siblings, 0 replies; 6+ messages in thread
From: tarumizu.kohei @ 2021-07-08  1:59 UTC (permalink / raw)
  To: 'James Morse', 'linux-arm-kernel@lists.infradead.org'
  Cc: 'hpa@zytor.com', 'tglx@linutronix.de',
	'mingo@redhat.com', 'x86@kernel.org',
	'linux-kernel@vger.kernel.org', 'Will@kernel.org',
	'Catalin Marinas', 'Borislav Petkov'

Hi, ARM folks.

> For that we already have a hierarchy:
> 
> tree /sys/devices/system/cpu/cpu0/cache/
> /sys/devices/system/cpu/cpu0/cache/
> ├── index0
> │   ├── coherency_line_size
> │   ├── id
> │   ├── level
> │   ├── number_of_sets
> │   ├── physical_line_partition
> │   ├── shared_cpu_list
> │   ├── shared_cpu_map
> │   ├── size
> │   ├── type
> │   ├── uevent
> │   └── ways_of_associativity
> ├── index1
> │   ├── coherency_line_size
> │   ├── id
> │   ├── level
> │   ├── number_of_sets
> ...
> 
> that's cpu<NUM>/cache/ and I believe ARM shares some of that code too.
> 
> >     l1_enable   : This sets or displays whether hardware prefetch is enabled for L1 cache.
> >     l2_enable   : This sets or displays whether hardware prefetch is enabled for L2 cache.
> >     l1_dist     : This sets or displays whether hardware prefetch distance for L1 cache.
> >     l2_dist     : This sets or displays whether hardware prefetch distance for L2 cache.
> >     l1_reliable : This sets or displays whether reliableness attribute for prefetch access for L1 cache.
> >     l2_reliable : This sets or displays whether reliableness attribute for prefetch access for L2 cache.
> 
> Right, that I'd design differently:
> 
> 	.../cache/prefetcher/l1/
> 		            /l1/enable
> 			    /l1/dist
> 		            /l1/reliable
> 	...		    /l2/
> 	...		    /l3/
> 
> so that you have a directory per cache level and in that directory you
> have each file.
> 
> But let's loop in ARM folks as this is an ARM CPU after all and they'd
> care for that code.

Could you comment on the following two ideas for the sysfs interface directory structure to control hardware prefetch?

    1. /sys/devices/system/cpu/cpu<num>/cache/prefetcher
    2. /sys/devices/system/cpu/cpu<num>/prefetcher

We think Proposal 1 is better because it makes clear that this is a cache-related feature.

> To the ARM folks:
> Would you give me information about the current state of cpu<num>/cache implementation in ARM and the future plans?
> If it doesn't yet exist as a feature, we would like to contribute to the work to enable it.

Regarding the above question, we initially thought that the cpu<num>/cache directory (/sys/devices/system/cpu/cpu<num>/cache) was not yet implemented on ARM64, because it does not exist on our ARM64 machine, an FX700 with the A64FX processor.
However, we realized that even on ARM64, the cpu<num>/cache directory is created if "ACPI PPTT" or "devicetree" is supported.
The directory was missing because the FX700 firmware does not support "ACPI PPTT"; this is not a general problem on ARM64.
Therefore, we withdraw the above question.

On the other hand, for the new sysfs interface to control hardware prefetch, we'd like to take into account environments that do not support "ACPI PPTT".
If we adopt Proposal 2, no special consideration is required.
However, if we adopt Proposal 1, we need to consider, for example, creating an empty cpu<num>/cache directory in environments that do not support "ACPI PPTT".
We don't think creating an empty directory is a problem, because the hardware prefetch control sysfs interface does not depend on the contents of cpu<num>/cache/index<num>.
If we create an empty directory, are there any other issues to consider?
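
For reference, whether a given system exposes the cpu<num>/cache directory at all can be checked from userspace with a small probe like the one below. This is only a sketch for illustration; the function names are hypothetical, and it checks cpu0 only.

```c
#include <sys/stat.h>

/* Return 1 if path exists and is a directory, 0 otherwise. */
static int dir_exists(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Hypothetical probe: is the existing cache hierarchy present for cpu0? */
static int cpu0_cache_dir_present(void)
{
    return dir_exists("/sys/devices/system/cpu/cpu0/cache");
}
```

On a machine whose firmware provides neither "ACPI PPTT" nor devicetree cache information, as described above for the FX700, the probe would be expected to return 0.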

Best regards,
Kohei Tarumizu
