linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
@ 2018-11-14 23:00 Aubrey Li
  2018-11-14 23:00 ` [PATCH v3 2/2] proc: add /proc/<pid>/arch_state Aubrey Li
  2018-11-15 15:40 ` [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
  0 siblings, 2 replies; 16+ messages in thread
From: Aubrey Li @ 2018-11-14 23:00 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

User-space tools that do automated task placement need information
about tasks' AVX-512 usage, because AVX-512 usage can cause a core
turbo frequency drop and impact the running task on the sibling CPU.

The XSAVE header contains a state-component bitmap, which allows
software to discover the state of the init optimization used by
XSAVEOPT and XSAVES. Set bits in the bitmap denote usage of the
corresponding components.

The AVX-512 component has 3 states; only the Hi16_ZMM state causes a
notable frequency drop. Add per-task Hi16_ZMM state tracking to the
context switch.

The tracking turns on the usage flag immediately, but requires 3
consecutive context switches with no usage to clear it. This decay is
required because AVX-512-using tasks could set the Hi16_ZMM state back
to the init state themselves.
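The decay can be modelled in plain C outside the kernel. This is an
illustrative sketch, not kernel code; the struct is a stand-in for the
kernel's struct fpu, and the Hi16_ZMM bit position follows the SDM's
XSAVE layout:

```c
#include <stdint.h>

#define HI16ZMM_STATE_DECAY_COUNT	3
#define XFEATURE_MASK_Hi16_ZMM		(1ULL << 7)	/* SDM state-component bit 7 */

struct fpu_model {
	uint64_t xfeatures;		/* XSAVE state-component bitmap */
	unsigned char hi16zmm_usage;	/* decaying usage counter */
};

/* Called once per (simulated) context switch: arm immediately, decay slowly. */
static void update_hi16zmm_state(struct fpu_model *fpu)
{
	if (fpu->xfeatures & XFEATURE_MASK_Hi16_ZMM)
		fpu->hi16zmm_usage = HI16ZMM_STATE_DECAY_COUNT;
	else if (fpu->hi16zmm_usage)
		fpu->hi16zmm_usage--;
}
```

One switch with Hi16_ZMM state in use arms the counter; only three
consecutive switches with the state back in init clear it again.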

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/include/asm/fpu/internal.h | 26 ++++++++++++++++++++++++++
 arch/x86/include/asm/fpu/types.h    |  9 +++++++++
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a..f382449 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -275,6 +275,31 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define	HI16ZMM_STATE_DECAY_COUNT	3
+/*
+ * This function is called during context switch to update Hi16_ZMM state
+ */
+static inline void update_hi16zmm_state(struct fpu *fpu)
+{
+	/*
+	 * The XSAVE header contains a state-component bitmap (xfeatures),
+	 * which allows software to discover the state of the init
+	 * optimization used by XSAVEOPT and XSAVES.
+	 *
+	 * Hi16_ZMM state (one state of the AVX-512 component) is tracked
+	 * here because its usage could cause a notable frequency drop.
+	 *
+	 * AVX-512-using tasks could set Hi16_ZMM state back to the init
+	 * state themselves, so this tracking mechanism can miss. The decay
+	 * ensures that false negatives do not immediately make a task be
+	 * considered as not using Hi16_ZMM registers.
+	 */
+	if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_Hi16_ZMM)
+		fpu->hi16zmm_usage = HI16ZMM_STATE_DECAY_COUNT;
+	else if (fpu->hi16zmm_usage)
+		fpu->hi16zmm_usage--;
+}
+
 /*
  * This function is called only during boot time when x86 caps are not set
  * up and alternative can not be used yet.
@@ -411,6 +436,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
 		copy_xregs_to_kernel(&fpu->state.xsave);
+		update_hi16zmm_state(fpu);
 		return 1;
 	}
 
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c539..c0c7577 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -303,6 +303,15 @@ struct fpu {
 	unsigned char			initialized;
 
 	/*
+	 * @hi16zmm_usage:
+	 *
+	 * Records usage of the upper 16 AVX-512 registers: ZMM16-ZMM31.
+	 * A non-zero value indicates that these registers recently held
+	 * valid state.
+	 */
+	unsigned char			hi16zmm_usage;
+
+	/*
 	 * @state:
 	 *
 	 * In-memory copy of all FPU registers that we save/restore
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-14 23:00 [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
@ 2018-11-14 23:00 ` Aubrey Li
  2018-11-15 15:18   ` Dave Hansen
  2018-11-19 17:39   ` Peter Zijlstra
  2018-11-15 15:40 ` [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
  1 sibling, 2 replies; 16+ messages in thread
From: Aubrey Li @ 2018-11-14 23:00 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

Add a /proc/<pid>/arch_state interface to expose per-task, CPU-specific
state values.

Exposing AVX-512 Hi16_ZMM register usage lets a user-space job scheduler
cluster AVX-512-using tasks together, because these tasks could cause a
core turbo frequency drop.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 16 ++++++++++++++++
 fs/proc/base.c               | 13 +++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7..10224ee 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,7 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1245,3 +1246,18 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+/*
+ * Report CPU-specific thread state.
+ */
+void arch_thread_state(struct seq_file *m, struct task_struct *task)
+{
+	/*
+	 * Report AVX-512 Hi16_ZMM registers usage
+	 */
+	if (task->thread.fpu.hi16zmm_usage)
+		seq_putc(m, '1');
+	else
+		seq_putc(m, '0');
+	seq_putc(m, '\n');
+}
diff --git a/fs/proc/base.c b/fs/proc/base.c
index aaffc0c..efd51ec 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2893,6 +2893,17 @@ static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_LIVEPATCH */
 
+void __weak arch_thread_state(struct seq_file *m, struct task_struct *task)
+{
+}
+
+static int proc_pid_arch_state(struct seq_file *m, struct pid_namespace *ns,
+				 struct pid *pid, struct task_struct *task)
+{
+	arch_thread_state(m, task);
+	return 0;
+}
+
 /*
  * Thread groups
  */
@@ -2994,6 +3005,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_LIVEPATCH
 	ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
+	ONE("arch_state",  S_IRUSR, proc_pid_arch_state),
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3372,6 +3384,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_LIVEPATCH
 	ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
+	ONE("arch_state",  S_IRUSR, proc_pid_arch_state),
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
-- 
2.7.4



* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-14 23:00 ` [PATCH v3 2/2] proc: add /proc/<pid>/arch_state Aubrey Li
@ 2018-11-15 15:18   ` Dave Hansen
  2018-11-16  0:32     ` Li, Aubrey
  2018-11-19 17:39   ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2018-11-15 15:18 UTC (permalink / raw)
  To: Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel, Aubrey Li

On 11/14/18 3:00 PM, Aubrey Li wrote:
> +void arch_thread_state(struct seq_file *m, struct task_struct *task)
> +{
> +	/*
> +	 * Report AVX-512 Hi16_ZMM registers usage
> +	 */
> +	if (task->thread.fpu.hi16zmm_usage)
> +		seq_putc(m, '1');
> +	else
> +		seq_putc(m, '0');
> +	seq_putc(m, '\n');
> +}

Am I reading this right that this file just dumps out a plain 0 or 1
based on the internal kernel state?  BTW, there's no example of the
output in the changelog, so it's rather hard to tell if my guess is
right.  (Hint, hint).

If so, I'd really prefer you not do this.  We have /proc/$pid/stat to
stand as a disaster in this regard.  It is essentially
non-human-readable gibberish because it's impossible to tell what the
values mean without a decoder ring.

If we go down this road, we need a file along the lines of
/proc/$pid/status.

But, either way, this is a new ABI that we need to consider carefully.
It needs documentation.  For instance, will this really mean "Hi16_ZMM
user" from now until the end of time?  Or, does it just mean "group me
with other tasks that have this bit set"?
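For illustration, a /proc/$pid/status-style line with a self-describing
key could be produced along these lines. This is a hypothetical sketch
in userspace C, not part of the patch; the field name and helper are
invented:

```c
#include <stdio.h>

/*
 * Format a hypothetical status-style arch_state line: one named
 * "Key:\tvalue" pair per feature, readable without a decoder ring.
 */
static int format_arch_state(char *buf, size_t len, unsigned char hi16zmm_usage)
{
	return snprintf(buf, len, "AVX512_Hi16_ZMM_used:\t%d\n",
			hi16zmm_usage ? 1 : 0);
}
```

New fields could then be appended as further named lines without
breaking existing parsers, which is the point of the status format.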


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-14 23:00 [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
  2018-11-14 23:00 ` [PATCH v3 2/2] proc: add /proc/<pid>/arch_state Aubrey Li
@ 2018-11-15 15:40 ` Dave Hansen
  2018-11-16  0:21   ` Li, Aubrey
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2018-11-15 15:40 UTC (permalink / raw)
  To: Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel, Aubrey Li

On 11/14/18 3:00 PM, Aubrey Li wrote:
> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
> frequency drop. Add per task Hi16_ZMM state tracking to context switch.

Just curious, but is there any public documentation of this?  It seems
really odd to me that something using the same AVX-512 instructions on
some low-numbered registers would behave differently than the same
instructions on some high-numbered registers.  I'm not saying this is
wrong, but it's certainly counter-intuitive and I think that begs for
some more explanation.

> The tracking turns on the usage flag immediately, but requires 3
> consecutive context switches with no usage to clear it. This decay is
> required because of AVX-512 using tasks could set Hi16_ZMM state back
> to the init state themselves.

It would be nice to not assume that folks reading this changelog know
what XSAVE 'init states' are.  In fact, that comment you have in the
function below would be great here, but probably shouldn't be in the
comment.

I would say it even more strongly than that:  Part of the best practices
for using AVX-512 is to use the VZEROUPPER instruction to zero out some
state when the AVX-512 operation is finished.  Unlike all the other FPU
and AVX state before it,  this means that the Hi16_ZMM is expected to
frequently transition back to the "init state" in normal use.  This
might cause this detection mechanism to frequently miss tasks that
actually use AVX-512.  To fix that, add a decay.

> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  arch/x86/include/asm/fpu/internal.h | 26 ++++++++++++++++++++++++++
>  arch/x86/include/asm/fpu/types.h    |  9 +++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index a38bf5a..f382449 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -275,6 +275,31 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
>  		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
>  		     : "memory")
>  
> +#define	HI16ZMM_STATE_DECAY_COUNT	3
> +/*
> + * This function is called during context switch to update Hi16_ZMM state
> + */
> +static inline void update_hi16zmm_state(struct fpu *fpu)
> +{
> +	/*
> +	 * XSAVE header contains a state-component bitmap(xfeatures),
> +	 * which allows software to discover the state of the init
> +	 * optimization used by XSAVEOPT and XSAVES.

I don't think we need the XSAVE background here.  Can you put this in
the changelog?

> +	 * Hi16_ZMM state(one state of AVX-512 component) is tracked here
> +	 * because its usage could cause notable core turbo frequency drop.

I'd leave just this part of the comment.

> +	 * AVX512-using tasks could set Hi16_ZMM state back to the init
> +	 * state themselves. Thus, this tracking mechanism can miss.

Can you make this a stronger statement, just like the changelog?

> +	 * The decay usage ensures that false-negatives do not immediately
> +	 * make a task be considered as not using Hi16_ZMM registers.
> +	 */

To ensure that false-negatives do not immediately show up, decay the
usage count over time.


> 
>  	/*
> +	 * @hi16zmm_usage:
> +	 *
> +	 * Records the usage of the upper 16 AVX512 registers: ZMM16-ZMM31.
> +	 * A value of non-zero is used to indicate whether there is valid
> +	 * state in these AVX512 registers.
> +	 */
> +	unsigned char			hi16zmm_usage;
> +

Nit: With the decay, this does not indicate register state.  It
indicates whether the registers recently had state.




* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-15 15:40 ` [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
@ 2018-11-16  0:21   ` Li, Aubrey
  2018-11-16  1:04     ` Dave Hansen
  2018-11-16 23:10     ` Dave Hansen
  0 siblings, 2 replies; 16+ messages in thread
From: Li, Aubrey @ 2018-11-16  0:21 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/11/15 23:40, Dave Hansen wrote:
> On 11/14/18 3:00 PM, Aubrey Li wrote:
>> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
>> frequency drop. Add per task Hi16_ZMM state tracking to context switch.
> 
> Just curious, but is there any public documentation of this?  It seems
> really odd to me that something using the same AVX-512 instructions on
> some low-numbered registers would behave differently than the same
> instructions on some high-numbered registers.  I'm not saying this is
> wrong, but it's certainly counter-intuitive and I think that begs for
> some more explanation.

Yes, the Intel 64 and IA-32 Architectures Software Developer's Manual
mentions this in the performance event CORE_POWER.LVL2_TURBO_LICENSE.

"Core cycles where the core was running with power delivery for license
level 2 (introduced in Skylake Server microarchitecture). This includes
high current AVX 512-bit instructions."

I translated license level 2 to frequency drop.

Thanks,
-Aubrey


* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-15 15:18   ` Dave Hansen
@ 2018-11-16  0:32     ` Li, Aubrey
  0 siblings, 0 replies; 16+ messages in thread
From: Li, Aubrey @ 2018-11-16  0:32 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/11/15 23:18, Dave Hansen wrote:
> On 11/14/18 3:00 PM, Aubrey Li wrote:
>> +void arch_thread_state(struct seq_file *m, struct task_struct *task)
>> +{
>> +	/*
>> +	 * Report AVX-512 Hi16_ZMM registers usage
>> +	 */
>> +	if (task->thread.fpu.hi16zmm_usage)
>> +		seq_putc(m, '1');
>> +	else
>> +		seq_putc(m, '0');
>> +	seq_putc(m, '\n');
>> +}
> 
> Am I reading this right that this file just dumps out a plain 0 or 1
> based on the internal kernel state?  BTW, there's no example of the
> output in the changelog, so it's rather hard to tell if my guess is
> right.  (Hint, hint).
> 
> If so, I'd really prefer you not do this.  We have /proc/$pid/stat to
> stand as a disaster in this regard.  It is essentially
> non-human-readable gibberish because it's impossible to tell what the
> values mean without a decoder ring.

Yes, I'm following the /proc/$pid/stat format, as I think this interface
is not for the end user but for developers and the user-space job
scheduler. So I guess this style might be okay.

> 
> If we go down this road, we need a file along the lines of
> /proc/$pid/status.

I checked /proc/$pid/status; it only carries information common to all
architectures. That's why I want to open a new interface for
CPU-specific state.

> 
> But, either way, this is a new ABI that we need to consider carefully.
> It needs documentation.  For instance, will this really mean "Hi16_ZMM
> user" from now until the end of time?  Or, does it just mean "group me
> with other tasks that have this bit set"?
> 
I'm open to this interface. Let's wait to see if there are more comments
and suggestions.

Thanks,
-Aubrey


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-16  0:21   ` Li, Aubrey
@ 2018-11-16  1:04     ` Dave Hansen
  2018-11-16 23:10     ` Dave Hansen
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2018-11-16  1:04 UTC (permalink / raw)
  To: Li, Aubrey, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 11/15/18 4:21 PM, Li, Aubrey wrote:
> On 2018/11/15 23:40, Dave Hansen wrote:
>> On 11/14/18 3:00 PM, Aubrey Li wrote:
>>> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
>>> frequency drop. Add per task Hi16_ZMM state tracking to context switch.
>>
>> Just curious, but is there any public documentation of this?  It seems
>> really odd to me that something using the same AVX-512 instructions on
>> some low-numbered registers would behave differently than the same
>> instructions on some high-numbered registers.  I'm not saying this is
>> wrong, but it's certainly counter-intuitive and I think that begs for
>> some more explanation.
> 
> Yes, Intel 64 and IA-32 Architectures software developer's Manual mentioned
> this in performance event CORE_POWER.LVL2_TURBO_LICENSE.
> 
> "Core cycles where the core was running with power delivery for license
> level 2 (introduced in Skylake Server microarchitecture). This includes
> high current AVX 512-bit instructions."
> 
> I translated license level 2 to frequency drop.

OK, but that talks about AVX 512 and not specifically about Hi16_ZMM's
impact which is what this patch measures.  Are the Hi16_ZMM intricacies
documented anywhere?


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-16  0:21   ` Li, Aubrey
  2018-11-16  1:04     ` Dave Hansen
@ 2018-11-16 23:10     ` Dave Hansen
  2018-11-17  0:36       ` Li, Aubrey
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2018-11-16 23:10 UTC (permalink / raw)
  To: Li, Aubrey, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 11/15/18 4:21 PM, Li, Aubrey wrote:
> "Core cycles where the core was running with power delivery for license
> level 2 (introduced in Skylake Server microarchitecture). This includes
> high current AVX 512-bit instructions."
> 
> I translated license level 2 to frequency drop.

BTW, the "high" in that text: "high-current AVX 512-bit instructions" is
talking about high-current, not "high ... instructions" or high-numbered
registers.  I think that might be the source of some of the confusion
about which XSAVE state needs to be examined.

Just to be clear: there are 3 AVX-512 XSAVE states:

        XFEATURE_OPMASK,
        XFEATURE_ZMM_Hi256,
        XFEATURE_Hi16_ZMM,

I honestly don't know what XFEATURE_OPMASK does.  It does not appear to
be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't
looking too great).

But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the
registers ZMM0-ZMM15.  Those are AVX-512-only registers.  The only way
to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions.

XFEATURE_Hi16_ZMM is the same.  The only way to get state in there is
with AVX512 instructions.

So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and
XFEATURE_Hi16_ZMM.  That's without question.

It's probably *possible* to run AVX512 instructions by loading state
into the YMM register and then executing AVX512 instructions that only
write to memory and never to register state.  That *might* allow
XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but
for the frequency to be affected since AVX512 instructions _are_
executing.  But, there's no way to detect this situation from XSAVE
states themselves.
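A check covering both register-backed AVX-512 components, as argued
above, could be sketched like this. This is a userspace model, not
kernel code; the mask bit positions follow the SDM's XSAVE
state-component layout:

```c
#include <stdint.h>

/* XSAVE state-component bit positions (per the SDM). */
#define XFEATURE_MASK_OPMASK		(1ULL << 5)
#define XFEATURE_MASK_ZMM_Hi256		(1ULL << 6)
#define XFEATURE_MASK_Hi16_ZMM		(1ULL << 7)

/*
 * True if the xfeatures bitmap shows live state in either AVX-512
 * register component, i.e. state that only AVX-512 instructions
 * can have set.
 */
static int avx512_regs_in_use(uint64_t xfeatures)
{
	return !!(xfeatures & (XFEATURE_MASK_ZMM_Hi256 |
			       XFEATURE_MASK_Hi16_ZMM));
}
```

As noted, even this cannot catch AVX-512 instructions that only write
to memory and leave both components in the init state.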


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-16 23:10     ` Dave Hansen
@ 2018-11-17  0:36       ` Li, Aubrey
  0 siblings, 0 replies; 16+ messages in thread
From: Li, Aubrey @ 2018-11-17  0:36 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/11/17 7:10, Dave Hansen wrote:
> On 11/15/18 4:21 PM, Li, Aubrey wrote:
>> "Core cycles where the core was running with power delivery for license
>> level 2 (introduced in Skylake Server microarchitecture). This includes
>> high current AVX 512-bit instructions."
>>
>> I translated license level 2 to frequency drop.
> 
> BTW, the "high" in that text: "high-current AVX 512-bit instructions" is
> talking about high-current, not "high ... instructions" or high-numbered
> registers.  I think that might be the source of some of the confusion
> about which XSAVE state needs to be examined.
> 
> Just to be clear: there are 3 AVX-512 XSAVE states:
> 
>         XFEATURE_OPMASK,
>         XFEATURE_ZMM_Hi256,
>         XFEATURE_Hi16_ZMM,
> 
> I honestly don't know what XFEATURE_OPMASK does.  It does not appear to
> be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't
> looking too great).
> 
> But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the
> registers ZMM0-ZMM15.  Those are AVX-512-only registers.  The only way
> to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions.
> 
> XFEATURE_Hi16_ZMM is the same.  The only way to get state in there is
> with AVX512 instructions.
> 
> So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and
> XFEATURE_Hi16_ZMM.  That's without question.

No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it is of
less interest to us.

> 
> It's probably *possible* to run AVX512 instructions by loading state
> into the YMM register and then executing AVX512 instructions that only
> write to memory and never to register state.  That *might* allow
> XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but
> for the frequency to be affected since AVX512 instructions _are_
> executing.  But, there's no way to detect this situation from XSAVE
> states themselves.
> 

Andi should have more details on this. FWICT, not all AVX512
instructions have high current; those only touching memory do not cause
a notable frequency drop.

Thanks,
-Aubrey


* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-14 23:00 ` [PATCH v3 2/2] proc: add /proc/<pid>/arch_state Aubrey Li
  2018-11-15 15:18   ` Dave Hansen
@ 2018-11-19 17:39   ` Peter Zijlstra
  2018-11-21  1:39     ` Li, Aubrey
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2018-11-19 17:39 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, mingo, hpa, ak, tim.c.chen, dave.hansen, arjan,
	linux-kernel, Aubrey Li

On Thu, Nov 15, 2018 at 07:00:07AM +0800, Aubrey Li wrote:
> Add a /proc/<pid>/arch_state interface to expose per-task cpu specific
> state values.
> 
> Exposing AVX-512 Hi16_ZMM registers usage is for the user space job
> scheduler to cluster AVX-512 using tasks together, because these tasks
> could cause core turbo frequency drop.

I still don't much like the name; how about arch_simd_state ? Also,
since we're printing an integer, I still prefer we go print the turbo
license level. I know level 1 isn't too interesting atm, but consider
future hardware widening the thing again and us growing level 3 or
something.

Also; you were going to shop around with the other architectures to see
what they want/need for this interface. I see nothing on that.


* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-19 17:39   ` Peter Zijlstra
@ 2018-11-21  1:39     ` Li, Aubrey
  2018-11-21  8:19       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Li, Aubrey @ 2018-11-21  1:39 UTC (permalink / raw)
  To: Peter Zijlstra, Aubrey Li
  Cc: tglx, mingo, hpa, ak, tim.c.chen, dave.hansen, arjan, linux-kernel

On 2018/11/20 1:39, Peter Zijlstra wrote:
> On Thu, Nov 15, 2018 at 07:00:07AM +0800, Aubrey Li wrote:
>> Add a /proc/<pid>/arch_state interface to expose per-task cpu specific
>> state values.
>>
>> Exposing AVX-512 Hi16_ZMM registers usage is for the user space job
>> scheduler to cluster AVX-512 using tasks together, because these tasks
>> could cause core turbo frequency drop.
> 
> I still don't much like the name; how about arch_simd_state ? 

My intention is
#cat /proc/<pid>/arch_state
0 1 0

Here, the first "0" denotes simd_state, the second "1" denotes another
CPU-specific feature (which may come soon), and the list can be extended.

But sure I can limit it to simd_state and change the name in the patch.
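Parsing that intended format from user space could be sketched as
follows. This is hypothetical and simply assumes the space-separated
multi-field layout above:

```c
#include <stdio.h>

/* Parse a hypothetical "0 1 0" arch_state line into three flags. */
static int parse_arch_state(const char *line, int *simd, int *feat2, int *feat3)
{
	return sscanf(line, "%d %d %d", simd, feat2, feat3) == 3;
}
```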
>Also,
> since we're printing an integer, I still prefer we go print the turbo
> license level. I know level 1 isn't too interesting atm, but consider
> future hardware widening the thing again and us growing level 3 or
> something.
The problem is, FWICT, the bits in the XSAVE buffer are not exactly
mapped to the turbo license level, so we can't print the turbo license
level correctly, or the first patch would need rework for that.

> 
> Also; you were going to shop around with the other architectures to see
> what they want/need for this interface. I see nothing on that.
> 
I'm open for your suggestion, :)

Thanks,
-Aubrey


* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-21  1:39     ` Li, Aubrey
@ 2018-11-21  8:19       ` Peter Zijlstra
  2018-11-21  9:53         ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2018-11-21  8:19 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, mingo, hpa, ak, tim.c.chen, dave.hansen, arjan,
	linux-kernel

On Wed, Nov 21, 2018 at 09:39:00AM +0800, Li, Aubrey wrote:
> > Also; you were going to shop around with the other architectures to see
> > what they want/need for this interface. I see nothing on that.
> > 
> I'm open for your suggestion, :)

Well, we have linux-arch and the various maintainers are also listed in
MAINTAINERS. Go forth and ask..


* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-21  8:19       ` Peter Zijlstra
@ 2018-11-21  9:53         ` Peter Zijlstra
  2018-11-21 17:12           ` Palmer Dabbelt
  2018-11-22  1:40           ` Li, Aubrey
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2018-11-21  9:53 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Aubrey Li, tglx, mingo, hpa, ak, tim.c.chen, dave.hansen, arjan,
	linux-kernel

On Wed, Nov 21, 2018 at 09:19:36AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 21, 2018 at 09:39:00AM +0800, Li, Aubrey wrote:
> > > Also; you were going to shop around with the other architectures to see
> > > what they want/need for this interface. I see nothing on that.
> > > 
> > I'm open for your suggestion, :)
> 
> Well, we have linux-arch and the various maintainers are also listed in
> MAINTAINERS. Go forth and ask..

Ok, so I googled a wee bit (you could have too).

There's not that many architectures that build big hot chips
(powerpc,x86,arm64,s390) (mips, sparc64 and ia64 are pretty dead I
think, although the Fujitsu Sparc M10 X+/X SIMD looked like it could be
'fun').

Of those, powerpc altivec doesn't seem to be very wide, but you'd have
to ask the power folks. Same for s390 z13.

The Fujitsu/ARM64-SVE stuff looks like it can be big and hot.

And RISC-V has a vector extension, but I don't think anybody is
actually building big hot versions of that just yet.


* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-21  9:53         ` Peter Zijlstra
@ 2018-11-21 17:12           ` Palmer Dabbelt
  2018-11-22  1:40           ` Li, Aubrey
  1 sibling, 0 replies; 16+ messages in thread
From: Palmer Dabbelt @ 2018-11-21 17:12 UTC (permalink / raw)
  To: peterz
  Cc: aubrey.li, aubrey.li, tglx, mingo, hpa, ak, tim.c.chen,
	dave.hansen, arjan, linux-kernel

On Wed, 21 Nov 2018 01:53:50 PST (-0800), peterz@infradead.org wrote:
> On Wed, Nov 21, 2018 at 09:19:36AM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 21, 2018 at 09:39:00AM +0800, Li, Aubrey wrote:
>> > > Also; you were going to shop around with the other architectures to see
>> > > what they want/need for this interface. I see nothing on that.
>> > >
>> > I'm open for your suggestion, :)
>>
>> Well, we have linux-arch and the various maintainers are also listed in
>> MAINTAINERS. Go forth and ask..
>
> Ok, so I googled a wee bit (you could have too).
>
> There's not that many architectures that build big hot chips
> (powerpc,x86,arm64,s390) (mips, sparc64 and ia64 are pretty dead I
> think, although the Fujitsu Sparc M10 X+/X SIMD looked like it could be
> 'fun').
>
> Of those, powerpc altivec doesn't seem to be very wide, but you'd have
> to ask the power folks. Same for s390 z13.
>
> The Fujitsu/ARM64-SVE stuff looks like it can be big and hot.
>
> And RISC-V has a vector extension, but I don't think anybody is
> actually building big hot versions of that just yet.

We don't actually have a vector extension yet, but there's supposed to be a 
draft out in 2 weeks.  The plan is that this draft will be sufficiently 
long-lived that we can start software implementation work.  While I don't 
believe it's intended that hardware implementations become available using 
this draft specification, these things tend to take a life of their own.  I'd 
be pretty surprised if we don't end up seeing hardware implementations of this 
draft specification.

I don't know if they'll be big and hot, though -- the whole point of the vector 
extension is that we can build chips that aren't that big or hot :)

On a more serious note, in RISC-V land we've attempted to make mcontext 
extensible and plan on shimming all the additional architectural state into 
there.  Thus, I don't think this interface is particularly useful for us.

I also don't like this "file full of 1s and 0s" interface, but it's certainly 
not my place to shoot it down.  In RISC-V land we're trying very hard to 
carefully examine any user-visible ABI to ensure it's something we're willing 
to keep around for ever, and this doesn't seem like something that fits that 
mold.


* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-21  9:53         ` Peter Zijlstra
  2018-11-21 17:12           ` Palmer Dabbelt
@ 2018-11-22  1:40           ` Li, Aubrey
  2018-11-23 17:11             ` Dave Martin
  1 sibling, 1 reply; 16+ messages in thread
From: Li, Aubrey @ 2018-11-22  1:40 UTC (permalink / raw)
  To: Peter Zijlstra, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, Heiko Carstens,
	Catalin Marinas, Will Deacon
  Cc: Aubrey Li, tglx, mingo, hpa, ak, tim.c.chen, dave.hansen, arjan,
	linux-kernel

On 2018/11/21 17:53, Peter Zijlstra wrote:
> On Wed, Nov 21, 2018 at 09:19:36AM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 21, 2018 at 09:39:00AM +0800, Li, Aubrey wrote:
>>>> Also; you were going to shop around with the other architectures to see
>>>> what they want/need for this interface. I see nothing on that.
>>>>
>>> I'm open for your suggestion, :)
>>
>> Well, we have linux-arch and the various maintainers are also listed in
>> MAINTAINERS. Go forth and ask..
> 
> Ok, so I googled a wee bit (you could have too).
> 
> There's not that many architectures that build big hot chips
> (powerpc,x86,arm64,s390) (mips, sparc64 and ia64 are pretty dead I
> think, although the Fujitsu Sparc M10 X+/X SIMD looked like it could be
> 'fun').
> 
> Of those, powerpc altivec doesn't seem to be very wide, but you'd have
> to ask the power folks. Same for s390 z13.
> 
> The Fujitsu/ARM64-SVE stuff looks like it can be big and hot.
> 
> And RISC-V has a vector extension, but I don't think anybody is
> actually building big hot versions of that just yet.
> 
Thanks Peter. Adding more maintainers here.

On some x86 microarchitectures, tasks using SIMD instructions (AVX-512 in
particular) need to be treated differently from tasks that do not, because
AVX-512 usage can cause a core turbo frequency drop. I proposed an interface
to expose such CPU-specific information so that user space tools can apply
different scheduling policies.

The interface could be refined to use the same format as /proc/<pid>/status.
I'm not sure whether it would be useful to any other architectures.
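
As a rough illustration only (the field name "AVX512_in_use" and the exact
layout below are hypothetical, not part of this patch set), a user space
placement tool could consume such a /proc/<pid>/status-style file like this:

```python
# Sketch of a placement tool parsing a hypothetical /proc/<pid>/arch_state
# file that uses the "Key:\tValue" line format of /proc/<pid>/status.
# The "AVX512_in_use" field name is illustrative, not a real kernel field.

def parse_arch_state(text):
    """Parse 'Key: Value' lines into a dict, like /proc/<pid>/status."""
    state = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            state[key.strip()] = value.strip()
    return state

def uses_avx512(text):
    """Return True if the (hypothetical) AVX-512 usage flag is set."""
    return parse_arch_state(text).get("AVX512_in_use") == "1"

if __name__ == "__main__":
    sample = "Name:\tworker\nAVX512_in_use:\t1\n"
    print(uses_avx512(sample))  # -> True
```

A scheduler daemon could then, for example, avoid placing a task for which
this returns True on the sibling of a latency-sensitive task.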

Any comments are welcome.

Thanks,
-Aubrey




* Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state
  2018-11-22  1:40           ` Li, Aubrey
@ 2018-11-23 17:11             ` Dave Martin
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Martin @ 2018-11-23 17:11 UTC (permalink / raw)
  To: Li, Aubrey
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, Heiko Carstens,
	Catalin Marinas, Will Deacon, Aubrey Li, tglx, mingo, hpa, ak,
	tim.c.chen, dave.hansen, arjan, linux-kernel

On Thu, Nov 22, 2018 at 09:40:24AM +0800, Li, Aubrey wrote:
> On 2018/11/21 17:53, Peter Zijlstra wrote:
> > On Wed, Nov 21, 2018 at 09:19:36AM +0100, Peter Zijlstra wrote:
> >> On Wed, Nov 21, 2018 at 09:39:00AM +0800, Li, Aubrey wrote:
> >>>> Also; you were going to shop around with the other architectures to see
> >>>> what they want/need for this interface. I see nothing on that.
> >>>>
> >>> I'm open for your suggestion, :)
> >>
> >> Well, we have linux-arch and the various maintainers are also listed in
> >> MAINTAINERS. Go forth and ask..
> > 
> > Ok, so I googled a wee bit (you could have too).
> > 
> > There's not that many architectures that build big hot chips
> > (powerpc,x86,arm64,s390) (mips, sparc64 and ia64 are pretty dead I
> > think, although the Fujitsu Sparc M10 X+/X SIMD looked like it could be
> > 'fun').
> > 
> > Of those, powerpc altivec doesn't seem to be very wide, but you'd have
> > to ask the power folks. Same for s390 z13.
> > 
> > The Fujitsu/ARM64-SVE stuff looks like it can be big and hot.
> > 
> > And RISC-V has a vector extension, but I don't think anybody is
> > actually building big hot versions of that just yet.
> > 
> Thanks Peter. Adding more maintainers here.
> 
> On some x86 microarchitectures, tasks using SIMD instructions (AVX-512 in
> particular) need to be treated differently from tasks that do not, because
> AVX-512 usage can cause a core turbo frequency drop. I proposed an interface
> to expose such CPU-specific information so that user space tools can apply
> different scheduling policies.
> 
> The interface could be refined to use the same format as /proc/<pid>/status.
> I'm not sure whether it would be useful to any other architectures.
> 
> Any comments are welcome.

For SVE:

We currently monitor SVE use by trapping only.  We also made an ABI
decision that a syscall throws away the task's SVE state -- this
falls out naturally from the fact that the SVE state is caller-save
for regular function calls in the AArch64 ABI.

Today there is no explicit means, analogous to VZEROUPPER, for userspace
to mark the SVE state as non-live without entering the kernel.

Currently I expose as little detail to userspace as possible regarding
how/when SVE is enabled/disabled or used.


For the /proc interface:

It would be nice to expose some information to userspace about when/
where major hardware functional units are in use, but beyond the
information already supplied by hardware perf events, it's not
obvious what should be exposed.

AFAICT, the exposed flags would be partly an arbitrary artifact of
kernel implementation details: i.e., how often and when the kernel
saves/restores the task's state may affect the pattern of observed
values in non-trivial ways.

For SVE today, a task that does a lot of syscalls may appear to be using
SVE less than a second task that does fewer syscalls but is otherwise
identical -- simply because a syscall is our only way to detect that
SVE is not in use today.


This kind of issue means that userspace may struggle to make good
decisions using this data: instead it's going to rely on some kind of
tuning which may become wrong as soon as the workload, kernel version
or hardware changes.


A /proc/<pid> file would need to be polled (which doesn't sound great)
and also suffers from all the usual /proc raciness.
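
For illustration, that polling pattern looks roughly like this from user
space (a sketch only; the per-task file name is the one proposed in this
thread, and the reader must tolerate the task exiting at any point):

```python
# Sketch: poll a proposed per-task /proc file. The task can exit between
# choosing to poll it and opening/reading the file, so a vanished file is
# treated as "no data" rather than an error. This is exactly the /proc
# raciness being discussed: the answer may be stale the moment it is read.

def poll_arch_state(pid, path_fmt="/proc/{pid}/arch_state"):
    """Read the per-task file once; return its text, or None if the
    task no longer exists (or the file was never there)."""
    try:
        with open(path_fmt.format(pid=pid)) as f:
            return f.read()
    except (FileNotFoundError, ProcessLookupError):
        # ENOENT if the task exited (or the interface is absent);
        # ESRCH can surface from /proc reads of a dying task.
        return None
```

Even when the read succeeds, a placement decision based on it races with
the task's own use of the unit, which is part of the objection above.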

Cheers
---Dave


Thread overview: 16+ messages
2018-11-14 23:00 [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
2018-11-14 23:00 ` [PATCH v3 2/2] proc: add /proc/<pid>/arch_state Aubrey Li
2018-11-15 15:18   ` Dave Hansen
2018-11-16  0:32     ` Li, Aubrey
2018-11-19 17:39   ` Peter Zijlstra
2018-11-21  1:39     ` Li, Aubrey
2018-11-21  8:19       ` Peter Zijlstra
2018-11-21  9:53         ` Peter Zijlstra
2018-11-21 17:12           ` Palmer Dabbelt
2018-11-22  1:40           ` Li, Aubrey
2018-11-23 17:11             ` Dave Martin
2018-11-15 15:40 ` [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
2018-11-16  0:21   ` Li, Aubrey
2018-11-16  1:04     ` Dave Hansen
2018-11-16 23:10     ` Dave Hansen
2018-11-17  0:36       ` Li, Aubrey
