linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
@ 2018-12-11  0:24 Aubrey Li
  2018-12-11  0:24 ` [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status Aubrey Li
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Aubrey Li @ 2018-12-11  0:24 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

User space tools which do automated task placement need information
about AVX-512 usage of tasks, because AVX-512 usage could cause core
turbo frequency drop and impact the running task on the sibling CPU.

The XSAVE hardware structure has bits that indicate when valid state
is present in registers unique to AVX-512 use.  Use these bits to
indicate when AVX-512 has been in use and add per-task AVX-512 state
tracking to context switch.

The tracking turns on the usage flag at the next context switch of
the task, but requires 3 consecutive context switches with no usage
to clear it. This decay is required because well-written AVX-512
applications are expected to clear this state when not actively using
AVX-512 registers.

Although this mechanism is imprecise and can theoretically have both
false-positives and false-negatives, it has been measured to be precise
enough to be useful under real-world workloads like tensorflow and linpack.

If higher precision is required, suggest user space tools to use the
PMU-based mechanisms in combination.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/include/asm/fpu/internal.h | 22 ++++++++++++++++++++++
 arch/x86/include/asm/fpu/types.h    |  8 ++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a1e37a..0da74d63ba14 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -275,6 +275,27 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define	AVX512_STATE_DECAY_COUNT	3
+/*
+ * This function is called during context switch to update AVX512 state
+ */
+static inline void update_avx512_state(struct fpu *fpu)
+{
+	/*
+	 * AVX512 state is tracked here because its use is known to slow
+	 * the max clock speed of the core.
+	 *
+	 * However, AVX512-using tasks are expected to clear this state when
+	 * not actively using these registers. Thus, this tracking mechanism
+	 * can miss. To ensure that false-negatives do not immediately show
+	 * up, decay the usage count over time.
+	 */
+	if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+		fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
+	else if (fpu->avx512_usage)
+		fpu->avx512_usage--;
+}
+
 /*
  * This function is called only during boot time when x86 caps are not set
  * up and alternative can not be used yet.
@@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
 		copy_xregs_to_kernel(&fpu->state.xsave);
+		update_avx512_state(fpu);
 		return 1;
 	}
 
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c53918ecf..313b134d3ca3 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -302,6 +302,14 @@ struct fpu {
 	 */
 	unsigned char			initialized;
 
+	/*
+	 * @avx512_usage:
+	 *
+	 * Records the usage of AVX512 registers. A value of non-zero is used
+	 * to indicate whether these AVX512 registers recently had valid state.
+	 */
+	unsigned char			avx512_usage;
+
 	/*
 	 * @state:
 	 *
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status
  2018-12-11  0:24 [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
@ 2018-12-11  0:24 ` Aubrey Li
  2018-12-11 17:57   ` Tim Chen
  2018-12-11 17:18 ` [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Aubrey Li @ 2018-12-11  0:24 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

AVX-512 components usage could cause core turbo frequency drop. So
it's useful to expose AVX-512 components usage as a heuristic hint
for the user space job scheduler to cluster the AVX-512 using tasks
together.

Example:
$ cat /proc/pid/status | grep AVX512_hint
AVX512_hint:   1

The hint number '0' indicates the task recently didn't use AVX-512
components thus unlikely has frequency drop issue. And the number '1'
indicates the task recently used AVX-512 components thus could cause
core frequency drop. User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

     3,251,565,961      core_power.lvl2_turbo_license

       1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 19 +++++++++++++++++++
 fs/proc/array.c              |  5 +++++
 2 files changed, 24 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7642d3..98baa47c97b0 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,7 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1245,3 +1246,21 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+/*
+ * Report CPU specific thread state
+ */
+void arch_task_state(struct seq_file *m, struct task_struct *task)
+{
+	/*
+	 * Check the processor and build option if AVX512 is supported.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
+		return;
+	/*
+	 * Report AVX-512 components usage:
+	 */
+	seq_put_decimal_ull(m, "AVX512_hint:\t",
+				task->thread.fpu.avx512_usage ? 1 : 0);
+	seq_putc(m, '\n');
+}
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 0ceb3b6b37e7..dd88c2219f08 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -392,6 +392,10 @@ static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm)
 	seq_putc(m, '\n');
 }
 
+void __weak arch_task_state(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -414,6 +418,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_task_state(m, task);
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11  0:24 [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
  2018-12-11  0:24 ` [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status Aubrey Li
@ 2018-12-11 17:18 ` Dave Hansen
  2018-12-11 17:52   ` Tim Chen
                     ` (2 more replies)
  2018-12-11 17:20 ` Dave Hansen
  2018-12-12 16:55 ` David Laight
  3 siblings, 3 replies; 15+ messages in thread
From: Dave Hansen @ 2018-12-11 17:18 UTC (permalink / raw)
  To: Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel, Aubrey Li

On 12/10/18 4:24 PM, Aubrey Li wrote:
> The tracking turns on the usage flag at the next context switch of
> the task, but requires 3 consecutive context switches with no usage
> to clear it. This decay is required because well-written AVX-512
> applications are expected to clear this state when not actively using
> AVX-512 registers.

One concern about this:  Given a HZ=1000 system, this means that the
flag needs to get scanned every ~3ms.  That's a pretty good amount of
scanning on a system with hundreds or thousands of tasks running around.

How many tasks does this scale to until you're eating up an entire CPU
or two just scanning /proc?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11  0:24 [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
  2018-12-11  0:24 ` [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status Aubrey Li
  2018-12-11 17:18 ` [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
@ 2018-12-11 17:20 ` Dave Hansen
  2018-12-12  0:34   ` Li, Aubrey
  2018-12-12 16:55 ` David Laight
  3 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2018-12-11 17:20 UTC (permalink / raw)
  To: Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel, Aubrey Li

 to update AVX512 state
> + */
> +static inline void update_avx512_state(struct fpu *fpu)
> +{
> +	/*
> +	 * AVX512 state is tracked here because its use is known to slow
> +	 * the max clock speed of the core.
> +	 *
> +	 * However, AVX512-using tasks are expected to clear this state when
> +	 * not actively using these registers. Thus, this tracking mechanism
> +	 * can miss. To ensure that false-negatives do not immediately show
> +	 * up, decay the usage count over time.
> +	 */
> +	if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
> +		fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
> +	else if (fpu->avx512_usage)
> +		fpu->avx512_usage--;
> +}
> +
>  /*
>   * This function is called only during boot time when x86 caps are not set
>   * up and alternative can not be used yet.
> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
>  {
>  	if (likely(use_xsave())) {
>  		copy_xregs_to_kernel(&fpu->state.xsave);
> +		update_avx512_state(fpu);
>  		return 1;
>  	}


Is there a reason we shouldn't do:

	if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
		update_avx512_state(fpu);

?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11 17:18 ` [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
@ 2018-12-11 17:52   ` Tim Chen
  2018-12-11 17:53   ` Andi Kleen
  2018-12-11 23:46   ` Li, Aubrey
  2 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2018-12-11 17:52 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, arjan, linux-kernel, Aubrey Li



On 12/11/18 9:18 AM, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
>> The tracking turns on the usage flag at the next context switch of
>> the task, but requires 3 consecutive context switches with no usage
>> to clear it. This decay is required because well-written AVX-512
>> applications are expected to clear this state when not actively using
>> AVX-512 registers.
> 
> One concern about this:  Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms.  That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
> 
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?
> 

The reason why I believe that an exponential decay average, or some
other kind of weighted decay of the AVX512 usage is better.

Tim


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11 17:18 ` [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
  2018-12-11 17:52   ` Tim Chen
@ 2018-12-11 17:53   ` Andi Kleen
  2018-12-11 23:46   ` Li, Aubrey
  2 siblings, 0 replies; 15+ messages in thread
From: Andi Kleen @ 2018-12-11 17:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Aubrey Li, tglx, mingo, peterz, hpa, tim.c.chen, arjan,
	linux-kernel, Aubrey Li

On Tue, Dec 11, 2018 at 09:18:25AM -0800, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
> > The tracking turns on the usage flag at the next context switch of
> > the task, but requires 3 consecutive context switches with no usage
> > to clear it. This decay is required because well-written AVX-512
> > applications are expected to clear this state when not actively using
> > AVX-512 registers.
> 
> One concern about this:  Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms.  That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
> 
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?

Yes that's why we may need to propagate it to cgroups in the kernel,
because user daemons don't really want to track every TID.

But per pid is a start so that people can start experimenting with this.

Then with some experience fancier interfaces for it can be implemented.

-Andi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status
  2018-12-11  0:24 ` [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status Aubrey Li
@ 2018-12-11 17:57   ` Tim Chen
  0 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2018-12-11 17:57 UTC (permalink / raw)
  To: Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, dave.hansen, arjan, linux-kernel, Aubrey Li



On 12/10/18 4:24 PM, Aubrey Li wrote:
> AVX-512 components usage could cause core turbo frequency drop. So
> it's useful to expose AVX-512 components usage as a heuristic hint
> for the user space job scheduler to cluster the AVX-512 using tasks
> together.
> 
> Example:
> $ cat /proc/pid/status | grep AVX512_hint
> AVX512_hint:   1
> 
> The hint number '0' indicates the task recently didn't use AVX-512
> components thus unlikely has frequency drop issue. And the number '1'
> indicates the task recently used AVX-512 components thus could cause
> core frequency drop. User space tools may want to further check by:
> 
> $ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1
> 
>  Performance counter stats for process id '3558':
> 
>      3,251,565,961      core_power.lvl2_turbo_license
> 
>        1.004031387 seconds time elapsed
> 
> Non-zero counter value confirms that the task causes frequency drop.
> 
> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  arch/x86/kernel/fpu/xstate.c | 19 +++++++++++++++++++
>  fs/proc/array.c              |  5 +++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 87a57b7642d3..98baa47c97b0 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -7,6 +7,7 @@
>  #include <linux/cpu.h>
>  #include <linux/mman.h>
>  #include <linux/pkeys.h>
> +#include <linux/seq_file.h>
>  
>  #include <asm/fpu/api.h>
>  #include <asm/fpu/internal.h>
> @@ -1245,3 +1246,21 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
>  
>  	return 0;
>  }
> +
> +/*
> + * Report CPU specific thread state
> + */
> +void arch_task_state(struct seq_file *m, struct task_struct *task)
> +{
> +	/*
> +	 * Check the processor and build option if AVX512 is supported.
> +	 */
> +	if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
> +		return;
> +	/*
> +	 * Report AVX-512 components usage:
> +	 */
> +	seq_put_decimal_ull(m, "AVX512_hint:\t",
> +				task->thread.fpu.avx512_usage ? 1 : 0);

I believe you need some kind of documentation of this new interface
in kernel's Documentation.  So an admin doesn't need to look into the
kernel code to figure out what this /proc/pid/status is supposed to do.

Tim



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11 17:18 ` [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
  2018-12-11 17:52   ` Tim Chen
  2018-12-11 17:53   ` Andi Kleen
@ 2018-12-11 23:46   ` Li, Aubrey
  2018-12-12  0:14     ` Arjan van de Ven
  2 siblings, 1 reply; 15+ messages in thread
From: Li, Aubrey @ 2018-12-11 23:46 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/12/12 1:18, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
>> The tracking turns on the usage flag at the next context switch of
>> the task, but requires 3 consecutive context switches with no usage
>> to clear it. This decay is required because well-written AVX-512
>> applications are expected to clear this state when not actively using
>> AVX-512 registers.
> 
> One concern about this:  Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms.  That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
> 
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?
> 

Do we have a real requirement to do this in practical environment?
AFAIK, 1s or even 5s is good enough in some customers environment.

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11 23:46   ` Li, Aubrey
@ 2018-12-12  0:14     ` Arjan van de Ven
  2018-12-12  0:59       ` Li, Aubrey
  0 siblings, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2018-12-12  0:14 UTC (permalink / raw)
  To: Li, Aubrey, Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, linux-kernel

On 12/11/2018 3:46 PM, Li, Aubrey wrote:
> On 2018/12/12 1:18, Dave Hansen wrote:
>> On 12/10/18 4:24 PM, Aubrey Li wrote:
>>> The tracking turns on the usage flag at the next context switch of
>>> the task, but requires 3 consecutive context switches with no usage
>>> to clear it. This decay is required because well-written AVX-512
>>> applications are expected to clear this state when not actively using
>>> AVX-512 registers.
>>
>> One concern about this:  Given a HZ=1000 system, this means that the
>> flag needs to get scanned every ~3ms.  That's a pretty good amount of
>> scanning on a system with hundreds or thousands of tasks running around.
>>
>> How many tasks does this scale to until you're eating up an entire CPU
>> or two just scanning /proc?
>>
> 
> Do we have a real requirement to do this in practical environment?
> AFAIK, 1s or even 5s is good enough in some customers environment.

maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
time we found the task to use avx? (need to find a good time unit)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11 17:20 ` Dave Hansen
@ 2018-12-12  0:34   ` Li, Aubrey
  2018-12-12  0:39     ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Li, Aubrey @ 2018-12-12  0:34 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/12/12 1:20, Dave Hansen wrote:
>  to update AVX512 state
>> + */
>> +static inline void update_avx512_state(struct fpu *fpu)
>> +{
>> +	/*
>> +	 * AVX512 state is tracked here because its use is known to slow
>> +	 * the max clock speed of the core.
>> +	 *
>> +	 * However, AVX512-using tasks are expected to clear this state when
>> +	 * not actively using these registers. Thus, this tracking mechanism
>> +	 * can miss. To ensure that false-negatives do not immediately show
>> +	 * up, decay the usage count over time.
>> +	 */
>> +	if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
>> +		fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
>> +	else if (fpu->avx512_usage)
>> +		fpu->avx512_usage--;
>> +}
>> +
>>  /*
>>   * This function is called only during boot time when x86 caps are not set
>>   * up and alternative can not be used yet.
>> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
>>  {
>>  	if (likely(use_xsave())) {
>>  		copy_xregs_to_kernel(&fpu->state.xsave);
>> +		update_avx512_state(fpu);
>>  		return 1;
>>  	}
> 
> 
> Is there a reason we shouldn't do:
> 
> 	if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
> 		update_avx512_state(fpu);
> 
> ?
> 

Why _!_ ?



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-12  0:34   ` Li, Aubrey
@ 2018-12-12  0:39     ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2018-12-12  0:39 UTC (permalink / raw)
  To: Li, Aubrey, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 12/11/18 4:34 PM, Li, Aubrey wrote:
>> Is there a reason we shouldn't do:
>>
>> 	if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
>> 		update_avx512_state(fpu);
>>
>> ?
>>
> Why _!_ ?

Sorry, got it backwards.  I think I was considering having you do a

	if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
		return;

inside of update_avx512_state(), but I got the state mixed up in my head.

You don't need the '!'.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-12  0:14     ` Arjan van de Ven
@ 2018-12-12  0:59       ` Li, Aubrey
  2018-12-12  1:06         ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Li, Aubrey @ 2018-12-12  0:59 UTC (permalink / raw)
  To: Arjan van de Ven, Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, linux-kernel

On 2018/12/12 8:14, Arjan van de Ven wrote:
> On 12/11/2018 3:46 PM, Li, Aubrey wrote:
>> On 2018/12/12 1:18, Dave Hansen wrote:
>>> On 12/10/18 4:24 PM, Aubrey Li wrote:
>>>> The tracking turns on the usage flag at the next context switch of
>>>> the task, but requires 3 consecutive context switches with no usage
>>>> to clear it. This decay is required because well-written AVX-512
>>>> applications are expected to clear this state when not actively using
>>>> AVX-512 registers.
>>>
>>> One concern about this:  Given a HZ=1000 system, this means that the
>>> flag needs to get scanned every ~3ms.  That's a pretty good amount of
>>> scanning on a system with hundreds or thousands of tasks running around.
>>>
>>> How many tasks does this scale to until you're eating up an entire CPU
>>> or two just scanning /proc?
>>>
>>
>> Do we have a real requirement to do this in practical environment?
>> AFAIK, 1s or even 5s is good enough in some customers environment.
> 
> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
> time we found the task to use avx? (need to find a good time unit)
> 
> 
Are you suggesting kernel does not do any translation, just provide a fact
to the user space tool and let user space tool to decide how to use this info?

So how does user space tool use this timestamp in your mind?

Thanks,
-Aubrey

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-12  0:59       ` Li, Aubrey
@ 2018-12-12  1:06         ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2018-12-12  1:06 UTC (permalink / raw)
  To: Li, Aubrey, Arjan van de Ven, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, linux-kernel

On 12/11/18 4:59 PM, Li, Aubrey wrote:
>> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
>> time we found the task to use avx? (need to find a good time unit)
>>
>>
> Are you suggesting kernel does not do any translation, just provide a fact
> to the user space tool and let user space tool to decide how to use this info?
> 
> So how does user space tool use this timestamp in your mind?

Couple of things...

If we have a timestamp, we don't need a decay.  That means less tuning
and also less work at runtime in the common case.

Let's say we report milliseconds.  The app itself looks at the number
and could now say the following:

1. task A used AVX512 1ms ago, I might need to segregate it
2. task B used AVX512 10000ms ago, time to stop segregating it
3. task C used AVX512 1ms ago, and was using it 100ms ago (during the
   last scan), it's a regular AVX512 user.

That way, you don't have to *catch* tasks in the window between use and
the end of the decay period.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-11  0:24 [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
                   ` (2 preceding siblings ...)
  2018-12-11 17:20 ` Dave Hansen
@ 2018-12-12 16:55 ` David Laight
  2018-12-12 18:00   ` Andi Kleen
  3 siblings, 1 reply; 15+ messages in thread
From: David Laight @ 2018-12-12 16:55 UTC (permalink / raw)
  To: 'Aubrey Li', tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, linux-kernel, Aubrey Li

From: Aubrey Li
> Sent: 11 December 2018 00:25
> 
> User space tools which do automated task placement need information
> about AVX-512 usage of tasks, because AVX-512 usage could cause core
> turbo frequency drop and impact the running task on the sibling CPU.
> 
> The XSAVE hardware structure has bits that indicate when valid state
> is present in registers unique to AVX-512 use.  Use these bits to
> indicate when AVX-512 has been in use and add per-task AVX-512 state
> tracking to context switch.

Isn't a thread likely to clear the AVX registers at the end of a function
that uses them.
In particular this removes the massive overhead on certain cpus of
switching between two AVX modes.
So it is actually unlikely that XSAVE will need to save them at all?

As I've also said before the registers are caller saved and since
systems calls are normal function calls the application code
would have to save them across a system call.
This allows the kernel to zero the registers on system call entry
again meaning that XSAVE won't normally have to save them.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-12-12 16:55 ` David Laight
@ 2018-12-12 18:00   ` Andi Kleen
  0 siblings, 0 replies; 15+ messages in thread
From: Andi Kleen @ 2018-12-12 18:00 UTC (permalink / raw)
  To: David Laight
  Cc: 'Aubrey Li',
	tglx, mingo, peterz, hpa, tim.c.chen, dave.hansen, arjan,
	linux-kernel, Aubrey Li

> Isn't a thread likely to clear the AVX registers at the end of a function
> that uses them.
> In particular this removes the massive overhead on certain cpus of
> switching between two AVX modes.
> So it is actually unlikely that XSAVE will need to save them at all?

Only if context switches only happened on function boundaries, which
is obviously not the case.

Yes the detection mechanism is not 100% accurate, but if AVX
is used significantly it should eventually detect it. 
Think of it as a statistical sampling heuristic.

> 
> As I've also said before the registers are caller saved and since
> systems calls are normal function calls the application code
> would have to save them across a system call.
> This allows the kernel to zero the registers on system call entry
> again meaning that XSAVE won't normally have to save them.

While I agree this would be nice, the Linux system call ABI 
wasn't defined like this, so it cannot be done at this point.

-Andi


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-12-12 18:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-11  0:24 [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Aubrey Li
2018-12-11  0:24 ` [PATCH v4 2/2] proc: add AVX-512 usage to /proc/pid/status Aubrey Li
2018-12-11 17:57   ` Tim Chen
2018-12-11 17:18 ` [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks Dave Hansen
2018-12-11 17:52   ` Tim Chen
2018-12-11 17:53   ` Andi Kleen
2018-12-11 23:46   ` Li, Aubrey
2018-12-12  0:14     ` Arjan van de Ven
2018-12-12  0:59       ` Li, Aubrey
2018-12-12  1:06         ` Dave Hansen
2018-12-11 17:20 ` Dave Hansen
2018-12-12  0:34   ` Li, Aubrey
2018-12-12  0:39     ` Dave Hansen
2018-12-12 16:55 ` David Laight
2018-12-12 18:00   ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).