linux-kernel.vger.kernel.org archive mirror
* [PATCH v11 1/3] /proc/pid/status: Add support for architecture specific output
@ 2019-02-13  2:37 Aubrey Li
  2019-02-13  2:37 ` [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time Aubrey Li
  2019-02-13  2:37 ` [PATCH v11 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms Aubrey Li
  0 siblings, 2 replies; 9+ messages in thread
From: Aubrey Li @ 2019-02-13  2:37 UTC
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

The architecture specific information of running processes can be
useful to userland. Add support for examining architecture specific
process information externally.
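
For illustration only (a hypothetical sketch, not part of this patch),
an architecture can override the weak default added below by providing
its own strong definition, e.g.:

	/* Hypothetical arch override of the weak arch_proc_pid_status() */
	#include <linux/sched.h>
	#include <linux/seq_file.h>

	void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
	{
		/* Emit one "Key:\tvalue" line, like the other status fields */
		seq_printf(m, "Arch_example:\t%d\n", 0);
	}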

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/proc/array.c         | 5 +++++
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9d428d5a0ac8..ea7a981f289c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_status(m, task);
 	return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..1de9ba1b064f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
 						    int (*show)(struct seq_file *, void *),
 						    proc_write_t write,
 						    void *data);
+/* Add support for architecture specific output in /proc/pid/status */
+extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.17.1



* [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-13  2:37 [PATCH v11 1/3] /proc/pid/status: Add support for architecture specific output Aubrey Li
@ 2019-02-13  2:37 ` Aubrey Li
  2019-02-14 11:29   ` Thomas Gleixner
  2019-02-13  2:37 ` [PATCH v11 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms Aubrey Li
  1 sibling, 1 reply; 9+ messages in thread
From: Aubrey Li @ 2019-02-13  2:37 UTC
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

Use of AVX-512 components can cause the core turbo frequency to drop. So
it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster AVX-512-using tasks
together.

Example:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:      1020

The number '1020' denotes 1020 millisecond elapsed since last time
context switch the off-CPU task using AVX-512 components, thus the
task could cause core frequency drop.

Or:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:      -1

The number '-1' indicates that the task has not used AVX-512
components before, and thus is unlikely to cause a frequency drop.

User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

     3,251,565,961      core_power.lvl2_turbo_license

       1.004031387 seconds time elapsed

A non-zero counter value confirms that the task causes a frequency drop.
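
For illustration only (a hypothetical sketch, not part of this patch),
a user space job scheduler could read the new field like this:

	/* Hypothetical sketch: read AVX512_elapsed_ms for a pid */
	#include <stdio.h>
	#include <string.h>

	/*
	 * Returns the elapsed ms, -1 if AVX-512 use was never recorded,
	 * or -2 if the field is missing or unreadable.
	 */
	static long avx512_elapsed_ms(int pid)
	{
		char path[64], line[256];
		long val = -2;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/status", pid);
		f = fopen(path, "r");
		if (!f)
			return -2;
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "AVX512_elapsed_ms:", 18)) {
				sscanf(line + 18, "%ld", &val);
				break;
			}
		}
		fclose(f);
		return val;
	}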

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 42 ++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9cc108456d0b..e480a535eeb2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+/*
+ * Report the amount of time elapsed in milliseconds since the last
+ * AVX512 use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = task->thread.fpu.avx512_timestamp;
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and kernel build support it.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+}
-- 
2.17.1



* [PATCH v11 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms
  2019-02-13  2:37 [PATCH v11 1/3] /proc/pid/status: Add support for architecture specific output Aubrey Li
  2019-02-13  2:37 ` [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time Aubrey Li
@ 2019-02-13  2:37 ` Aubrey Li
  1 sibling, 0 replies; 9+ messages in thread
From: Aubrey Li @ 2019-02-13  2:37 UTC
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

AVX512_elapsed_ms has been added to /proc/<pid>/status. Document it
in Documentation/filesystems/proc.txt.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 Documentation/filesystems/proc.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..8da60ddcda7f 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -207,6 +207,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:       thread vulnerable
   voluntary_ctxt_switches:        0
   nonvoluntary_ctxt_switches:     1
+  AVX512_elapsed_ms:	1020
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +225,7 @@ asynchronous manner and the value may not be very precise. To see a precise
 snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..............................................................................
  Field                       Content
  Name                        filename of the executable
@@ -289,6 +290,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms           time elapsed since last AVX512 use in milliseconds
 ..............................................................................
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
-- 
2.17.1



* Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-13  2:37 ` [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time Aubrey Li
@ 2019-02-14 11:29   ` Thomas Gleixner
  2019-02-15  4:35     ` Li, Aubrey
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2019-02-14 11:29 UTC
  To: Aubrey Li
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

On Wed, 13 Feb 2019, Aubrey Li wrote:

> Use of AVX-512 components can cause the core turbo frequency to drop. So
> it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
> for the user space job scheduler to cluster AVX-512-using tasks
> together.
> 
> Example:
> $ cat /proc/pid/status | grep AVX512_elapsed_ms
> AVX512_elapsed_ms:      1020
> 
> The number '1020' denotes 1020 millisecond elapsed since last time
> context switch the off-CPU task using AVX-512 components, thus the

I know what you are trying to say, but this sentence does not parse. So
what you want to say is:

  This means that 1020 milliseconds have elapsed since the AVX512 usage of
  the task was detected when the task was scheduled out.

Aside from that, 1020ms is hardly evidence for real AVX512 usage, so you want
to come up with a better example than that.

But that makes me think about the usefulness of this hint in general.

An AVX512-using task which runs alone on a CPU is going to have either no
AVX512 usage recorded at all, or the time elapsed since the last recording
is absurdly long. IOW, this needs crystal ball magic to decode because
there is no correlation between that elapsed time and the time when the
last context switch happened simply because that time is not available in
/proc/$PID/status. Sure you can oracle it out from /proc/$PID/stat with
even more crystal ball magic, but there is no explanation at all.

There may be use case scenarios where this crystal ball prediction is
actually useful, but the inaccuracy of that information and the possible
pitfalls for any user space application which uses it need to be documented
in detail. Without that, this is going to cause more trouble and confusion
than benefit.

Thanks,

	tglx


* Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-14 11:29   ` Thomas Gleixner
@ 2019-02-15  4:35     ` Li, Aubrey
  2019-02-16 12:55       ` Thomas Gleixner
  0 siblings, 1 reply; 9+ messages in thread
From: Li, Aubrey @ 2019-02-15  4:35 UTC
  To: Thomas Gleixner
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

On 2019/2/14 19:29, Thomas Gleixner wrote:
> On Wed, 13 Feb 2019, Aubrey Li wrote:
> 
>> Use of AVX-512 components can cause the core turbo frequency to drop. So
>> it's useful to expose the AVX-512 usage elapsed time as a heuristic hint
>> for the user space job scheduler to cluster AVX-512-using tasks
>> together.
>>
>> Example:
>> $ cat /proc/pid/status | grep AVX512_elapsed_ms
>> AVX512_elapsed_ms:      1020
>>
>> The number '1020' denotes 1020 millisecond elapsed since last time
>> context switch the off-CPU task using AVX-512 components, thus the
> 
> I know what you are trying to say, but this sentence does not parse. So
> what you want to say is:
> 
>   This means that 1020 milliseconds have elapsed since the AVX512 usage of
>   the task was detected when the task was scheduled out.

Thanks, will refine this.

> 
> Aside from that, 1020ms is hardly evidence for real AVX512 usage, so you want
> to come up with a better example than that.

Oh, I wrote a simple benchmark that loops {AVX ops for a while, then
non-AVX ops for a while}, so this is expected. Yeah, I should use real
AVX512 usage. Below is output from tensorflow training a neural network
model to classify images (HZ = 250 on my side). Will change to use this
example.

$ while [ 1 ]; do cat /proc/83226/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	0
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	0
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	4

> 
> But that makes me think about the usefulness of this hint in general.
> 
> An AVX512-using task which runs alone on a CPU is going to have either no
> AVX512 usage recorded at all, or the time elapsed since the last recording
> is absurdly long.

I did an experiment on this, please correct me if I'm wrong.

I isolated CPU103 and ran an AVX512 micro benchmark (spinning AVX512 ops) on it.

$ cat /proc/cmdline 
root=UUID=e6503b72-57d7-433a-ab09-a4b9a39e9128 ro isolcpus=103

I still saw context switches:
aubrey@aubrey-skl:~$ sudo trace-cmd report --cpu 103
cpus=104
        avx_demo-6985  [103]  5055.442432: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5055.442434: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5059.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5059.442432: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5063.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5063.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5067.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5067.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]

It looks like some kernel threads still participate in context switches on
the isolated CPU, like the one above: each CPU has one migration daemon to
do migration jobs.

Under this scenario, the elapsed time becomes longer than normal indeed, see below:

$ while [ 1 ]; do cat /proc/6985/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:	3432
AVX512_elapsed_ms:	440
AVX512_elapsed_ms:	1444
AVX512_elapsed_ms:	2448
AVX512_elapsed_ms:	3456
AVX512_elapsed_ms:	460
AVX512_elapsed_ms:	1464
AVX512_elapsed_ms:	2468

But AFAIK, google's Heracles does 15s polling, so this worst case is still acceptable?

> IOW, this needs crystal ball magic to decode because
> there is no correlation between that elapsed time and the time when the
> last context switch happened simply because that time is not available in
> /proc/$PID/status. Sure you can oracle it out from /proc/$PID/stat with
> even more crystal ball magic, but there is no explanation at all.
> 
> There may be use case scenarios where this crystal ball prediction is
> actually useful, but the inaccuracy of that information and the possible
> pitfalls for any user space application which uses it need to be documented
> in detail. Without that, this is going to cause more trouble and confusion
> than benefit.
> 
Not sure if the above experiment addressed your concern, please correct me if
I totally misunderstood.

Thanks,
-Aubrey


* Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-15  4:35     ` Li, Aubrey
@ 2019-02-16 12:55       ` Thomas Gleixner
  2019-02-16 17:05         ` Li, Aubrey
  2019-02-20 15:35         ` David Laight
  0 siblings, 2 replies; 9+ messages in thread
From: Thomas Gleixner @ 2019-02-16 12:55 UTC
  To: Li, Aubrey
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

On Fri, 15 Feb 2019, Li, Aubrey wrote:
> On 2019/2/14 19:29, Thomas Gleixner wrote:
> Under this scenario, the elapsed time becomes longer than normal indeed, see below:
> 
> $ while [ 1 ]; do cat /proc/6985/status | grep AVX; sleep 1; done
> AVX512_elapsed_ms:	3432
> AVX512_elapsed_ms:	440
> AVX512_elapsed_ms:	1444
> AVX512_elapsed_ms:	2448
> AVX512_elapsed_ms:	3456
> AVX512_elapsed_ms:	460
> AVX512_elapsed_ms:	1464
> AVX512_elapsed_ms:	2468
> 
> But AFAIK, google's Heracles does 15s polling, so this worst case is still acceptable?

I have no idea what Google's thingy does and you surely have to ask those
people who want to use this whether they are OK with that. I personally
think the numbers are largely useless, but I don't know the use case.

> > IOW, this needs crystal ball magic to decode because
> > there is no correlation between that elapsed time and the time when the
> > last context switch happened simply because that time is not available in
> > /proc/$PID/status. Sure you can oracle it out from /proc/$PID/stat with
> > even more crystal ball magic, but there is no explanation at all.
> > 
> > There may be use case scenarios where this crystal ball prediction is
> > actually useful, but the inaccuracy of that information and the possible
> > pitfalls for any user space application which uses it need to be documented
> > in detail. Without that, this is going to cause more trouble and confusion
> > than benefit.
> > 
> Not sure if the above experiment addressed your concern, please correct me if
> I totally misunderstood.

The above experiment just confirms what I said: The numbers are inaccurate
and potentially misleading to a large extent when the AVX using task is not
scheduled out for a longer time.

So what I'm asking for is proper documentation which explains how this
'hint' is generated in the kernel and why it can be completely inaccurate
and misleading. We don't want to end up in a situation where people start
to rely on this information and then have to go and read kernel code to
understand why the numbers do not make sense.

I'm not convinced that this interface in the current form is actually
useful. Even if you ignore the single task example, then on a loaded
machine where tasks are scheduled in and out by time slices, then the
calculation is:

	    delta = (long)(jiffies - timestamp);

delta is what you expose as elapsed_ms. Now assume that the task is seen as
using AVX when being scheduled out. So depending on the time it is
scheduled out, whether it's due lots of other tasks occupying the CPU or
due to a blocking syscall, the result can be completely misleading. The job
scheduler will see for example: 80ms ago was last AVX usage recorded and
decide that this is just an occasional usage and migrate it away. Now the
task gets on the other CPU and starts using AVX again, which makes the
scheduler see a short delta and decide to move it back.

So interpreting the value is voodoo which becomes even harder when there is
no documentation.

Thanks,

	tglx


* Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-16 12:55       ` Thomas Gleixner
@ 2019-02-16 17:05         ` Li, Aubrey
  2019-02-20 15:35         ` David Laight
  1 sibling, 0 replies; 9+ messages in thread
From: Li, Aubrey @ 2019-02-16 17:05 UTC
  To: Thomas Gleixner
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

On 2019/2/16 20:55, Thomas Gleixner wrote:
> On Fri, 15 Feb 2019, Li, Aubrey wrote:
>> On 2019/2/14 19:29, Thomas Gleixner wrote:
>> Under this scenario, the elapsed time becomes longer than normal indeed, see below:
>>
>> $ while [ 1 ]; do cat /proc/6985/status | grep AVX; sleep 1; done
>> AVX512_elapsed_ms:	3432
>> AVX512_elapsed_ms:	440
>> AVX512_elapsed_ms:	1444
>> AVX512_elapsed_ms:	2448
>> AVX512_elapsed_ms:	3456
>> AVX512_elapsed_ms:	460
>> AVX512_elapsed_ms:	1464
>> AVX512_elapsed_ms:	2468
>>
>> But AFAIK, google's Heracles does 15s polling, so this worst case is still acceptable?
> 
> I have no idea what Google's thingy does and you surely have to ask those
> people who want to use this whether they are OK with that. I personally
> think the numbers are largely useless, but I don't know the use case.
> 
>>> IOW, this needs crystal ball magic to decode because
>>> there is no correlation between that elapsed time and the time when the
>>> last context switch happened simply because that time is not available in
>>> /proc/$PID/status. Sure you can oracle it out from /proc/$PID/stat with
>>> even more crystal ball magic, but there is no explanation at all.
>>>
>>> There may be use case scenarios where this crystal ball prediction is
>>> actually useful, but the inaccuracy of that information and the possible
>>> pitfalls for any user space application which uses it need to be documented
>>> in detail. Without that, this is going to cause more trouble and confusion
>>> than benefit.
>>>
>> Not sure if the above experiment addressed your concern, please correct me if
>> I totally misunderstood.
> 
> The above experiment just confirms what I said: The numbers are inaccurate
> and potentially misleading to a large extent when the AVX using task is not
> scheduled out for a longer time.
> 
> So what I'm asking for is proper documentation which explains how this
> 'hint' is generated in the kernel and why it can be completely inaccurate
> and misleading. We don't want to end up in a situation where people start
> to rely on this information and then have to go and read kernel code to
> understand why the numbers do not make sense.
> 
> I'm not convinced that this interface in the current form is actually
> useful. Even if you ignore the single task example, then on a loaded
> machine where tasks are scheduled in and out by time slices, then the
> calculation is:
> 
> 	    delta = (long)(jiffies - timestamp);
> 
> delta is what you expose as elapsed_ms. Now assume that the task is seen as
> using AVX when being scheduled out. So depending on the time it is
> scheduled out, whether it's due lots of other tasks occupying the CPU or
> due to a blocking syscall, the result can be completely misleading. The job
> scheduler will see for example: 80ms ago was last AVX usage recorded and
> decide that this is just an occasional usage and migrate it away. Now the
> task gets on the other CPU and starts using AVX again, which makes the
> scheduler see a short delta and decide to move it back.

All in all, the AVX usage elapsed time threshold is what matters. I guess
you'd be more comfortable if the job scheduler decided to migrate a task
based on the condition AVX512_elapsed_ms > 10000 (10s) ;)

That is, every time the interface is queried, if AVX512 usage was recorded
within the last 10s, the task is classified as an AVX512 task and should be
grouped. This is reasonable because an AVX512-using task is usually a
long-running task; for example, I have an intern who trained a deep
reinforcement learning model for 21 days...
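
Something like this on the job scheduler side (an untested sketch of
the policy I mean, with a hypothetical is_avx512_task() helper):

	/* Untested sketch: classify a task by a 10s elapsed-time threshold */
	#define AVX512_THRESHOLD_MS	10000L

	/* elapsed is the AVX512_elapsed_ms value read from /proc/<pid>/status */
	static int is_avx512_task(long elapsed)
	{
		/* -1 means AVX-512 use was never recorded for this task */
		if (elapsed < 0)
			return 0;
		/* recent AVX-512 use: group it with other AVX-512 tasks */
		return elapsed <= AVX512_THRESHOLD_MS;
	}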

> 
> So interpreting the value is voodoo which becomes even harder when there is
> no documentation.

Sure, let me try to put more details in Documentation/filesystems/proc.txt

Thanks,
-Aubrey


* RE: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-16 12:55       ` Thomas Gleixner
  2019-02-16 17:05         ` Li, Aubrey
@ 2019-02-20 15:35         ` David Laight
  2019-02-20 15:38           ` Arjan van de Ven
  1 sibling, 1 reply; 9+ messages in thread
From: David Laight @ 2019-02-20 15:35 UTC
  To: 'Thomas Gleixner', Li, Aubrey
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

From: Sent: 16 February 2019 12:56
> To: Li, Aubrey
...
> The above experiment just confirms what I said: The numbers are inaccurate
> and potentially misleading to a large extent when the AVX using task is not
> scheduled out for a longer time.

Not only that, they won't detect programs that use AVX-512 but never
context switch with live AVX-512 registers.

Since the registers are all caller-saved they can be zeroed after being
saved (or used) in order to speed up any context switches.

I believe that all system calls are now done through libc 'stub'
functions (I haven't seen inlined system calls for ages).
So by the time the kernel is entered (by a system call) none of the AVX
(or even SSE2?) registers can contain live values.
This should mean it is safe for the system call entry code to
zero all the AVX registers - so none will be saved by a context switch.

Of course, if the kernel is entered by a trap or hardware interrupt,
the registers are live and do need saving.

	David




* Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-20 15:35         ` David Laight
@ 2019-02-20 15:38           ` Arjan van de Ven
  0 siblings, 0 replies; 9+ messages in thread
From: Arjan van de Ven @ 2019-02-20 15:38 UTC
  To: David Laight, 'Thomas Gleixner', Li, Aubrey
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, aubrey.li, linux-kernel

On 2/20/2019 7:35 AM, David Laight wrote:
> From: Sent: 16 February 2019 12:56
>> To: Li, Aubrey
> ...
>> The above experiment just confirms what I said: The numbers are inaccurate
>> and potentially misleading to a large extent when the AVX using task is not
>> scheduled out for a longer time.
> 
> Not only that, they won't detect programs that use AVX-512 but never
> context switch with live AVX-512 registers.


you are completely correct in stating that this approach is basically sampling
at a relatively coarse level, and such sampling will give false negatives

the alternative is not sampling, and not knowing anything at all, unless
you have a better suggestion on how to help find tasks that use avx512
in a low overhead way

(the typical use case is trying to find workloads that use avx512 to help
schedule those workloads in the future in the cloud orchestrator, for example
to help them favor machines that support avx512 over machines that don't)


Thread overview: 9+ messages
2019-02-13  2:37 [PATCH v11 1/3] /proc/pid/status: Add support for architecture specific output Aubrey Li
2019-02-13  2:37 ` [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time Aubrey Li
2019-02-14 11:29   ` Thomas Gleixner
2019-02-15  4:35     ` Li, Aubrey
2019-02-16 12:55       ` Thomas Gleixner
2019-02-16 17:05         ` Li, Aubrey
2019-02-20 15:35         ` David Laight
2019-02-20 15:38           ` Arjan van de Ven
2019-02-13  2:37 ` [PATCH v11 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms Aubrey Li
