linux-kernel.vger.kernel.org archive mirror
* [PATCH v12 1/3] /proc/pid/status: Add support for architecture specific output
@ 2019-02-21  1:17 Aubrey Li
  2019-02-21  1:17 ` [PATCH v12 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time Aubrey Li
  2019-02-21  1:17 ` [PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms Aubrey Li
  0 siblings, 2 replies; 6+ messages in thread
From: Aubrey Li @ 2019-02-21  1:17 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

Architecture-specific information about running processes can be useful
to user space. Add a hook that lets an architecture append its own
fields to /proc/<pid>/status so that the information can be examined
externally.
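
The weak-default pattern used by this patch can be illustrated in user
space. The following is only a sketch, assuming GCC or Clang on an ELF
target; arch_status() and generic_status() are hypothetical names that
stand in for arch_proc_pid_status() and proc_pid_status():

```c
#include <stdio.h>
#include <string.h>

/*
 * User-space illustration only (GCC/Clang on ELF targets assumed).
 * arch_status() stands in for arch_proc_pid_status(): the weak default
 * appends nothing, and a strong definition in another object file, if
 * one is linked in, would replace it at link time.
 */
__attribute__((weak)) void arch_status(char *buf, size_t len)
{
	(void)buf;
	(void)len;
}

/* Generic code calls the hook unconditionally, as proc_pid_status() does. */
void generic_status(char *buf, size_t len)
{
	size_t used;

	snprintf(buf, len, "voluntary_ctxt_switches:\t0\n");
	used = strlen(buf);
	arch_status(buf + used, len - used);
}
```

With only the weak default linked in, generic_status() produces the
generic line and nothing else, which matches the behaviour of
architectures that do not provide their own arch_proc_pid_status().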

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/proc/array.c         | 5 +++++
 include/linux/proc_fs.h | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9d428d5a0ac8..ea7a981f289c 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
+void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -424,6 +428,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	task_cpus_allowed(m, task);
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
+	arch_proc_pid_status(m, task);
 	return 0;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f1522a78..1de9ba1b064f 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -73,6 +73,8 @@ struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mo
 						    int (*show)(struct seq_file *, void *),
 						    proc_write_t write,
 						    void *data);
+/* Add support for architecture specific output in /proc/pid/status */
+extern void arch_proc_pid_status(struct seq_file *m, struct task_struct *task);
 
 #else /* CONFIG_PROC_FS */
 
-- 
2.17.1



* [PATCH v12 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time
  2019-02-21  1:17 [PATCH v12 1/3] /proc/pid/status: Add support for architecture specific output Aubrey Li
@ 2019-02-21  1:17 ` Aubrey Li
  2019-02-21  1:17 ` [PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms Aubrey Li
  1 sibling, 0 replies; 6+ messages in thread
From: Aubrey Li @ 2019-02-21  1:17 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

Use of AVX-512 components can cause the core turbo frequency to drop.
It is therefore useful to expose the time elapsed since AVX-512 usage
was last recorded, as a heuristic hint that lets a user space job
scheduler cluster the tasks that use AVX-512 together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/pid/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	4

This means that 4 milliseconds have elapsed since AVX-512 usage by the
tensorflow task was last detected, which happens when the task is
scheduled out.

Or:
$ cat /proc/pid/status | grep AVX512_elapsed_ms
AVX512_elapsed_ms:      -1

The value '-1' indicates that no AVX-512 usage has been recorded for the
task, so it is unlikely to cause a frequency drop.

User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

     3,251,565,961      core_power.lvl2_turbo_license

       1.004031387 seconds time elapsed

A non-zero counter value confirms that the task causes a frequency drop.
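
A user space tool could consume the field along these lines. This is a
minimal sketch and not part of the patch: parse_avx512_elapsed_ms() and
read_avx512_elapsed_ms() are hypothetical helper names, and the 8 KiB
buffer is an assumption about the size of the status file.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: scan the text of a /proc/<pid>/status file for
 * the AVX512_elapsed_ms field.
 *
 * Returns 0 and stores the value in *ms when the field is present, or
 * -1 when it is absent (for example on a kernel without this patch or
 * on a CPU without AVX-512).
 */
int parse_avx512_elapsed_ms(const char *status, long long *ms)
{
	static const char key[] = "AVX512_elapsed_ms:";
	const char *line = status;

	while (line && *line) {
		if (strncmp(line, key, sizeof(key) - 1) == 0) {
			const char *val = line + sizeof(key) - 1;
			char *end;
			long long v = strtoll(val, &end, 10);

			if (end == val)
				return -1;	/* no digits after the key */
			*ms = v;
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Hypothetical helper: read /proc/<pid>/status and extract the field. */
int read_avx512_elapsed_ms(int pid, long long *ms)
{
	char path[64];
	char buf[8192];	/* assumed to be large enough for a status file */
	size_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/status", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_avx512_elapsed_ms(buf, ms);
}
```

The parser accepts both tab and space padding after the colon, since
strtoll() skips leading white space. A stored value of -1 is the "no
usage recorded" case described above and is distinct from the function
returning -1, which means the field was not found at all.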

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 42 ++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9cc108456d0b..e480a535eeb2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,8 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1243,3 +1245,43 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+/*
+ * Report the time elapsed in milliseconds since AVX512 use was last
+ * recorded for the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = task->thread.fpu.avx512_timestamp;
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and build options support it.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+}
-- 
2.17.1



* [PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms
  2019-02-21  1:17 [PATCH v12 1/3] /proc/pid/status: Add support for architecture specific output Aubrey Li
  2019-02-21  1:17 ` [PATCH v12 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed time Aubrey Li
@ 2019-02-21  1:17 ` Aubrey Li
  2019-02-23 18:16   ` Thomas Gleixner
  1 sibling, 1 reply; 6+ messages in thread
From: Aubrey Li @ 2019-02-21  1:17 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

AVX512_elapsed_ms was added to /proc/<pid>/status. Document it
in Documentation/filesystems/proc.txt.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 Documentation/filesystems/proc.txt | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..425f2f09c9aa 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc/<pid>/map_files - Information about memory mapped files
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+  3.12	/proc/<pid>/AVX512_elapsed_ms - time elapsed since last AVX512 use
 
   4	Configuring procfs
   4.1	Mount options
@@ -207,6 +208,7 @@ read the file /proc/PID/status:
   Speculation_Store_Bypass:       thread vulnerable
   voluntary_ctxt_switches:        0
   nonvoluntary_ctxt_switches:     1
+  AVX512_elapsed_ms:	8
 
 This shows you nearly the same information you would get if you viewed it with
 the ps  command.  In  fact,  ps  uses  the  proc  file  system  to  obtain its
@@ -224,7 +226,7 @@ asynchronous manner and the value may not be very precise. To see a precise
 snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
 It's slow but very precise.
 
-Table 1-2: Contents of the status files (as of 4.19)
+Table 1-2: Contents of the status files (as of 5.1)
 ..............................................................................
  Field                       Content
  Name                        filename of the executable
@@ -289,6 +291,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
+ AVX512_elapsed_ms           time elapsed in milliseconds since last AVX512 use
 ..............................................................................
 
 Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
@@ -1948,6 +1951,29 @@ patched.  If the patch is being enabled, then the task has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12	/proc/<pid>/AVX512_elapsed_ms - time elapsed since last AVX512 use
+--------------------------------------------------------------------------
+If AVX512 is supported on the machine, this file displays time elapsed since
+last AVX512 usage of the task in millisecond.
+
+The per-task AVX512 usage tracking mechanism is added during context switch.
+When the task is scheduled out, the AVX512 timestamp of the task is tagged
+by jiffies if AVX512 usage is detected.
+
+When this interface is queried, AVX512_elapsed_ms is calculated as follows:
+
+	delta = (long)(jiffies_now - AVX512_timestamp);
+	AVX512_elapsed_ms = jiffies_to_msecs(delta);
+
+Because this tracking mechanism depends on context switch, the number of
+AVX512_elapsed_ms could be inaccurate if the AVX512 using task runs alone on
+a CPU and not scheduled out for a long time. An extreme experiment shows a
+task is spinning on the AVX512 ops on an isolated CPU, but the longest elapsed
+time is close to 4 seconds(HZ = 250).
+
+So 5s or even longer is an appropriate threshold for the job scheduler to poll
+and decide if the task should be classified as an AVX512 task and migrated
+away from the core on which a Non-AVX512 task is running.
 
 ------------------------------------------------------------------------------
 Configuring procfs
-- 
2.17.1



* Re: [PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms
  2019-02-21  1:17 ` [PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms Aubrey Li
@ 2019-02-23 18:16   ` Thomas Gleixner
  2019-02-23 18:17     ` Thomas Gleixner
  2019-02-24  1:41     ` Li, Aubrey
  0 siblings, 2 replies; 6+ messages in thread
From: Thomas Gleixner @ 2019-02-23 18:16 UTC (permalink / raw)
  To: Aubrey Li
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

On Thu, 21 Feb 2019, Aubrey Li wrote:
> @@ -45,6 +45,7 @@ Table of Contents
>    3.9   /proc/<pid>/map_files - Information about memory mapped files
>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> +  3.12	/proc/<pid>/AVX512_elapsed_ms - time elapsed since last AVX512 use

So is this a separate file now?
  
> +3.12	/proc/<pid>/AVX512_elapsed_ms - time elapsed since last AVX512 use
> +--------------------------------------------------------------------------
> +If AVX512 is supported on the machine, this file displays time elapsed since

This is not a file and this documentation wants to be where the status file
is described.

> +last AVX512 usage of the task in millisecond.

Since last usage is misleading. What you want to say is:

  The entry shows the milliseconds elapsed since the last time AVX512 usage
  was recorded.

> +The per-task AVX512 usage tracking mechanism is added during context switch.
> +When the task is scheduled out, the AVX512 timestamp of the task is tagged
> +by jiffies if AVX512 usage is detected.
> +
> +When this interface is queried, AVX512_elapsed_ms is calculated as follows:
> +
> +	delta = (long)(jiffies_now - AVX512_timestamp);
> +	AVX512_elapsed_ms = jiffies_to_msecs(delta);

This information is not really helpful for someone who wants to use that
field.

> +
> +Because this tracking mechanism depends on context switch, the number of
> +AVX512_elapsed_ms could be inaccurate if the AVX512 using task runs alone on
> +a CPU and not scheduled out for a long time. An extreme experiment shows a
> +task is spinning on the AVX512 ops on an isolated CPU, but the longest elapsed
> +time is close to 4 seconds(HZ = 250).
> +
> +So 5s or even longer is an appropriate threshold for the job scheduler to poll
> +and decide if the task should be classified as an AVX512 task and migrated
> +away from the core on which a Non-AVX512 task is running.

5 seconds or longer is appropriate? No. It really depends on the workload and
the scheduling scenarios. What the documentation has to provide is the
information that this value is a crystal ball estimate and what the reasons
are why it's inaccurate.

Something like this instead of this conglomerate of useful, irrelevant and
misleading information:

  The AVX512_elapsed_ms entry shows the milliseconds elapsed since the last
  time AVX512 usage was recorded. The recording happens on a best effort
  basis when a task is scheduled out. This means that the value depends on
  two factors:

    1) The time which the task spent on the CPU without being scheduled
       out. With CPU isolation and a single runnable task this can take
       several seconds.

    2) The time since the task was scheduled out last. Depending on the
       reason for being scheduled out (time slice exhausted, syscall ...)
       this can be an arbitrarily long time.

  As a consequence the value cannot be considered precise and authoritative
  information. The application which uses this information has to be aware
  of the overall scenario on the system in order to determine whether a
  task is a real AVX512 user or not.

See? No jiffies, no code snippets, no absolute numbers and no magic
recommendation which might be correct for your test scenario, but
completely bogus for some other scenario.

Instead it contains the things which an application programmer who wants to
use that value needs to know. He then has to map it to his scenario and
build the crystal ball logic which makes it perhaps useful.
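
A minimal sketch of what such application-side logic might look like is
given here. The names avx512_hint and avx512_classify() are
hypothetical, and window_ms is deliberately a caller-supplied parameter
because no single value is right for every workload:

```c
/* Hypothetical classification of one AVX512_elapsed_ms sample. */
enum avx512_hint {
	AVX512_HINT_NEVER,	/* kernel reported -1: no usage recorded */
	AVX512_HINT_RECENT,	/* recorded within the caller's window */
	AVX512_HINT_STALE,	/* recorded, but longer ago than the window */
};

/*
 * Map a sample to a hint. window_ms has to be chosen by the caller
 * from its knowledge of the workload and the scheduling scenario.
 */
enum avx512_hint avx512_classify(long long elapsed_ms, long long window_ms)
{
	if (elapsed_ms < 0)
		return AVX512_HINT_NEVER;
	if (elapsed_ms <= window_ms)
		return AVX512_HINT_RECENT;
	return AVX512_HINT_STALE;
}
```

Even a RECENT result stays a hint only: a task that has not been
scheduled out for a long time can report a large value while it is
still using AVX512, so the caller still needs the wider system context.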

Thanks,

	tglx




* Re: [PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms
  2019-02-23 18:16   ` Thomas Gleixner
@ 2019-02-23 18:17     ` Thomas Gleixner
  2019-02-24  1:41     ` Li, Aubrey
  1 sibling, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2019-02-23 18:17 UTC (permalink / raw)
  To: Aubrey Li
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

On Sat, 23 Feb 2019, Thomas Gleixner wrote:
> On Thu, 21 Feb 2019, Aubrey Li wrote:
> Something like this instead of this conglomerate of useful, irrelevant and
> misleading information:
> 
>   The AVX512_elapsed_ms entry shows the milliseconds elapsed since the last
>   time AVX512 usage was recorded. The recording happens on a best effort
>   basis when a task is scheduled out. This means that the value depends on
>   two factors:
> 
>     1) The time which the task spent on the CPU without being scheduled
>        out. With CPU isolation and a single runnable task this can take
>        several seconds.
> 
>     2) The time since the task was scheduled out last. Depending on the
>        reason for being scheduled out (time slice exhausted, syscall ...)
>        this can be an arbitrarily long time.
> 
>   As a consequence the value cannot be considered precise and authoritative
>   information. The application which uses this information has to be aware
>   of the overall scenario on the system in order to determine whether a
>   task is a real AVX512 user or not.
> 
> See? No jiffies, no code snippets, no absolute numbers and no magic
> recommendation which might be correct for your test scenario, but
> completely bogus for some other scenario.
> 
> Instead it contains the things which an application programmer who wants to
> use that value needs to know. He then has to map it to his scenario and
> build the crystal ball logic which makes it perhaps useful.

And of course the special value -1 needs to be documented as well....

Thanks,

	tglx


* Re: [PATCH v12 3/3] Documentation/filesystems/proc.txt: add AVX512_elapsed_ms
  2019-02-23 18:16   ` Thomas Gleixner
  2019-02-23 18:17     ` Thomas Gleixner
@ 2019-02-24  1:41     ` Li, Aubrey
  1 sibling, 0 replies; 6+ messages in thread
From: Li, Aubrey @ 2019-02-24  1:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	aubrey.li, linux-kernel

On 2019/2/24 2:16, Thomas Gleixner wrote:
> On Thu, 21 Feb 2019, Aubrey Li wrote:
>> @@ -45,6 +45,7 @@ Table of Contents
>>    3.9   /proc/<pid>/map_files - Information about memory mapped files
>>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>> +  3.12	/proc/<pid>/AVX512_elapsed_ms - time elapsed since last AVX512 use
> 
> So is this a separate file now?
>   
>> +3.12	/proc/<pid>/AVX512_elapsed_ms - time elapsed since last AVX512 use
>> +--------------------------------------------------------------------------
>> +If AVX512 is supported on the machine, this file displays time elapsed since
> 
> This is not a file and this documentation wants to be where the status file
> is described.
> 
>> +last AVX512 usage of the task in millisecond.
> 
> Since last usage is misleading. What you want to say is:
> 
>   The entry shows the milliseconds elapsed since the last time AVX512 usage
>   was recorded.
> 
>> +The per-task AVX512 usage tracking mechanism is added during context switch.
>> +When the task is scheduled out, the AVX512 timestamp of the task is tagged
>> +by jiffies if AVX512 usage is detected.
>> +
>> +When this interface is queried, AVX512_elapsed_ms is calculated as follows:
>> +
>> +	delta = (long)(jiffies_now - AVX512_timestamp);
>> +	AVX512_elapsed_ms = jiffies_to_msecs(delta);
> 
> This information is not really helpful for someone who wants to use that
> field.
> 
>> +
>> +Because this tracking mechanism depends on context switch, the number of
>> +AVX512_elapsed_ms could be inaccurate if the AVX512 using task runs alone on
>> +a CPU and not scheduled out for a long time. An extreme experiment shows a
>> +task is spinning on the AVX512 ops on an isolated CPU, but the longest elapsed
>> +time is close to 4 seconds(HZ = 250).
>> +
>> +So 5s or even longer is an appropriate threshold for the job scheduler to poll
>> +and decide if the task should be classified as an AVX512 task and migrated
>> +away from the core on which a Non-AVX512 task is running.
> 
> 5 seconds or longer is appropriate? No. It really depends on the workload and
> the scheduling scenarios. What the documentation has to provide is the
> information that this value is a crystal ball estimate and what the reasons
> are why it's inaccurate.
> 
> Something like this instead of this conglomerate of useful, irrelevant and
> misleading information:
> 
>   The AVX512_elapsed_ms entry shows the milliseconds elapsed since the last
>   time AVX512 usage was recorded. The recording happens on a best effort
>   basis when a task is scheduled out. This means that the value depends on
>   two factors:
> 
>     1) The time which the task spent on the CPU without being scheduled
>        out. With CPU isolation and a single runnable task this can take
>        several seconds.
> 
>     2) The time since the task was scheduled out last. Depending on the
>        reason for being scheduled out (time slice exhausted, syscall ...)
>        this can be an arbitrarily long time.
> 
>   As a consequence the value cannot be considered precise and authoritative
>   information. The application which uses this information has to be aware
>   of the overall scenario on the system in order to determine whether a
>   task is a real AVX512 user or not.
> 
> See? No jiffies, no code snippets, no absolute numbers and no magic
> recommendation which might be correct for your test scenario, but
> completely bogus for some other scenario.
> 
> Instead it contains the things which an application programmer who wants to
> use that value needs to know. He then has to map it to his scenario and
> build the crystal ball logic which makes it perhaps useful.

Thanks a lot, I'll try to refine it again.

Regards,
-Aubrey

