* [PATCH v2] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
@ 2018-07-09 14:58 Xunlei Pang
  2018-07-15 23:36 ` [tip:sched/core] " tip-bot for Xunlei Pang
  2018-07-16 13:35 ` [PATCH v2] " Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Xunlei Pang @ 2018-07-09 14:58 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Tejun Heo, baoyou.xie
  Cc: linux-kernel

When users read "/proc/pid/stat", they expect the utime and stime
ratio of the current SAMPLE period, but cputime_adjust() always
calculates with the ratio of the WHOLE lifetime of the process.

This results in inaccurate utime and stime in "/proc/pid/stat". For
example, a process runs for a while with "50% usr, 0% sys", then
switches to "100% sys". For the latter period, the following is
expected:
  0.0 usr,  100.0 sys
but we get:
  10.0 usr,  90.0 sys

This patch uses the accurate per-period ratio in cputime_adjust() to
address the issue. A new task_cputime field is added to prev_cputime
to record the previous task_cputime snapshot, so that the elapsed
deltas can be used to derive the accurate ratio.
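
To make the skew concrete, here is a minimal userspace sketch of the
whole-lifetime split (illustrative only: the numbers are assumed round
values, and scale() is a simplified stand-in for the kernel's
scale_stime()):

#include <stdio.h>

/* simplified stand-in for the kernel's scale_stime() */
static unsigned long long scale(unsigned long long stime,
				unsigned long long rtime,
				unsigned long long total)
{
	return (unsigned long long)((double)stime / total * rtime);
}

int main(void)
{
	/* end of a long "50% usr, 0% sys" phase (values in ms, assumed) */
	unsigned long long ut1 = 360000, st1 = 2500, rt1 = 330000;
	/* one second later, after running pure sys */
	unsigned long long ut2 = 360000, st2 = 3500, rt2 = 331000;

	unsigned long long s1 = scale(st1, rt1, ut1 + st1), u1 = rt1 - s1;
	unsigned long long s2 = scale(st2, rt2, ut2 + st2), u2 = rt2 - s2;

	/* prints roughly "usr +88 ms, sys +912 ms": the lifetime ratio
	 * charges ~9% of a pure-sys window to usr */
	printf("window: usr +%llu ms, sys +%llu ms\n", u2 - u1, s2 - s1);
	return 0;
}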

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
---
v1->v2:
 - Rewrite the changelog.

 include/linux/sched.h         | 34 ++++++++++++------------
 include/linux/sched/cputime.h | 12 ++++++++-
 kernel/sched/cputime.c        | 61 ++++++++++++++++---------------------------
 3 files changed, 52 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..9cb76005b638 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,10 +223,27 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime:		time spent in user mode, in nanoseconds
+ * @stime:		time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+	u64				utime;
+	u64				stime;
+	unsigned long long		sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
  * @stime: time spent in system mode
+ * @cputime: previous task_cputime to calculate utime/stime
  * @lock: protects the above two fields
  *
  * Stores previous user/system time values such that we can guarantee
@@ -236,26 +253,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	u64				utime;
 	u64				stime;
+	struct task_cputime		cputime;
 	raw_spinlock_t			lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime:		time spent in user mode, in nanoseconds
- * @stime:		time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-	u64				utime;
-	u64				stime;
-	unsigned long long		sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp			utime
 #define prof_exp			stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
 	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	prev->utime = prev->stime = 0;
+	prev->cputime.utime = 0;
+	prev->cputime.stime = 0;
+	prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+	prev_cputime_clear(prev);
 	raw_spin_lock_init(&prev->lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ static u64 scale_stime(u64 stime, u64 rtime, u64 total)
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 		    u64 *ut, u64 *st)
 {
-	u64 rtime, stime, utime;
+	u64 rtime_delta, stime_delta, utime_delta;
 	unsigned long flags;
 
 	/* Serialize concurrent callers such that we can honour our guarantees */
 	raw_spin_lock_irqsave(&prev->lock, flags);
-	rtime = curr->sum_exec_runtime;
 
 	/*
 	 * This is possible under two circumstances:
-	 *  - rtime isn't monotonic after all (a bug);
+	 *  - task_cputime isn't monotonic after all (a bug);
 	 *  - we got reordered by the lock.
 	 *
 	 * In both cases this acts as a filter such that the rest of the code
 	 * can assume it is monotonic regardless of anything else.
 	 */
-	if (prev->stime + prev->utime >= rtime)
+	if (prev->cputime.utime > curr->utime ||
+	    prev->cputime.stime > curr->stime ||
+	    prev->cputime.sum_exec_runtime >= curr->sum_exec_runtime)
 		goto out;
 
-	stime = curr->stime;
-	utime = curr->utime;
+	stime_delta = curr->stime - prev->cputime.stime;
+	utime_delta = curr->utime - prev->cputime.utime;
+	rtime_delta = curr->sum_exec_runtime - prev->cputime.sum_exec_runtime;
 
 	/*
-	 * If either stime or utime are 0, assume all runtime is userspace.
-	 * Once a task gets some ticks, the monotonicy code at 'update:'
-	 * will ensure things converge to the observed ratio.
+	 * If either stime or utime increase are 0, assume all runtime
+	 * is userspace. Once a task gets some ticks, the monotonicy code
+	 * at 'update:' will ensure things converge to the observed ratio.
 	 */
-	if (stime == 0) {
-		utime = rtime;
+	if (stime_delta == 0) {
+		utime_delta = rtime_delta;
 		goto update;
 	}
 
-	if (utime == 0) {
-		stime = rtime;
+	if (utime_delta == 0) {
+		stime_delta = rtime_delta;
 		goto update;
 	}
 
-	stime = scale_stime(stime, rtime, stime + utime);
+	stime_delta = scale_stime(stime_delta, rtime_delta,
+				stime_delta + utime_delta);
+	if (stime_delta > rtime_delta)
+		stime_delta = rtime_delta;
+	utime_delta = rtime_delta - stime_delta;
 
 update:
-	/*
-	 * Make sure stime doesn't go backwards; this preserves monotonicity
-	 * for utime because rtime is monotonic.
-	 *
-	 *  utime_i+1 = rtime_i+1 - stime_i
-	 *            = rtime_i+1 - (rtime_i - utime_i)
-	 *            = (rtime_i+1 - rtime_i) + utime_i
-	 *            >= utime_i
-	 */
-	if (stime < prev->stime)
-		stime = prev->stime;
-	utime = rtime - stime;
-
-	/*
-	 * Make sure utime doesn't go backwards; this still preserves
-	 * monotonicity for stime, analogous argument to above.
-	 */
-	if (utime < prev->utime) {
-		utime = prev->utime;
-		stime = rtime - utime;
-	}
-
-	prev->stime = stime;
-	prev->utime = utime;
+	prev->cputime = *curr;
+	prev->utime += utime_delta;
+	prev->stime += stime_delta;
 out:
 	*ut = prev->utime;
 	*st = prev->stime;
-- 
2.14.1.40.g8e62ba1



* [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-09 14:58 [PATCH v2] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust() Xunlei Pang
@ 2018-07-15 23:36 ` tip-bot for Xunlei Pang
  2018-07-16 13:37   ` Peter Zijlstra
  2018-07-16 13:35 ` [PATCH v2] " Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: tip-bot for Xunlei Pang @ 2018-07-15 23:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, torvalds, linux-kernel, mingo, tj, lcapitulino, frederic,
	peterz, tglx, xlpang

Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
Gitweb:     https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
Author:     Xunlei Pang <xlpang@linux.alibaba.com>
AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 16 Jul 2018 00:28:31 +0200

sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()

When users read "/proc/pid/stat", they expect the utime and stime
ratio of the current SAMPLE period, but cputime_adjust() always
calculates with the ratio of the WHOLE lifetime of the process.

This results in inaccurate utime and stime in "/proc/pid/stat". For
example, a process runs for a while with "50% usr, 0% sys", then
switches to "100% sys". For the latter period, the following is
expected:

  0.0 usr,  100.0 sys

but we get:

  10.0 usr,  90.0 sys

This patch uses the accurate per-period ratio in cputime_adjust() to
address the issue. A new 'task_cputime' type field is added in
prev_cputime to record the previous 'task_cputime' snapshot, so that
the elapsed deltas can be used to derive the accurate ratio.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: baoyou.xie@gmail.com
Link: http://lkml.kernel.org/r/20180709145843.126583-1-xlpang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h         | 34 ++++++++++++------------
 include/linux/sched/cputime.h | 12 ++++++++-
 kernel/sched/cputime.c        | 61 ++++++++++++++++---------------------------
 3 files changed, 52 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..fedc69d4a425 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,10 +223,27 @@ extern void io_schedule_finish(int token);
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+/**
+ * struct task_cputime - collected CPU time counts
+ * @utime:		time spent in user mode, in nanoseconds
+ * @stime:		time spent in kernel mode, in nanoseconds
+ * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
+ *
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups.  Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
+ */
+struct task_cputime {
+	u64				utime;
+	u64				stime;
+	unsigned long long		sum_exec_runtime;
+};
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
  * @stime: time spent in system mode
+ * @cputime: previous task_cputime to calculate utime/stime
  * @lock: protects the above two fields
  *
  * Stores previous user/system time values such that we can guarantee
@@ -236,26 +253,11 @@ struct prev_cputime {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	u64				utime;
 	u64				stime;
+	struct task_cputime		cputime;
 	raw_spinlock_t			lock;
 #endif
 };
 
-/**
- * struct task_cputime - collected CPU time counts
- * @utime:		time spent in user mode, in nanoseconds
- * @stime:		time spent in kernel mode, in nanoseconds
- * @sum_exec_runtime:	total time spent on the CPU, in nanoseconds
- *
- * This structure groups together three kinds of CPU time that are tracked for
- * threads and thread groups.  Most things considering CPU time want to group
- * these counts together and treat all three of them in parallel.
- */
-struct task_cputime {
-	u64				utime;
-	u64				stime;
-	unsigned long long		sum_exec_runtime;
-};
-
 /* Alternate field names when used on cache expirations: */
 #define virt_exp			utime
 #define prof_exp			stime
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 53f883f5a2fd..49f8fd2564ed 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -175,10 +175,20 @@ static inline void account_group_exec_runtime(struct task_struct *tsk,
 	atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
 }
 
-static inline void prev_cputime_init(struct prev_cputime *prev)
+static inline void prev_cputime_clear(struct prev_cputime *prev)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 	prev->utime = prev->stime = 0;
+	prev->cputime.utime = 0;
+	prev->cputime.stime = 0;
+	prev->cputime.sum_exec_runtime = 0;
+#endif
+}
+
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+	prev_cputime_clear(prev);
 	raw_spin_lock_init(&prev->lock);
 #endif
 }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..a68483ee3ad7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -590,69 +590,54 @@ drop_precision:
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 		    u64 *ut, u64 *st)
 {
-	u64 rtime, stime, utime;
+	u64 rtime_delta, stime_delta, utime_delta;
 	unsigned long flags;
 
 	/* Serialize concurrent callers such that we can honour our guarantees */
 	raw_spin_lock_irqsave(&prev->lock, flags);
-	rtime = curr->sum_exec_runtime;
 
 	/*
 	 * This is possible under two circumstances:
-	 *  - rtime isn't monotonic after all (a bug);
+	 *  - task_cputime isn't monotonic after all (a bug);
 	 *  - we got reordered by the lock.
 	 *
 	 * In both cases this acts as a filter such that the rest of the code
 	 * can assume it is monotonic regardless of anything else.
 	 */
-	if (prev->stime + prev->utime >= rtime)
+	if (prev->cputime.utime > curr->utime ||
+	    prev->cputime.stime > curr->stime ||
+	    prev->cputime.sum_exec_runtime >= curr->sum_exec_runtime)
 		goto out;
 
-	stime = curr->stime;
-	utime = curr->utime;
+	stime_delta = curr->stime - prev->cputime.stime;
+	utime_delta = curr->utime - prev->cputime.utime;
+	rtime_delta = curr->sum_exec_runtime - prev->cputime.sum_exec_runtime;
 
 	/*
-	 * If either stime or utime are 0, assume all runtime is userspace.
-	 * Once a task gets some ticks, the monotonicy code at 'update:'
-	 * will ensure things converge to the observed ratio.
+	 * If either stime or utime increase are 0, assume all runtime
+	 * is userspace. Once a task gets some ticks, the monotonicy code
+	 * at 'update:' will ensure things converge to the observed ratio.
 	 */
-	if (stime == 0) {
-		utime = rtime;
+	if (stime_delta == 0) {
+		utime_delta = rtime_delta;
 		goto update;
 	}
 
-	if (utime == 0) {
-		stime = rtime;
+	if (utime_delta == 0) {
+		stime_delta = rtime_delta;
 		goto update;
 	}
 
-	stime = scale_stime(stime, rtime, stime + utime);
+	stime_delta = scale_stime(stime_delta, rtime_delta,
+				stime_delta + utime_delta);
+	if (stime_delta > rtime_delta)
+		stime_delta = rtime_delta;
+	utime_delta = rtime_delta - stime_delta;
 
 update:
-	/*
-	 * Make sure stime doesn't go backwards; this preserves monotonicity
-	 * for utime because rtime is monotonic.
-	 *
-	 *  utime_i+1 = rtime_i+1 - stime_i
-	 *            = rtime_i+1 - (rtime_i - utime_i)
-	 *            = (rtime_i+1 - rtime_i) + utime_i
-	 *            >= utime_i
-	 */
-	if (stime < prev->stime)
-		stime = prev->stime;
-	utime = rtime - stime;
-
-	/*
-	 * Make sure utime doesn't go backwards; this still preserves
-	 * monotonicity for stime, analogous argument to above.
-	 */
-	if (utime < prev->utime) {
-		utime = prev->utime;
-		stime = rtime - utime;
-	}
-
-	prev->stime = stime;
-	prev->utime = utime;
+	prev->cputime = *curr;
+	prev->utime += utime_delta;
+	prev->stime += stime_delta;
 out:
 	*ut = prev->utime;
 	*st = prev->stime;


* Re: [PATCH v2] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-09 14:58 [PATCH v2] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust() Xunlei Pang
  2018-07-15 23:36 ` [tip:sched/core] " tip-bot for Xunlei Pang
@ 2018-07-16 13:35 ` Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-07-16 13:35 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: Ingo Molnar, Frederic Weisbecker, Tejun Heo, baoyou.xie, linux-kernel

On Mon, Jul 09, 2018 at 10:58:43PM +0800, Xunlei Pang wrote:
> When users read "/proc/pid/stat", they expect the utime and stime
> ratio of the current SAMPLE period, but cputime_adjust() always
> calculates with the ratio of the WHOLE lifetime of the process.
> 
> This results in inaccurate utime and stime in "/proc/pid/stat". For
> example, a process runs for a while with "50% usr, 0% sys", then
> switches to "100% sys". For the latter period, the following is
> expected:
>   0.0 usr,  100.0 sys
> but we get:
>   10.0 usr,  90.0 sys
> 
> This patch uses the accurate per-period ratio in cputime_adjust() to
> address the issue. A new task_cputime field is added to prev_cputime
> to record the previous task_cputime snapshot, so that the elapsed
> deltas can be used to derive the accurate ratio.

This still does not explain anything...


* Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-15 23:36 ` [tip:sched/core] " tip-bot for Xunlei Pang
@ 2018-07-16 13:37   ` Peter Zijlstra
  2018-07-16 16:11     ` Frederic Weisbecker
  2018-07-16 17:41     ` Ingo Molnar
  0 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2018-07-16 13:37 UTC (permalink / raw)
  To: xlpang, tglx, frederic, lcapitulino, torvalds, linux-kernel, hpa,
	tj, mingo
  Cc: linux-tip-commits

On Sun, Jul 15, 2018 at 04:36:17PM -0700, tip-bot for Xunlei Pang wrote:
> Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
> Gitweb:     https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
> Author:     Xunlei Pang <xlpang@linux.alibaba.com>
> AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Mon, 16 Jul 2018 00:28:31 +0200
> 
> sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
> 
> When users read "/proc/pid/stat", they expect the utime and stime
> ratio of the current SAMPLE period, but cputime_adjust() always
> calculates with the ratio of the WHOLE lifetime of the process.
> 
> This results in inaccurate utime and stime in "/proc/pid/stat". For
> example, a process runs for a while with "50% usr, 0% sys", then
> switches to "100% sys". For the latter period, the following is
> expected:
> 
>   0.0 usr,  100.0 sys
> 
> but we get:
> 
>   10.0 usr,  90.0 sys
> 
> This patch uses the accurate per-period ratio in cputime_adjust() to
> address the issue. A new 'task_cputime' type field is added in
> prev_cputime to record the previous 'task_cputime' snapshot, so that
> the elapsed deltas can be used to derive the accurate ratio.

Ingo, please make this one go away. I still have no idea what the
problem is and I've not had time to reverse engineer the patch.

The previous (v1) Changelog was a pile of incoherent rambling and the
above doesn't explain anything much.

I want a clear description of the problem and a coherent explanation of
the proposed solution without having to reverse engineer the actual
patch.


* Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-16 13:37   ` Peter Zijlstra
@ 2018-07-16 16:11     ` Frederic Weisbecker
  2018-07-16 17:41     ` Ingo Molnar
  1 sibling, 0 replies; 9+ messages in thread
From: Frederic Weisbecker @ 2018-07-16 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: xlpang, tglx, lcapitulino, torvalds, linux-kernel, hpa, tj,
	mingo, linux-tip-commits

On Mon, Jul 16, 2018 at 03:37:51PM +0200, Peter Zijlstra wrote:
> On Sun, Jul 15, 2018 at 04:36:17PM -0700, tip-bot for Xunlei Pang wrote:
> > Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
> > Gitweb:     https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
> > Author:     Xunlei Pang <xlpang@linux.alibaba.com>
> > AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Mon, 16 Jul 2018 00:28:31 +0200
> > 
> > sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
> > 
> > When users read "/proc/pid/stat", they expect the utime and stime
> > ratio of the current SAMPLE period, but cputime_adjust() always
> > calculates with the ratio of the WHOLE lifetime of the process.
> > 
> > This results in inaccurate utime and stime in "/proc/pid/stat". For
> > example, a process runs for a while with "50% usr, 0% sys", then
> > switches to "100% sys". For the latter period, the following is
> > expected:
> > 
> >   0.0 usr,  100.0 sys
> > 
> > but we get:
> > 
> >   10.0 usr,  90.0 sys
> > 
> > This patch uses the accurate per-period ratio in cputime_adjust() to
> > address the issue. A new 'task_cputime' type field is added in
> > prev_cputime to record the previous 'task_cputime' snapshot, so that
> > the elapsed deltas can be used to derive the accurate ratio.
> 
> Ingo, please make this one go away. I still have no idae what the
> problem is and I've not had time to reverse engineer the patch.
> 
> The previous (v1) Changelog was a pile of incoherent rambling and the
> above doesn't explain anything much.
> 
> I want a clear description of the problem and a coherent explanation of
> the proposed solution without having to reverse engineer the actual
> patch.

I must confess I'm having trouble understanding what the real problem is.


* Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-16 13:37   ` Peter Zijlstra
  2018-07-16 16:11     ` Frederic Weisbecker
@ 2018-07-16 17:41     ` Ingo Molnar
  2018-07-17  4:08       ` Xunlei Pang
  1 sibling, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2018-07-16 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: xlpang, tglx, frederic, lcapitulino, torvalds, linux-kernel, hpa,
	tj, linux-tip-commits


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sun, Jul 15, 2018 at 04:36:17PM -0700, tip-bot for Xunlei Pang wrote:
> > Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
> > Gitweb:     https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
> > Author:     Xunlei Pang <xlpang@linux.alibaba.com>
> > AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Mon, 16 Jul 2018 00:28:31 +0200
> > 
> > sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
> > 
> > > When users read "/proc/pid/stat", they expect the utime and stime
> > > ratio of the current SAMPLE period, but cputime_adjust() always
> > > calculates with the ratio of the WHOLE lifetime of the process.
> > > 
> > > This results in inaccurate utime and stime in "/proc/pid/stat". For
> > > example, a process runs for a while with "50% usr, 0% sys", then
> > > switches to "100% sys". For the latter period, the following is
> > > expected:
> > > 
> > >   0.0 usr,  100.0 sys
> > > 
> > > but we get:
> > > 
> > >   10.0 usr,  90.0 sys
> > > 
> > > This patch uses the accurate per-period ratio in cputime_adjust() to
> > > address the issue. A new 'task_cputime' type field is added in
> > > prev_cputime to record the previous 'task_cputime' snapshot, so that
> > > the elapsed deltas can be used to derive the accurate ratio.
> 
> Ingo, please make this one go away.

Sure, I've removed it from sched/core.

Thanks,

	Ingo


* Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-16 17:41     ` Ingo Molnar
@ 2018-07-17  4:08       ` Xunlei Pang
  2018-07-23  9:21         ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Xunlei Pang @ 2018-07-17  4:08 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: tglx, frederic, lcapitulino, torvalds, linux-kernel, hpa, tj,
	linux-tip-commits

On 7/17/18 1:41 AM, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Sun, Jul 15, 2018 at 04:36:17PM -0700, tip-bot for Xunlei Pang wrote:
>>> Commit-ID:  8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
>>> Gitweb:     https://git.kernel.org/tip/8d4c00dc38a8aa30dae8402955e55e7b34e74bc8
>>> Author:     Xunlei Pang <xlpang@linux.alibaba.com>
>>> AuthorDate: Mon, 9 Jul 2018 22:58:43 +0800
>>> Committer:  Ingo Molnar <mingo@kernel.org>
>>> CommitDate: Mon, 16 Jul 2018 00:28:31 +0200
>>>
>>> sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
>>>
>>> When users read "/proc/pid/stat", they expect the utime and stime
>>> ratio of the current SAMPLE period, but cputime_adjust() always
>>> calculates with the ratio of the WHOLE lifetime of the process.
>>>
>>> This results in inaccurate utime and stime in "/proc/pid/stat". For
>>> example, a process runs for a while with "50% usr, 0% sys", then
>>> switches to "100% sys". For the latter period, the following is
>>> expected:
>>>
>>>   0.0 usr,  100.0 sys
>>>
>>> but we get:
>>>
>>>   10.0 usr,  90.0 sys
>>>
>>> This patch uses the accurate per-period ratio in cputime_adjust() to
>>> address the issue. A new 'task_cputime' type field is added in
>>> prev_cputime to record the previous 'task_cputime' snapshot, so that
>>> the elapsed deltas can be used to derive the accurate ratio.
>>
>> Ingo, please make this one go away.
> 
> Sure, I've removed it from sched/core.
> 

Hi Ingo, Peter, Frederic,

I captured some runtime data with trace_printk() to explain it; I hope
this illustrates the motivation behind my patch. Any help improving my
changelog after reading this would be appreciated.


Here are the simple trace_printk() calls I added to capture the real data:

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..b65c1f250941 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -611,6 +611,9 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 	stime = curr->stime;
 	utime = curr->utime;
 
+	if (!strncmp(current->comm, "cat", 3))
+		trace_printk("task tick-based utime %lld stime %lld, scheduler rtime %lld\n", utime, stime, rtime);
+
 	/*
 	 * If either stime or utime are 0, assume all runtime is userspace.
 	 * Once a task gets some ticks, the monotonicy code at 'update:'
@@ -651,9 +654,14 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 		stime = rtime - utime;
 	}
 
+	if (!strncmp(current->comm, "cat", 3))
+		trace_printk("result: old utime %lld stime %lld, "
+				"new utime %lld stime %lld\n",
+				prev->utime, prev->stime, utime, stime);
+
 	prev->stime = stime;
 	prev->utime = utime;
 out:
 	*ut = prev->utime;
 	*st = prev->stime;
 	raw_spin_unlock_irqrestore(&prev->lock, flags);


Using the reproducer I described in the changelog, the process runs for
a while with "50% usr, 0% sys", then switches to "100% sys". A shell
script reads its /proc/pid/stat at one-second intervals and gets:
50.0 usr,  0.0 sys
51.0 usr,  0.0 sys
50.0 usr,  0.0 sys
...
9.0 usr,  91.0 sys
9.0 usr,  91.0 sys
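
The reproducer itself was not posted in the thread; a generic sketch
that produces the described two-phase pattern might look like this
(assumed, untested):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* spin in userspace for roughly 'secs' seconds (vDSO clock_gettime) */
static void busy_for(double secs)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do
		clock_gettime(CLOCK_MONOTONIC, &t1);
	while (t1.tv_sec - t0.tv_sec + (t1.tv_nsec - t0.tv_nsec) / 1e9 < secs);
}

int main(void)
{
	int i;

	/* phase 1: ~50% usr, ~0% sys */
	for (i = 0; i < 600; i++) {
		busy_for(0.5);
		usleep(500000);
	}

	/* phase 2: ~100% sys, raw syscalls burn kernel time */
	for (;;)
		syscall(SYS_gettid);
}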


The trace data corresponds to the last sample period:
trace entry 1:
             cat-20755 [022] d...  1370.106496: cputime_adjust: task tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
             cat-20755 [022] d...  1370.106497: cputime_adjust: result: old utime 330729718142 stime 2306983867, new utime 330733635372 stime 2327067254

trace entry 2:
             cat-20773 [005] d...  1371.109825: cputime_adjust: task tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
             cat-20773 [005] d...  1371.109826: cputime_adjust: result: old utime 330733635372 stime 2327067254, new utime 330827229702 stime 3236489210

1) expected behaviour
Let's compare the two trace entries (all data below is in ns):
task tick-based utime: 362560000000->362567000000 increased 7000000
task tick-based stime: 2551000000  ->3547000000   increased 996000000
scheduler rtime:       333060702626->334063718912 increased 1003016286

The application actually runs at almost 100% sys at this point; we can
use the increases in the tick-based utime and stime to double-check:
996000000/(7000000+996000000) > 99% sys

2) the current cputime_adjust() inaccurate result
With the current cputime_adjust(), however, we get the following
adjusted utime and stime increases in this sample period:
adjusted utime: 330733635372->330827229702 increased 93594330
adjusted stime: 2327067254  ->3236489210   increased 909421956

so 909421956/(93594330+909421956) = 91% sys, as the shell script output
above shows.

3) root cause
The root cause is that the current cputime_adjust() always passes the
whole lifetime values to scale_stime() to split the whole utime and
stime. This patch instead passes the deltas accrued within the user's
sample period (see 1) above) to scale_stime() and accumulates the
results into the previously saved adjusted utime and stime, which
guarantees accurate usr and sys increases within the user's sample
period.


Thanks,
Xunlei


* Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-17  4:08       ` Xunlei Pang
@ 2018-07-23  9:21         ` Peter Zijlstra
  2018-07-24 13:28           ` Xunlei Pang
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-07-23  9:21 UTC (permalink / raw)
  To: Xunlei Pang
  Cc: Ingo Molnar, tglx, frederic, lcapitulino, torvalds, linux-kernel,
	hpa, tj, linux-tip-commits

On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
> The trace data corresponds to the last sample period:
> trace entry 1:
>              cat-20755 [022] d...  1370.106496: cputime_adjust: task tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
>              cat-20755 [022] d...  1370.106497: cputime_adjust: result: old utime 330729718142 stime 2306983867, new utime 330733635372 stime 2327067254
> 
> trace entry 2:
>              cat-20773 [005] d...  1371.109825: cputime_adjust: task tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
>              cat-20773 [005] d...  1371.109826: cputime_adjust: result: old utime 330733635372 stime 2327067254, new utime 330827229702 stime 3236489210
> 
> 1) expected behaviour
> Let's compare the two trace entries (all data below is in ns):
> task tick-based utime: 362560000000->362567000000 increased 7000000
> task tick-based stime: 2551000000  ->3547000000   increased 996000000
> scheduler rtime:       333060702626->334063718912 increased 1003016286
> 
> The application actually runs at almost 100% sys at this point; we can
> use the increases in the tick-based utime and stime to double-check:
> 996000000/(7000000+996000000) > 99% sys
> 
> 2) the current cputime_adjust() inaccurate result
> With the current cputime_adjust(), however, we get the following
> adjusted utime and stime increases in this sample period:
> adjusted utime: 330733635372->330827229702 increased 93594330
> adjusted stime: 2327067254  ->3236489210   increased 909421956
> 
> so 909421956/(93594330+909421956) = 91% sys, as the shell script output
> above shows.
> 
> 3) root cause
> The root cause is that the current cputime_adjust() always passes the
> whole lifetime values to scale_stime() to split the whole utime and
> stime. This patch instead passes the deltas accrued within the user's
> sample period (see 1) above) to scale_stime() and accumulates the
> results into the previously saved adjusted utime and stime, which
> guarantees accurate usr and sys increases within the user's sample
> period.

But why is this a problem?

Since it's sample-based, there's really nothing much you can guarantee.
What if your test program were to run in userspace for 50% of the time
but is so constructed to always be in kernel space when the tick
happens?

Then you would 'expect' it to be 50% user and 50% sys, but you're also
not getting that.

This stuff cannot be perfect, and the current code provides 'sensible'
numbers over the long run for most programs. Why muck with it?
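
Such an adversarial program might be sketched as follows (a sketch
only: it assumes HZ=1000 and that the tick is roughly aligned to
CLOCK_MONOTONIC millisecond boundaries, which is not guaranteed in
practice):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define TICK_NS 1000000ULL	/* assumes HZ=1000 */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	static char buf[1 << 20];
	int fd = open("/dev/zero", O_RDONLY);

	for (;;) {
		/* do the real work in userspace between ticks... */
		while (now_ns() % TICK_NS < TICK_NS - 100000)
			;
		/* ...but be executing in kernel mode when the tick fires */
		read(fd, buf, sizeof(buf));
	}
}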


* Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime ratio in cputime_adjust()
  2018-07-23  9:21         ` Peter Zijlstra
@ 2018-07-24 13:28           ` Xunlei Pang
  0 siblings, 0 replies; 9+ messages in thread
From: Xunlei Pang @ 2018-07-24 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, tglx, frederic, lcapitulino, torvalds, linux-kernel,
	hpa, tj, linux-tip-commits

On 7/23/18 5:21 PM, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
>> The trace data corresponds to the last sample period:
>> trace entry 1:
>>              cat-20755 [022] d...  1370.106496: cputime_adjust: task tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
>>              cat-20755 [022] d...  1370.106497: cputime_adjust: result: old utime 330729718142 stime 2306983867, new utime 330733635372 stime 2327067254
>>
>> trace entry 2:
>>              cat-20773 [005] d...  1371.109825: cputime_adjust: task tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
>>              cat-20773 [005] d...  1371.109826: cputime_adjust: result: old utime 330733635372 stime 2327067254, new utime 330827229702 stime 3236489210
>>
>> 1) expected behaviour
>> Let's compare the two trace entries (all data below is in ns):
>> task tick-based utime: 362560000000->362567000000 increased 7000000
>> task tick-based stime: 2551000000  ->3547000000   increased 996000000
>> scheduler rtime:       333060702626->334063718912 increased 1003016286
>>
>> The application actually runs at almost 100% sys at this point; we can
>> use the increases in the tick-based utime and stime to double-check:
>> 996000000/(7000000+996000000) > 99% sys
>>
>> 2) the current cputime_adjust() inaccurate result
>> With the current cputime_adjust(), however, we get the following
>> adjusted utime and stime increases in this sample period:
>> adjusted utime: 330733635372->330827229702 increased 93594330
>> adjusted stime: 2327067254  ->3236489210   increased 909421956
>>
>> so 909421956/(93594330+909421956) = 91% sys, as the shell script output
>> above shows.
>>
>> 3) root cause
>> The root cause is that the current cputime_adjust() always passes the
>> whole lifetime values to scale_stime() to split the whole utime and
>> stime. This patch instead passes the deltas accrued within the user's
>> sample period (see 1) above) to scale_stime() and accumulates the
>> results into the previously saved adjusted utime and stime, which
>> guarantees accurate usr and sys increases within the user's sample
>> period.
> 
> But why is this a problem?
> 
> Since it's sample-based, there's really nothing much you can guarantee.
> What if your test program were to run in userspace for 50% of the time
> but is so constructed to always be in kernel space when the tick
> happens?
> 
> Then you would 'expect' it to be 50% user and 50% sys, but you're also
> not getting that.
> 
> This stuff cannot be perfect, and the current code provides 'sensible'
> numbers over the long run for most programs. Why muck with it?
> 

Basically I am OK with the current implementation, except for one
scenario we've met: when the kernel went wrong for some reason and
suddenly ran 100% sys for seconds (even triggering a softlockup), the
statistics monitor didn't reflect that fact, which confused people.
One example: with our per-cgroup top, we once saw "20% usr, 80% sys"
displayed while the kernel was in fact in a busy loop (100% sys) at
that moment, and the tick-based times were of course all sys samples
in that case.


