* [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
@ 2017-06-29 17:15 Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 1/5] vtime: Remove vtime_account_user() Frederic Weisbecker
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-06-29 17:15 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar, Wanpeng Li, Rik van Riel

Hi,

This is a proposition to fix
"[BUG nohz]: wrong user and system time accounting":
        http://lkml.kernel.org/r/20170323165512.60945ac6@redhat.com

I took Wanpeng Li's latest patch and built on top of it.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	sched/core

HEAD: 9c7442613755e0ee0fc915ac876d88d4d2c7385e

Thanks,
	Frederic
---

Frederic Weisbecker (4):
      vtime: Remove vtime_account_user()
      sched: Always set vtime_snap_whence after accounting vtime
      sched: Rename vtime fields
      sched: Move vtime task fields to their own struct

Wanpeng Li (1):
      sched: Accumulate vtime on top of nsec clocksource


 include/linux/init_task.h |   6 +-
 include/linux/sched.h     |  29 ++++++---
 include/linux/vtime.h     |   9 +--
 kernel/fork.c             |   6 +-
 kernel/sched/cputime.c    | 158 ++++++++++++++++++++++++++++------------------
 5 files changed, 123 insertions(+), 85 deletions(-)


* [PATCH 1/5] vtime: Remove vtime_account_user()
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
@ 2017-06-29 17:15 ` Frederic Weisbecker
  2017-06-29 23:01   ` Rik van Riel
  2017-07-05 10:28   ` [tip:sched/urgent] vtime, sched/cputime: " tip-bot for Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime Frederic Weisbecker
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-06-29 17:15 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar, Wanpeng Li, Rik van Riel

vtime_account_user() is an unnecessary layer of indirection between
vtime_user_exit() and account_user_time(). Remove it and fold its body
into vtime_user_exit().

Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/vtime.h  |  9 +--------
 kernel/sched/cputime.c | 18 +++++++++---------
 2 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 0681fe2..18b405e 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -67,19 +67,12 @@ static inline void vtime_account_system(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_account_user(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
-
-static inline void vtime_user_exit(struct task_struct *tsk)
-{
-	vtime_account_user(tsk);
-}
-
+extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
 extern void vtime_guest_exit(struct task_struct *tsk);
 extern void vtime_init_idle(struct task_struct *tsk, int cpu);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN  */
-static inline void vtime_account_user(struct task_struct *tsk) { }
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
 static inline void vtime_guest_enter(struct task_struct *tsk) { }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aea3135..5e080ca 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -724,15 +724,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_account_user(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	tsk->vtime_snap_whence = VTIME_SYS;
-	if (vtime_delta(tsk))
-		account_user_time(tsk, get_vtime_delta(tsk));
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
@@ -742,6 +733,15 @@ void vtime_user_enter(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
+void vtime_user_exit(struct task_struct *tsk)
+{
+	write_seqcount_begin(&tsk->vtime_seqcount);
+	tsk->vtime_snap_whence = VTIME_SYS;
+	if (vtime_delta(tsk))
+		account_user_time(tsk, get_vtime_delta(tsk));
+	write_seqcount_end(&tsk->vtime_seqcount);
+}
+
 void vtime_guest_enter(struct task_struct *tsk)
 {
 	/*
-- 
2.7.4


* [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 1/5] vtime: Remove vtime_account_user() Frederic Weisbecker
@ 2017-06-29 17:15 ` Frederic Weisbecker
  2017-06-29 23:01   ` Rik van Riel
  2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: Always set tsk->vtime_snap_whence " tip-bot for Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 3/5] sched: Rename vtime fields Frederic Weisbecker
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-06-29 17:15 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar, Wanpeng Li, Rik van Riel

Even though it has no functional consequences, setting the task's new
context state only after the pending vtime from the old context state
has actually been accounted makes more sense from a review perspective.

vtime_user_exit() is the only function that doesn't follow that rule,
which can puzzle a reviewer for a little while until they realize there
is no reason for this special case.

Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/cputime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5e080ca..db7ef10 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -736,9 +736,9 @@ void vtime_user_enter(struct task_struct *tsk)
 void vtime_user_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	tsk->vtime_snap_whence = VTIME_SYS;
 	if (vtime_delta(tsk))
 		account_user_time(tsk, get_vtime_delta(tsk));
+	tsk->vtime_snap_whence = VTIME_SYS;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-- 
2.7.4


* [PATCH 3/5] sched: Rename vtime fields
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 1/5] vtime: Remove vtime_account_user() Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime Frederic Weisbecker
@ 2017-06-29 17:15 ` Frederic Weisbecker
  2017-06-29 23:02   ` Rik van Riel
  2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: " tip-bot for Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 4/5] sched: Move vtime task fields to their own struct Frederic Weisbecker
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-06-29 17:15 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar, Wanpeng Li, Rik van Riel

The current "snapshot" based naming on vtime fields suggests we record
some past event but that's a low level picture of their actual purpose
which comes out blurry. The real point of these fields is to run a basic
state machine that tracks down cputime entry while switching between
contexts.

So lets reflect that with more meaningful names.
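
As an aside (an illustration, not part of this patch), most of the
transitions these fields drive follow the same shape: flush the time
pending since vtime_starttime to the bucket matching the old state, then
record the new state, all inside the seqcount write section. A
hypothetical consolidated helper would look like the sketch below; the
real code keeps dedicated helpers such as vtime_user_enter() and
vtime_user_exit(), and vtime_transition_sketch() is a made-up name:

static void vtime_transition_sketch(struct task_struct *tsk, int new_state)
{
	write_seqcount_begin(&tsk->vtime_seqcount);
	/* Flush the time accrued since vtime_starttime to the old state... */
	if (vtime_delta(tsk)) {
		if (tsk->vtime_state == VTIME_USER)
			account_user_time(tsk, get_vtime_delta(tsk));
		else if (tsk->vtime_state == VTIME_SYS)
			account_system_time(tsk, irq_count(), get_vtime_delta(tsk));
	}
	/* ...then record the new state (vtime_starttime was advanced by
	 * get_vtime_delta() if anything was flushed) */
	tsk->vtime_state = new_state;
	write_seqcount_end(&tsk->vtime_seqcount);
}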

Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/init_task.h |  4 ++--
 include/linux/sched.h     |  4 ++--
 kernel/fork.c             |  4 ++--
 kernel/sched/cputime.c    | 30 +++++++++++++++---------------
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e049526..3d53733 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -171,8 +171,8 @@ extern struct cred init_cred;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 # define INIT_VTIME(tsk)						\
 	.vtime_seqcount = SEQCNT_ZERO(tsk.vtime_seqcount),	\
-	.vtime_snap = 0,				\
-	.vtime_snap_whence = VTIME_SYS,
+	.vtime_starttime = 0,				\
+	.vtime_state = VTIME_SYS,
 #else
 # define INIT_VTIME(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f0f427..22d2d9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -689,7 +689,7 @@ struct task_struct {
 	struct prev_cputime		prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t			vtime_seqcount;
-	unsigned long long		vtime_snap;
+	unsigned long long		vtime_starttime;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive: */
 		VTIME_INACTIVE = 0,
@@ -697,7 +697,7 @@ struct task_struct {
 		VTIME_USER,
 		/* Task runs in kernelspace in a CPU with VTIME active: */
 		VTIME_SYS,
-	} vtime_snap_whence;
+	} vtime_state;
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d..83c4f9b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1638,8 +1638,8 @@ static __latent_entropy struct task_struct *copy_process(
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_init(&p->vtime_seqcount);
-	p->vtime_snap = 0;
-	p->vtime_snap_whence = VTIME_INACTIVE;
+	p->vtime_starttime = 0;
+	p->vtime_state = VTIME_INACTIVE;
 #endif
 
 #if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index db7ef10..6b152c2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -683,10 +683,10 @@ static u64 vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
 
-	if (time_before(now, (unsigned long)tsk->vtime_snap))
+	if (time_before(now, (unsigned long)tsk->vtime_starttime))
 		return 0;
 
-	return jiffies_to_nsecs(now - tsk->vtime_snap);
+	return jiffies_to_nsecs(now - tsk->vtime_starttime);
 }
 
 static u64 get_vtime_delta(struct task_struct *tsk)
@@ -701,10 +701,10 @@ static u64 get_vtime_delta(struct task_struct *tsk)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - tsk->vtime_snap);
+	delta = jiffies_to_nsecs(now - tsk->vtime_starttime);
 	other = account_other_time(delta);
-	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-	tsk->vtime_snap = now;
+	WARN_ON_ONCE(tsk->vtime_state == VTIME_INACTIVE);
+	tsk->vtime_starttime = now;
 
 	return delta - other;
 }
@@ -746,7 +746,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 {
 	/*
 	 * The flags must be updated under the lock with
-	 * the vtime_snap flush and update.
+	 * the vtime_starttime flush and update.
 	 * That enforces a right ordering and update sequence
 	 * synchronization against the reader (task_gtime())
 	 * that can thus safely catch up with a tickless delta.
@@ -776,12 +776,12 @@ void vtime_account_idle(struct task_struct *tsk)
 void arch_vtime_task_switch(struct task_struct *prev)
 {
 	write_seqcount_begin(&prev->vtime_seqcount);
-	prev->vtime_snap_whence = VTIME_INACTIVE;
+	prev->vtime_state = VTIME_INACTIVE;
 	write_seqcount_end(&prev->vtime_seqcount);
 
 	write_seqcount_begin(&current->vtime_seqcount);
-	current->vtime_snap_whence = VTIME_SYS;
-	current->vtime_snap = jiffies;
+	current->vtime_state = VTIME_SYS;
+	current->vtime_starttime = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -791,8 +791,8 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 
 	local_irq_save(flags);
 	write_seqcount_begin(&t->vtime_seqcount);
-	t->vtime_snap_whence = VTIME_SYS;
-	t->vtime_snap = jiffies;
+	t->vtime_state = VTIME_SYS;
+	t->vtime_starttime = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
@@ -809,7 +809,7 @@ u64 task_gtime(struct task_struct *t)
 		seq = read_seqcount_begin(&t->vtime_seqcount);
 
 		gtime = t->gtime;
-		if (t->vtime_snap_whence == VTIME_SYS && t->flags & PF_VCPU)
+		if (t->vtime_state == VTIME_SYS && t->flags & PF_VCPU)
 			gtime += vtime_delta(t);
 
 	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
@@ -840,7 +840,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		*stime = t->stime;
 
 		/* Task is sleeping, nothing to add */
-		if (t->vtime_snap_whence == VTIME_INACTIVE || is_idle_task(t))
+		if (t->vtime_state == VTIME_INACTIVE || is_idle_task(t))
 			continue;
 
 		delta = vtime_delta(t);
@@ -849,9 +849,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		 * Task runs either in user or kernel space, add pending nohz time to
 		 * the right place.
 		 */
-		if (t->vtime_snap_whence == VTIME_USER || t->flags & PF_VCPU)
+		if (t->vtime_state == VTIME_USER || t->flags & PF_VCPU)
 			*utime += delta;
-		else if (t->vtime_snap_whence == VTIME_SYS)
+		else if (t->vtime_state == VTIME_SYS)
 			*stime += delta;
 	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
 }
-- 
2.7.4


* [PATCH 4/5] sched: Move vtime task fields to their own struct
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2017-06-29 17:15 ` [PATCH 3/5] sched: Rename vtime fields Frederic Weisbecker
@ 2017-06-29 17:15 ` Frederic Weisbecker
  2017-06-29 23:05   ` Rik van Riel
  2017-07-05 10:29   ` [tip:sched/urgent] sched/cputime: Move the " tip-bot for Frederic Weisbecker
  2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-06-29 17:15 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar, Wanpeng Li, Rik van Riel

We are about to add vtime accumulation fields to the task struct. To
avoid bloating it further, gather the vtime information into its own
struct.

Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/init_task.h |   6 +--
 include/linux/sched.h     |  26 ++++++-----
 kernel/fork.c             |   6 +--
 kernel/sched/cputime.c    | 112 ++++++++++++++++++++++++++--------------------
 4 files changed, 86 insertions(+), 64 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 3d53733..a2f6707 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -170,9 +170,9 @@ extern struct cred init_cred;
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 # define INIT_VTIME(tsk)						\
-	.vtime_seqcount = SEQCNT_ZERO(tsk.vtime_seqcount),	\
-	.vtime_starttime = 0,				\
-	.vtime_state = VTIME_SYS,
+	.vtime.seqcount = SEQCNT_ZERO(tsk.vtime.seqcount),		\
+	.vtime.starttime = 0,						\
+	.vtime.state = VTIME_SYS,
 #else
 # define INIT_VTIME(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 22d2d9b..4a48c3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,6 +223,21 @@ struct task_cputime {
 #define prof_exp			stime
 #define sched_exp			sum_exec_runtime
 
+enum vtime_state {
+	/* Task is sleeping or running in a CPU with VTIME inactive: */
+	VTIME_INACTIVE = 0,
+	/* Task runs in userspace in a CPU with VTIME active: */
+	VTIME_USER,
+	/* Task runs in kernelspace in a CPU with VTIME active: */
+	VTIME_SYS,
+};
+
+struct vtime {
+	seqcount_t		seqcount;
+	unsigned long long	starttime;
+	enum vtime_state	state;
+};
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
@@ -688,16 +703,7 @@ struct task_struct {
 	u64				gtime;
 	struct prev_cputime		prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-	seqcount_t			vtime_seqcount;
-	unsigned long long		vtime_starttime;
-	enum {
-		/* Task is sleeping or running in a CPU with VTIME inactive: */
-		VTIME_INACTIVE = 0,
-		/* Task runs in userspace in a CPU with VTIME active: */
-		VTIME_USER,
-		/* Task runs in kernelspace in a CPU with VTIME active: */
-		VTIME_SYS,
-	} vtime_state;
+	struct vtime			vtime;
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/fork.c b/kernel/fork.c
index 83c4f9b..d927ec1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1637,9 +1637,9 @@ static __latent_entropy struct task_struct *copy_process(
 	prev_cputime_init(&p->prev_cputime);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-	seqcount_init(&p->vtime_seqcount);
-	p->vtime_starttime = 0;
-	p->vtime_state = VTIME_INACTIVE;
+	seqcount_init(&p->vtime.seqcount);
+	p->vtime.starttime = 0;
+	p->vtime.state = VTIME_INACTIVE;
 #endif
 
 #if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 6b152c2..b28d312 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -679,17 +679,17 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static u64 vtime_delta(struct task_struct *tsk)
+static u64 vtime_delta(struct vtime *vtime)
 {
 	unsigned long now = READ_ONCE(jiffies);
 
-	if (time_before(now, (unsigned long)tsk->vtime_starttime))
+	if (time_before(now, (unsigned long)vtime->starttime))
 		return 0;
 
-	return jiffies_to_nsecs(now - tsk->vtime_starttime);
+	return jiffies_to_nsecs(now - vtime->starttime);
 }
 
-static u64 get_vtime_delta(struct task_struct *tsk)
+static u64 get_vtime_delta(struct vtime *vtime)
 {
 	unsigned long now = READ_ONCE(jiffies);
 	u64 delta, other;
@@ -701,49 +701,56 @@ static u64 get_vtime_delta(struct task_struct *tsk)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - tsk->vtime_starttime);
+	delta = jiffies_to_nsecs(now - vtime->starttime);
 	other = account_other_time(delta);
-	WARN_ON_ONCE(tsk->vtime_state == VTIME_INACTIVE);
-	tsk->vtime_starttime = now;
+	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
+	vtime->starttime = now;
 
 	return delta - other;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
 {
-	account_system_time(tsk, irq_count(), get_vtime_delta(tsk));
+	account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
 }
 
 void vtime_account_system(struct task_struct *tsk)
 {
-	if (!vtime_delta(tsk))
+	struct vtime *vtime = &tsk->vtime;
+
+	if (!vtime_delta(vtime))
 		return;
 
-	write_seqcount_begin(&tsk->vtime_seqcount);
+	write_seqcount_begin(&vtime->seqcount);
 	__vtime_account_system(tsk);
-	write_seqcount_end(&tsk->vtime_seqcount);
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	if (vtime_delta(vtime))
 		__vtime_account_system(tsk);
-	tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	vtime->state = VTIME_USER;
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_user_exit(struct task_struct *tsk)
 {
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		account_user_time(tsk, get_vtime_delta(tsk));
-	tsk->vtime_snap_whence = VTIME_SYS;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	if (vtime_delta(vtime))
+		account_user_time(tsk, get_vtime_delta(vtime));
+	vtime->state = VTIME_SYS;
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_guest_enter(struct task_struct *tsk)
 {
+	struct vtime *vtime = &tsk->vtime;
 	/*
 	 * The flags must be updated under the lock with
 	 * the vtime_starttime flush and update.
@@ -751,54 +758,62 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * synchronization against the reader (task_gtime())
 	 * that can thus safely catch up with a tickless delta.
 	 */
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
+	write_seqcount_begin(&vtime->seqcount);
+	if (vtime_delta(vtime))
 		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_enter);
 
 void vtime_guest_exit(struct task_struct *tsk)
 {
-	write_seqcount_begin(&tsk->vtime_seqcount);
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
 	__vtime_account_system(tsk);
 	current->flags &= ~PF_VCPU;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	account_idle_time(get_vtime_delta(tsk));
+	account_idle_time(get_vtime_delta(&tsk->vtime));
 }
 
 void arch_vtime_task_switch(struct task_struct *prev)
 {
-	write_seqcount_begin(&prev->vtime_seqcount);
-	prev->vtime_state = VTIME_INACTIVE;
-	write_seqcount_end(&prev->vtime_seqcount);
+	struct vtime *vtime = &prev->vtime;
 
-	write_seqcount_begin(&current->vtime_seqcount);
-	current->vtime_state = VTIME_SYS;
-	current->vtime_starttime = jiffies;
-	write_seqcount_end(&current->vtime_seqcount);
+	write_seqcount_begin(&vtime->seqcount);
+	vtime->state = VTIME_INACTIVE;
+	write_seqcount_end(&vtime->seqcount);
+
+	vtime = &current->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	vtime->state = VTIME_SYS;
+	vtime->starttime = jiffies;
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_init_idle(struct task_struct *t, int cpu)
 {
+	struct vtime *vtime = &t->vtime;
 	unsigned long flags;
 
 	local_irq_save(flags);
-	write_seqcount_begin(&t->vtime_seqcount);
-	t->vtime_state = VTIME_SYS;
-	t->vtime_starttime = jiffies;
-	write_seqcount_end(&t->vtime_seqcount);
+	write_seqcount_begin(&vtime->seqcount);
+	vtime->state = VTIME_SYS;
+	vtime->starttime = jiffies;
+	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
 }
 
 u64 task_gtime(struct task_struct *t)
 {
+	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 gtime;
 
@@ -806,13 +821,13 @@ u64 task_gtime(struct task_struct *t)
 		return t->gtime;
 
 	do {
-		seq = read_seqcount_begin(&t->vtime_seqcount);
+		seq = read_seqcount_begin(&vtime->seqcount);
 
 		gtime = t->gtime;
-		if (t->vtime_state == VTIME_SYS && t->flags & PF_VCPU)
-			gtime += vtime_delta(t);
+		if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
+			gtime += vtime_delta(vtime);
 
-	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
+	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
 	return gtime;
 }
@@ -824,8 +839,9 @@ u64 task_gtime(struct task_struct *t)
  */
 void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
-	u64 delta;
+	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
+	u64 delta;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
@@ -834,25 +850,25 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 	}
 
 	do {
-		seq = read_seqcount_begin(&t->vtime_seqcount);
+		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
 		*stime = t->stime;
 
 		/* Task is sleeping, nothing to add */
-		if (t->vtime_state == VTIME_INACTIVE || is_idle_task(t))
+		if (vtime->state == VTIME_INACTIVE || is_idle_task(t))
 			continue;
 
-		delta = vtime_delta(t);
+		delta = vtime_delta(vtime);
 
 		/*
 		 * Task runs either in user or kernel space, add pending nohz time to
 		 * the right place.
 		 */
-		if (t->vtime_state == VTIME_USER || t->flags & PF_VCPU)
+		if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
 			*utime += delta;
-		else if (t->vtime_state == VTIME_SYS)
+		else if (vtime->state == VTIME_SYS)
 			*stime += delta;
-	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
+	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
2.7.4


* [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2017-06-29 17:15 ` [PATCH 4/5] sched: Move vtime task fields to their own struct Frederic Weisbecker
@ 2017-06-29 17:15 ` Frederic Weisbecker
  2017-06-29 23:27   ` Rik van Riel
                     ` (3 more replies)
  2017-06-30  1:41 ` [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Wanpeng Li
                   ` (2 subsequent siblings)
  7 siblings, 4 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-06-29 17:15 UTC (permalink / raw)
  To: LKML
  Cc: Wanpeng Li, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Ingo Molnar, Rik van Riel, Frederic Weisbecker

From: Wanpeng Li <kernellwp@gmail.com>

Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the context we are switching out of.

This scheme works fine as long as the ticks are not aligned across CPUs.
If they are aligned instead (i.e. they all fire at the same time) and the
CPUs run in userspace, the jiffies change is only observed on tick exit
and therefore the user cputime is accounted as system cputime. This is
because the CPU that maintains timekeeping fires its tick at the same
time as the others: it updates jiffies in the middle of the tick and the
other CPUs only see that update on IRQ exit:

    CPU 0 (timekeeper)                  CPU 1
    -------------------              -------------
                      jiffies = N
    ...                              run in userspace for a jiffy
    tick entry                       tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                        tick exit (sees jiffies = N + 1)
                                                account 1 jiffy as stime

Fix this by using a nanosecond clock source instead of jiffies. The
cputime is then accumulated and flushed every time the pending delta
reaches a jiffy, in order to mitigate the accounting overhead.
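
To make that rule concrete, here is a condensed sketch: it simply mirrors
the utime path of the patch below, vtime_flush_user_sketch() is a made-up
name, and the TICK_NSEC value quoted in the comment assumes HZ=1000:

static void vtime_flush_user_sketch(struct task_struct *tsk, struct vtime *vtime)
{
	/* Accumulate the nanoseconds elapsed since vtime->starttime */
	vtime->utime += get_vtime_delta(vtime);
	/* Flush once a jiffy's worth (TICK_NSEC, ~1,000,000 ns) is pending */
	if (vtime->utime >= TICK_NSEC) {
		account_user_time(tsk, vtime->utime);
		vtime->utime = 0;
	}
}

The same pattern is applied to stime and gtime, and the cputime readers
(task_cputime(), task_gtime()) add the not-yet-flushed bucket on top of
the in-flight delta so the reported times keep progressing between
flushes.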

[fweisbec: changelog, rebase on struct vtime, field renames, add delta
on cputime readers, keep idle vtime as-is (low overhead accounting),
harmonize clock sources]

Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Not-Yet-Signed-off-by: Wanpeng Li <kernellwp@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/sched.h  |  3 +++
 kernel/sched/cputime.c | 64 +++++++++++++++++++++++++++++++++-----------------
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4a48c3f..85c5cb2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -236,6 +236,9 @@ struct vtime {
 	seqcount_t		seqcount;
 	unsigned long long	starttime;
 	enum vtime_state	state;
+	u64			utime;
+	u64			stime;
+	u64			gtime;
 };
 
 struct sched_info {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b28d312..3dafea5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -681,18 +681,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long long clock;
 
-	if (time_before(now, (unsigned long)vtime->starttime))
+	clock = sched_clock_cpu(smp_processor_id());
+	if (clock < vtime->starttime)
 		return 0;
 
-	return jiffies_to_nsecs(now - vtime->starttime);
+	return clock - vtime->starttime;
 }
 
 static u64 get_vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
-	u64 delta, other;
+	u64 delta = vtime_delta(vtime);
+	u64 other;
 
 	/*
 	 * Unlike tick based timing, vtime based timing never has lost
@@ -701,17 +702,31 @@ static u64 get_vtime_delta(struct vtime *vtime)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - vtime->starttime);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
-	vtime->starttime = now;
+	vtime->starttime += delta;
 
 	return delta - other;
 }
 
-static void __vtime_account_system(struct task_struct *tsk)
+static void __vtime_account_system(struct task_struct *tsk,
+				   struct vtime *vtime)
 {
-	account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
+	vtime->stime += get_vtime_delta(vtime);
+	if (vtime->stime >= TICK_NSEC) {
+		account_system_time(tsk, irq_count(), vtime->stime);
+		vtime->stime = 0;
+	}
+}
+
+static void vtime_account_guest(struct task_struct *tsk,
+				struct vtime *vtime)
+{
+	vtime->gtime += get_vtime_delta(vtime);
+	if (vtime->gtime >= TICK_NSEC) {
+		account_guest_time(tsk, vtime->gtime);
+		vtime->gtime = 0;
+	}
 }
 
 void vtime_account_system(struct task_struct *tsk)
@@ -722,7 +737,11 @@ void vtime_account_system(struct task_struct *tsk)
 		return;
 
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk);
+	/* We might have scheduled out from guest path */
+	if (current->flags & PF_VCPU)
+		vtime_account_guest(tsk, vtime);
+	else
+		__vtime_account_system(tsk, vtime);
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -731,8 +750,7 @@ void vtime_user_enter(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		__vtime_account_system(tsk);
+	__vtime_account_system(tsk, vtime);
 	vtime->state = VTIME_USER;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -742,8 +760,11 @@ void vtime_user_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		account_user_time(tsk, get_vtime_delta(vtime));
+	vtime->utime += get_vtime_delta(vtime);
+	if (vtime->utime >= TICK_NSEC) {
+		account_user_time(tsk, vtime->utime);
+		vtime->utime = 0;
+	}
 	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -759,8 +780,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		__vtime_account_system(tsk);
+	__vtime_account_system(tsk, vtime);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -771,7 +791,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk);
+	vtime_account_guest(tsk, vtime);
 	current->flags &= ~PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -794,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = jiffies;
+	vtime->starttime = sched_clock_cpu(smp_processor_id());
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -806,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_save(flags);
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = jiffies;
+	vtime->starttime = sched_clock_cpu(cpu);
 	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
 }
@@ -825,7 +845,7 @@ u64 task_gtime(struct task_struct *t)
 
 		gtime = t->gtime;
 		if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
-			gtime += vtime_delta(vtime);
+			gtime += vtime->gtime + vtime_delta(vtime);
 
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
@@ -866,9 +886,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		 * the right place.
 		 */
 		if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
-			*utime += delta;
+			*utime += vtime->utime + delta;
 		else if (vtime->state == VTIME_SYS)
-			*stime += delta;
+			*stime += vtime->stime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
2.7.4


* Re: [PATCH 1/5] vtime: Remove vtime_account_user()
  2017-06-29 17:15 ` [PATCH 1/5] vtime: Remove vtime_account_user() Frederic Weisbecker
@ 2017-06-29 23:01   ` Rik van Riel
  2017-07-05 10:28   ` [tip:sched/urgent] vtime, sched/cputime: " tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2017-06-29 23:01 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Luiz Capitulino, Ingo Molnar,
	Wanpeng Li

On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> It's an unnecessary function between vtime_user_exit() and
> account_user_time().
> 
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime
  2017-06-29 17:15 ` [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime Frederic Weisbecker
@ 2017-06-29 23:01   ` Rik van Riel
  2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: Always set tsk->vtime_snap_whence " tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2017-06-29 23:01 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Luiz Capitulino, Ingo Molnar,
	Wanpeng Li

On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> Even though it doesn't have functional consequences, setting
> the task's new context state after we actually accounted the pending
> vtime from the old context state makes more sense from a review
> perspective.
> 
> vtime_user_exit() is the only function that doesn't follow that rule
> and that can bug the reviewer for a little while until he realizes
> there
> is no reason for this special case.
> 
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH 3/5] sched: Rename vtime fields
  2017-06-29 17:15 ` [PATCH 3/5] sched: Rename vtime fields Frederic Weisbecker
@ 2017-06-29 23:02   ` Rik van Riel
  2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: " tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2017-06-29 23:02 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Luiz Capitulino, Ingo Molnar,
	Wanpeng Li

On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> The current "snapshot" based naming on vtime fields suggests we
> record
> some past event but that's a low level picture of their actual
> purpose
> which comes out blurry. The real point of these fields is to run a
> basic
> state machine that tracks down cputime entry while switching between
> contexts.
> 
> So lets reflect that with more meaningful names.
> 
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH 4/5] sched: Move vtime task fields to their own struct
  2017-06-29 17:15 ` [PATCH 4/5] sched: Move vtime task fields to their own struct Frederic Weisbecker
@ 2017-06-29 23:05   ` Rik van Riel
  2017-07-05 10:29   ` [tip:sched/urgent] sched/cputime: Move the " tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2017-06-29 23:05 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Peter Zijlstra, Thomas Gleixner, Luiz Capitulino, Ingo Molnar,
	Wanpeng Li

On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> We are about to add vtime accumulation fields to the task struct.
> Let's
> avoid more bloatification and gather vtime informations to their own
> struct.
> 
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Rik van Riel <riel@redhat.com>


* Re: [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource
  2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
@ 2017-06-29 23:27   ` Rik van Riel
  2017-07-05 13:20     ` Frederic Weisbecker
  2017-06-30  1:52   ` Wanpeng Li
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Rik van Riel @ 2017-06-29 23:27 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Wanpeng Li, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Ingo Molnar

On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> From: Wanpeng Li <kernellwp@gmail.com>
> 
> Currently the cputime source used by vtime is jiffies. When we cross
> a context boundary and jiffies have changed since the last snapshot,
> the
> pending cputime is accounted to the switching out context.
> 
> This system works ok if the ticks are not aligned across CPUs. If
> they
> instead are aligned (ie: all fire at the same time) and the CPUs run
> in
> userspace, the jiffies change is only observed on tick exit and
> therefore
> the user cputime is accounted as system cputime. This is because the
> CPU that maintains timekeeping fires its tick at the same time as the
> others. It updates jiffies in the middle of the tick and the other
> CPUs
> see that update on IRQ exit:
> 
>     CPU 0 (timekeeper)                  CPU 1
>     -------------------              -------------
>                       jiffies = N
>     ...                              run in userspace for a jiffy
>     tick entry                       tick entry (sees jiffies = N)
>     set jiffies = N + 1
>     tick exit                        tick exit (sees jiffies = N + 1)
>                                                 account 1 jiffy as
> stime
> 
> Fix this with using a nanosec clock source instead of jiffies. The
> cputime is then accumulated and flushed everytime the pending delta
> reaches a jiffy in order to mitigate the accounting overhead.

Glad to hear this could be done without dramatically
increasing the accounting overhead!

> [fweisbec: changelog, rebase on struct vtime, field renames, add
> delta
> on cputime readers, keep idle vtime as-is (low overhead accounting),
> harmonize clock sources]
> 
> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Not-Yet-Signed-off-by: Wanpeng Li <kernellwp@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Acked-by: Rik van Riel <riel@redhat.com>


* Re: [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
@ 2017-06-30  1:41 ` Wanpeng Li
  2017-06-30 17:32   ` Luiz Capitulino
  2017-07-03 10:28 ` Thomas Gleixner
  2017-07-04 16:52 ` Luiz Capitulino
  7 siblings, 1 reply; 25+ messages in thread
From: Wanpeng Li @ 2017-06-30  1:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Ingo Molnar, Rik van Riel

Hi Luiz,

2017-06-30 1:15 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> Hi,
>
> This is a proposition to fix
> "[BUG nohz]: wrong user and system time accounting":
>         http://lkml.kernel.org/r/20170323165512.60945ac6@redhat.com
>
> I took Wanpeng Li's last patch and enhanced around it.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>         sched/core
>
> HEAD: 9c7442613755e0ee0fc915ac876d88d4d2c7385e

Could you give it a try?

Regards,
Wanpeng Li

>
> Thanks,
>         Frederic
> ---
>
> Frederic Weisbecker (4):
>       vtime: Remove vtime_account_user()
>       sched: Always set vtime_snap_whence after accounting vtime
>       sched: Rename vtime fields
>       sched: Move vtime task fields to their own struct
>
> Wanpeng Li (1):
>       sched: Accumulate vtime on top of nsec clocksource
>
>
>  include/linux/init_task.h |   6 +-
>  include/linux/sched.h     |  29 ++++++---
>  include/linux/vtime.h     |   9 +--
>  kernel/fork.c             |   6 +-
>  kernel/sched/cputime.c    | 158 ++++++++++++++++++++++++++++------------------
>  5 files changed, 123 insertions(+), 85 deletions(-)


* Re: [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource
  2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
  2017-06-29 23:27   ` Rik van Riel
@ 2017-06-30  1:52   ` Wanpeng Li
  2017-07-05 10:29   ` [tip:sched/urgent] sched/cputime: " tip-bot for Wanpeng Li
  2017-07-15  3:37   ` [PATCH 5/5] sched: " Levin, Alexander (Sasha Levin)
  3 siblings, 0 replies; 25+ messages in thread
From: Wanpeng Li @ 2017-06-30  1:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Luiz Capitulino,
	Ingo Molnar, Rik van Riel

2017-06-30 1:15 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> From: Wanpeng Li <kernellwp@gmail.com>

From: Wanpeng Li <wanpeng.li@hotmail.com>

>
> Currently the cputime source used by vtime is jiffies. When we cross
> a context boundary and jiffies have changed since the last snapshot, the
> pending cputime is accounted to the switching out context.
>
> This system works ok if the ticks are not aligned across CPUs. If they
> instead are aligned (ie: all fire at the same time) and the CPUs run in
> userspace, the jiffies change is only observed on tick exit and therefore
> the user cputime is accounted as system cputime. This is because the
> CPU that maintains timekeeping fires its tick at the same time as the
> others. It updates jiffies in the middle of the tick and the other CPUs
> see that update on IRQ exit:
>
>     CPU 0 (timekeeper)                  CPU 1
>     -------------------              -------------
>                       jiffies = N
>     ...                              run in userspace for a jiffy
>     tick entry                       tick entry (sees jiffies = N)
>     set jiffies = N + 1
>     tick exit                        tick exit (sees jiffies = N + 1)
>                                                 account 1 jiffy as stime
>
> Fix this with using a nanosec clock source instead of jiffies. The
> cputime is then accumulated and flushed everytime the pending delta
> reaches a jiffy in order to mitigate the accounting overhead.
>
> [fweisbec: changelog, rebase on struct vtime, field renames, add delta
> on cputime readers, keep idle vtime as-is (low overhead accounting),
> harmonize clock sources]
>
> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Not-Yet-Signed-off-by: Wanpeng Li <kernellwp@gmail.com>

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>

Thanks for the patchset, Frederic! :)

Regards,
Wanpeng Li


> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---
>  include/linux/sched.h  |  3 +++
>  kernel/sched/cputime.c | 64 +++++++++++++++++++++++++++++++++-----------------
>  2 files changed, 45 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4a48c3f..85c5cb2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -236,6 +236,9 @@ struct vtime {
>         seqcount_t              seqcount;
>         unsigned long long      starttime;
>         enum vtime_state        state;
> +       u64                     utime;
> +       u64                     stime;
> +       u64                     gtime;
>  };
>
>  struct sched_info {
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index b28d312..3dafea5 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -681,18 +681,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static u64 vtime_delta(struct vtime *vtime)
>  {
> -       unsigned long now = READ_ONCE(jiffies);
> +       unsigned long long clock;
>
> -       if (time_before(now, (unsigned long)vtime->starttime))
> +       clock = sched_clock_cpu(smp_processor_id());
> +       if (clock < vtime->starttime)
>                 return 0;
>
> -       return jiffies_to_nsecs(now - vtime->starttime);
> +       return clock - vtime->starttime;
>  }
>
>  static u64 get_vtime_delta(struct vtime *vtime)
>  {
> -       unsigned long now = READ_ONCE(jiffies);
> -       u64 delta, other;
> +       u64 delta = vtime_delta(vtime);
> +       u64 other;
>
>         /*
>          * Unlike tick based timing, vtime based timing never has lost
> @@ -701,17 +702,31 @@ static u64 get_vtime_delta(struct vtime *vtime)
>          * elapsed time. Limit account_other_time to prevent rounding
>          * errors from causing elapsed vtime to go negative.
>          */
> -       delta = jiffies_to_nsecs(now - vtime->starttime);
>         other = account_other_time(delta);
>         WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
> -       vtime->starttime = now;
> +       vtime->starttime += delta;
>
>         return delta - other;
>  }
>
> -static void __vtime_account_system(struct task_struct *tsk)
> +static void __vtime_account_system(struct task_struct *tsk,
> +                                  struct vtime *vtime)
>  {
> -       account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
> +       vtime->stime += get_vtime_delta(vtime);
> +       if (vtime->stime >= TICK_NSEC) {
> +               account_system_time(tsk, irq_count(), vtime->stime);
> +               vtime->stime = 0;
> +       }
> +}
> +
> +static void vtime_account_guest(struct task_struct *tsk,
> +                               struct vtime *vtime)
> +{
> +       vtime->gtime += get_vtime_delta(vtime);
> +       if (vtime->gtime >= TICK_NSEC) {
> +               account_guest_time(tsk, vtime->gtime);
> +               vtime->gtime = 0;
> +       }
>  }
>
>  void vtime_account_system(struct task_struct *tsk)
> @@ -722,7 +737,11 @@ void vtime_account_system(struct task_struct *tsk)
>                 return;
>
>         write_seqcount_begin(&vtime->seqcount);
> -       __vtime_account_system(tsk);
> +       /* We might have scheduled out from guest path */
> +       if (current->flags & PF_VCPU)
> +               vtime_account_guest(tsk, vtime);
> +       else
> +               __vtime_account_system(tsk, vtime);
>         write_seqcount_end(&vtime->seqcount);
>  }
>
> @@ -731,8 +750,7 @@ void vtime_user_enter(struct task_struct *tsk)
>         struct vtime *vtime = &tsk->vtime;
>
>         write_seqcount_begin(&vtime->seqcount);
> -       if (vtime_delta(vtime))
> -               __vtime_account_system(tsk);
> +       __vtime_account_system(tsk, vtime);
>         vtime->state = VTIME_USER;
>         write_seqcount_end(&vtime->seqcount);
>  }
> @@ -742,8 +760,11 @@ void vtime_user_exit(struct task_struct *tsk)
>         struct vtime *vtime = &tsk->vtime;
>
>         write_seqcount_begin(&vtime->seqcount);
> -       if (vtime_delta(vtime))
> -               account_user_time(tsk, get_vtime_delta(vtime));
> +       vtime->utime += get_vtime_delta(vtime);
> +       if (vtime->utime >= TICK_NSEC) {
> +               account_user_time(tsk, vtime->utime);
> +               vtime->utime = 0;
> +       }
>         vtime->state = VTIME_SYS;
>         write_seqcount_end(&vtime->seqcount);
>  }
> @@ -759,8 +780,7 @@ void vtime_guest_enter(struct task_struct *tsk)
>          * that can thus safely catch up with a tickless delta.
>          */
>         write_seqcount_begin(&vtime->seqcount);
> -       if (vtime_delta(vtime))
> -               __vtime_account_system(tsk);
> +       __vtime_account_system(tsk, vtime);
>         current->flags |= PF_VCPU;
>         write_seqcount_end(&vtime->seqcount);
>  }
> @@ -771,7 +791,7 @@ void vtime_guest_exit(struct task_struct *tsk)
>         struct vtime *vtime = &tsk->vtime;
>
>         write_seqcount_begin(&vtime->seqcount);
> -       __vtime_account_system(tsk);
> +       vtime_account_guest(tsk, vtime);
>         current->flags &= ~PF_VCPU;
>         write_seqcount_end(&vtime->seqcount);
>  }
> @@ -794,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
>
>         write_seqcount_begin(&vtime->seqcount);
>         vtime->state = VTIME_SYS;
> -       vtime->starttime = jiffies;
> +       vtime->starttime = sched_clock_cpu(smp_processor_id());
>         write_seqcount_end(&vtime->seqcount);
>  }
>
> @@ -806,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
>         local_irq_save(flags);
>         write_seqcount_begin(&vtime->seqcount);
>         vtime->state = VTIME_SYS;
> -       vtime->starttime = jiffies;
> +       vtime->starttime = sched_clock_cpu(cpu);
>         write_seqcount_end(&vtime->seqcount);
>         local_irq_restore(flags);
>  }
> @@ -825,7 +845,7 @@ u64 task_gtime(struct task_struct *t)
>
>                 gtime = t->gtime;
>                 if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
> -                       gtime += vtime_delta(vtime);
> +                       gtime += vtime->gtime + vtime_delta(vtime);
>
>         } while (read_seqcount_retry(&vtime->seqcount, seq));
>
> @@ -866,9 +886,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
>                  * the right place.
>                  */
>                 if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
> -                       *utime += delta;
> +                       *utime += vtime->utime + delta;
>                 else if (vtime->state == VTIME_SYS)
> -                       *stime += delta;
> +                       *stime += vtime->stime + delta;
>         } while (read_seqcount_retry(&vtime->seqcount, seq));
>  }
>  #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
> --
> 2.7.4
>


* Re: [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
  2017-06-30  1:41 ` [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Wanpeng Li
@ 2017-06-30 17:32   ` Luiz Capitulino
  0 siblings, 0 replies; 25+ messages in thread
From: Luiz Capitulino @ 2017-06-30 17:32 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Rik van Riel

On Fri, 30 Jun 2017 09:41:18 +0800
Wanpeng Li <kernellwp@gmail.com> wrote:

> Hi Luiz,
> 
> 2017-06-30 1:15 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> > Hi,
> >
> > This is a proposition to fix
> > "[BUG nohz]: wrong user and system time accounting":
> >         http://lkml.kernel.org/r/20170323165512.60945ac6@redhat.com
> >
> > I took Wanpeng Li's last patch and enhanced around it.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> >         sched/core
> >
> > HEAD: 9c7442613755e0ee0fc915ac876d88d4d2c7385e  
> 
> Could you have a try?

Yes, I'll do this shortly. But I'll be off for a few days, so I may only
have the results next week.


* Re: [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2017-06-30  1:41 ` [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Wanpeng Li
@ 2017-07-03 10:28 ` Thomas Gleixner
  2017-07-04 16:52 ` Luiz Capitulino
  7 siblings, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2017-07-03 10:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Luiz Capitulino, Ingo Molnar, Wanpeng Li,
	Rik van Riel

On Thu, 29 Jun 2017, Frederic Weisbecker wrote:

> Hi,
> 
> This is a proposition to fix
> "[BUG nohz]: wrong user and system time accounting":
>         http://lkml.kernel.org/r/20170323165512.60945ac6@redhat.com
> 
> I took Wanpeng Li's last patch and enhanced around it.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 	sched/core

For the series:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* Re: [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
  2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2017-07-03 10:28 ` Thomas Gleixner
@ 2017-07-04 16:52 ` Luiz Capitulino
  2017-07-05 13:16   ` Frederic Weisbecker
  7 siblings, 1 reply; 25+ messages in thread
From: Luiz Capitulino @ 2017-07-04 16:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Wanpeng Li,
	Rik van Riel

On Thu, 29 Jun 2017 19:15:06 +0200
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Hi,
> 
> This is a proposition to fix
> "[BUG nohz]: wrong user and system time accounting":
>         http://lkml.kernel.org/r/20170323165512.60945ac6@redhat.com

Amazing series, Frederic! This fixes all instances of the issue
for me on bare metal and in KVM guests; even acct-bug[1] is fixed.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>

[1] http://people.redhat.com/~lcapitul/real-time/acct-bug.c


* [tip:sched/urgent] vtime, sched/cputime: Remove vtime_account_user()
  2017-06-29 17:15 ` [PATCH 1/5] vtime: Remove vtime_account_user() Frederic Weisbecker
  2017-06-29 23:01   ` Rik van Riel
@ 2017-07-05 10:28   ` tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot for Frederic Weisbecker @ 2017-07-05 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, peterz, tglx, lcapitulino, kernellwp, fweisbec,
	torvalds, hpa, riel, mingo

Commit-ID:  1c3eda01a79b8e9237d91c52c5a75b20983f47c6
Gitweb:     http://git.kernel.org/tip/1c3eda01a79b8e9237d91c52c5a75b20983f47c6
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Thu, 29 Jun 2017 19:15:07 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 5 Jul 2017 09:54:14 +0200

vtime, sched/cputime: Remove vtime_account_user()

It's an unnecessary function between vtime_user_exit() and
account_user_time().

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/vtime.h  |  9 +--------
 kernel/sched/cputime.c | 12 ++++++------
 2 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 0681fe2..18b405e 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -67,19 +67,12 @@ static inline void vtime_account_system(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_account_user(struct task_struct *tsk);
 extern void vtime_user_enter(struct task_struct *tsk);
-
-static inline void vtime_user_exit(struct task_struct *tsk)
-{
-	vtime_account_user(tsk);
-}
-
+extern void vtime_user_exit(struct task_struct *tsk);
 extern void vtime_guest_enter(struct task_struct *tsk);
 extern void vtime_guest_exit(struct task_struct *tsk);
 extern void vtime_init_idle(struct task_struct *tsk, int cpu);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN  */
-static inline void vtime_account_user(struct task_struct *tsk) { }
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
 static inline void vtime_guest_enter(struct task_struct *tsk) { }
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 84a419b..5adc896 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -724,21 +724,21 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_account_user(struct task_struct *tsk)
+void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	tsk->vtime_snap_whence = VTIME_SYS;
 	if (vtime_delta(tsk))
-		account_user_time(tsk, get_vtime_delta(tsk));
+		__vtime_account_system(tsk);
+	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_user_enter(struct task_struct *tsk)
+void vtime_user_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
+	tsk->vtime_snap_whence = VTIME_SYS;
 	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	tsk->vtime_snap_whence = VTIME_USER;
+		account_user_time(tsk, get_vtime_delta(tsk));
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip:sched/urgent] sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime
  2017-06-29 17:15 ` [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime Frederic Weisbecker
  2017-06-29 23:01   ` Rik van Riel
@ 2017-07-05 10:28   ` tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot for Frederic Weisbecker @ 2017-07-05 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: lcapitulino, riel, hpa, mingo, tglx, fweisbec, torvalds, peterz,
	linux-kernel, kernellwp

Commit-ID:  9fa57cf5a5c4aed1e45879b335fe433048709327
Gitweb:     http://git.kernel.org/tip/9fa57cf5a5c4aed1e45879b335fe433048709327
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Thu, 29 Jun 2017 19:15:08 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 5 Jul 2017 09:54:14 +0200

sched/cputime: Always set tsk->vtime_snap_whence after accounting vtime

Even though it doesn't have functional consequences, setting
the task's new context state after we have actually accounted the pending
vtime from the old context state makes more sense from a review
perspective.

vtime_user_exit() is the only function that doesn't follow that rule,
and it can confuse the reviewer for a little while until they realize
there is no reason for this special case.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5adc896..ab68927 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -736,9 +736,9 @@ void vtime_user_enter(struct task_struct *tsk)
 void vtime_user_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	tsk->vtime_snap_whence = VTIME_SYS;
 	if (vtime_delta(tsk))
 		account_user_time(tsk, get_vtime_delta(tsk));
+	tsk->vtime_snap_whence = VTIME_SYS;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip:sched/urgent] sched/cputime: Rename vtime fields
  2017-06-29 17:15 ` [PATCH 3/5] sched: Rename vtime fields Frederic Weisbecker
  2017-06-29 23:02   ` Rik van Riel
@ 2017-07-05 10:28   ` tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot for Frederic Weisbecker @ 2017-07-05 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, hpa, kernellwp, lcapitulino, tglx, fweisbec,
	riel, peterz, torvalds

Commit-ID:  60a9ce57e7c5ac1df3a39fb941022bbfa40c0862
Gitweb:     http://git.kernel.org/tip/60a9ce57e7c5ac1df3a39fb941022bbfa40c0862
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Thu, 29 Jun 2017 19:15:09 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 5 Jul 2017 09:54:14 +0200

sched/cputime: Rename vtime fields

The current "snapshot" based naming of the vtime fields suggests we record
some past event, but that's a low-level picture of their actual purpose
which comes out blurry. The real point of these fields is to run a basic
state machine that tracks the cputime accounting context while switching
between contexts.

So let's reflect that with more meaningful names.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |  4 ++--
 include/linux/sched.h     |  4 ++--
 kernel/fork.c             |  4 ++--
 kernel/sched/cputime.c    | 30 +++++++++++++++---------------
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e049526..3d53733 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -171,8 +171,8 @@ extern struct cred init_cred;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 # define INIT_VTIME(tsk)						\
 	.vtime_seqcount = SEQCNT_ZERO(tsk.vtime_seqcount),	\
-	.vtime_snap = 0,				\
-	.vtime_snap_whence = VTIME_SYS,
+	.vtime_starttime = 0,				\
+	.vtime_state = VTIME_SYS,
 #else
 # define INIT_VTIME(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9c4ca74..ff00164 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -689,7 +689,7 @@ struct task_struct {
 	struct prev_cputime		prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t			vtime_seqcount;
-	unsigned long long		vtime_snap;
+	unsigned long long		vtime_starttime;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive: */
 		VTIME_INACTIVE = 0,
@@ -697,7 +697,7 @@ struct task_struct {
 		VTIME_USER,
 		/* Task runs in kernelspace in a CPU with VTIME active: */
 		VTIME_SYS,
-	} vtime_snap_whence;
+	} vtime_state;
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d..83c4f9b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1638,8 +1638,8 @@ static __latent_entropy struct task_struct *copy_process(
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_init(&p->vtime_seqcount);
-	p->vtime_snap = 0;
-	p->vtime_snap_whence = VTIME_INACTIVE;
+	p->vtime_starttime = 0;
+	p->vtime_state = VTIME_INACTIVE;
 #endif
 
 #if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index ab68927..8c64753 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -683,10 +683,10 @@ static u64 vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
 
-	if (time_before(now, (unsigned long)tsk->vtime_snap))
+	if (time_before(now, (unsigned long)tsk->vtime_starttime))
 		return 0;
 
-	return jiffies_to_nsecs(now - tsk->vtime_snap);
+	return jiffies_to_nsecs(now - tsk->vtime_starttime);
 }
 
 static u64 get_vtime_delta(struct task_struct *tsk)
@@ -701,10 +701,10 @@ static u64 get_vtime_delta(struct task_struct *tsk)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - tsk->vtime_snap);
+	delta = jiffies_to_nsecs(now - tsk->vtime_starttime);
 	other = account_other_time(delta);
-	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-	tsk->vtime_snap = now;
+	WARN_ON_ONCE(tsk->vtime_state == VTIME_INACTIVE);
+	tsk->vtime_starttime = now;
 
 	return delta - other;
 }
@@ -746,7 +746,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 {
 	/*
 	 * The flags must be updated under the lock with
-	 * the vtime_snap flush and update.
+	 * the vtime_starttime flush and update.
 	 * That enforces a right ordering and update sequence
 	 * synchronization against the reader (task_gtime())
 	 * that can thus safely catch up with a tickless delta.
@@ -776,12 +776,12 @@ void vtime_account_idle(struct task_struct *tsk)
 void arch_vtime_task_switch(struct task_struct *prev)
 {
 	write_seqcount_begin(&prev->vtime_seqcount);
-	prev->vtime_snap_whence = VTIME_INACTIVE;
+	prev->vtime_state = VTIME_INACTIVE;
 	write_seqcount_end(&prev->vtime_seqcount);
 
 	write_seqcount_begin(&current->vtime_seqcount);
-	current->vtime_snap_whence = VTIME_SYS;
-	current->vtime_snap = jiffies;
+	current->vtime_state = VTIME_SYS;
+	current->vtime_starttime = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -791,8 +791,8 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 
 	local_irq_save(flags);
 	write_seqcount_begin(&t->vtime_seqcount);
-	t->vtime_snap_whence = VTIME_SYS;
-	t->vtime_snap = jiffies;
+	t->vtime_state = VTIME_SYS;
+	t->vtime_starttime = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
@@ -809,7 +809,7 @@ u64 task_gtime(struct task_struct *t)
 		seq = read_seqcount_begin(&t->vtime_seqcount);
 
 		gtime = t->gtime;
-		if (t->vtime_snap_whence == VTIME_SYS && t->flags & PF_VCPU)
+		if (t->vtime_state == VTIME_SYS && t->flags & PF_VCPU)
 			gtime += vtime_delta(t);
 
 	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
@@ -840,7 +840,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		*stime = t->stime;
 
 		/* Task is sleeping, nothing to add */
-		if (t->vtime_snap_whence == VTIME_INACTIVE || is_idle_task(t))
+		if (t->vtime_state == VTIME_INACTIVE || is_idle_task(t))
 			continue;
 
 		delta = vtime_delta(t);
@@ -849,9 +849,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		 * Task runs either in user or kernel space, add pending nohz time to
 		 * the right place.
 		 */
-		if (t->vtime_snap_whence == VTIME_USER || t->flags & PF_VCPU)
+		if (t->vtime_state == VTIME_USER || t->flags & PF_VCPU)
 			*utime += delta;
-		else if (t->vtime_snap_whence == VTIME_SYS)
+		else if (t->vtime_state == VTIME_SYS)
 			*stime += delta;
 	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
 }

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip:sched/urgent] sched/cputime: Move the vtime task fields to their own struct
  2017-06-29 17:15 ` [PATCH 4/5] sched: Move vtime task fields to their own struct Frederic Weisbecker
  2017-06-29 23:05   ` Rik van Riel
@ 2017-07-05 10:29   ` tip-bot for Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot for Frederic Weisbecker @ 2017-07-05 10:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, lcapitulino, fweisbec, tglx, peterz, linux-kernel,
	kernellwp, riel, hpa, mingo

Commit-ID:  bac5b6b6b11560f323e71d0ebac4061cfe5f56c0
Gitweb:     http://git.kernel.org/tip/bac5b6b6b11560f323e71d0ebac4061cfe5f56c0
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Thu, 29 Jun 2017 19:15:10 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 5 Jul 2017 09:54:15 +0200

sched/cputime: Move the vtime task fields to their own struct

We are about to add vtime accumulation fields to the task struct. Let's
avoid bloating it further and gather the vtime information into its own
struct.

Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |   6 +--
 include/linux/sched.h     |  26 ++++++-----
 kernel/fork.c             |   6 +--
 kernel/sched/cputime.c    | 112 ++++++++++++++++++++++++++--------------------
 4 files changed, 86 insertions(+), 64 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 3d53733..a2f6707 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -170,9 +170,9 @@ extern struct cred init_cred;
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 # define INIT_VTIME(tsk)						\
-	.vtime_seqcount = SEQCNT_ZERO(tsk.vtime_seqcount),	\
-	.vtime_starttime = 0,				\
-	.vtime_state = VTIME_SYS,
+	.vtime.seqcount = SEQCNT_ZERO(tsk.vtime.seqcount),		\
+	.vtime.starttime = 0,						\
+	.vtime.state = VTIME_SYS,
 #else
 # define INIT_VTIME(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff00164..eeff8a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -223,6 +223,21 @@ struct task_cputime {
 #define prof_exp			stime
 #define sched_exp			sum_exec_runtime
 
+enum vtime_state {
+	/* Task is sleeping or running in a CPU with VTIME inactive: */
+	VTIME_INACTIVE = 0,
+	/* Task runs in userspace in a CPU with VTIME active: */
+	VTIME_USER,
+	/* Task runs in kernelspace in a CPU with VTIME active: */
+	VTIME_SYS,
+};
+
+struct vtime {
+	seqcount_t		seqcount;
+	unsigned long long	starttime;
+	enum vtime_state	state;
+};
+
 struct sched_info {
 #ifdef CONFIG_SCHED_INFO
 	/* Cumulative counters: */
@@ -688,16 +703,7 @@ struct task_struct {
 	u64				gtime;
 	struct prev_cputime		prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-	seqcount_t			vtime_seqcount;
-	unsigned long long		vtime_starttime;
-	enum {
-		/* Task is sleeping or running in a CPU with VTIME inactive: */
-		VTIME_INACTIVE = 0,
-		/* Task runs in userspace in a CPU with VTIME active: */
-		VTIME_USER,
-		/* Task runs in kernelspace in a CPU with VTIME active: */
-		VTIME_SYS,
-	} vtime_state;
+	struct vtime			vtime;
 #endif
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/fork.c b/kernel/fork.c
index 83c4f9b..d927ec1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1637,9 +1637,9 @@ static __latent_entropy struct task_struct *copy_process(
 	prev_cputime_init(&p->prev_cputime);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-	seqcount_init(&p->vtime_seqcount);
-	p->vtime_starttime = 0;
-	p->vtime_state = VTIME_INACTIVE;
+	seqcount_init(&p->vtime.seqcount);
+	p->vtime.starttime = 0;
+	p->vtime.state = VTIME_INACTIVE;
 #endif
 
 #if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8c64753..9ee725e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -679,17 +679,17 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static u64 vtime_delta(struct task_struct *tsk)
+static u64 vtime_delta(struct vtime *vtime)
 {
 	unsigned long now = READ_ONCE(jiffies);
 
-	if (time_before(now, (unsigned long)tsk->vtime_starttime))
+	if (time_before(now, (unsigned long)vtime->starttime))
 		return 0;
 
-	return jiffies_to_nsecs(now - tsk->vtime_starttime);
+	return jiffies_to_nsecs(now - vtime->starttime);
 }
 
-static u64 get_vtime_delta(struct task_struct *tsk)
+static u64 get_vtime_delta(struct vtime *vtime)
 {
 	unsigned long now = READ_ONCE(jiffies);
 	u64 delta, other;
@@ -701,49 +701,56 @@ static u64 get_vtime_delta(struct task_struct *tsk)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - tsk->vtime_starttime);
+	delta = jiffies_to_nsecs(now - vtime->starttime);
 	other = account_other_time(delta);
-	WARN_ON_ONCE(tsk->vtime_state == VTIME_INACTIVE);
-	tsk->vtime_starttime = now;
+	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
+	vtime->starttime = now;
 
 	return delta - other;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
 {
-	account_system_time(tsk, irq_count(), get_vtime_delta(tsk));
+	account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
 }
 
 void vtime_account_system(struct task_struct *tsk)
 {
-	if (!vtime_delta(tsk))
+	struct vtime *vtime = &tsk->vtime;
+
+	if (!vtime_delta(vtime))
 		return;
 
-	write_seqcount_begin(&tsk->vtime_seqcount);
+	write_seqcount_begin(&vtime->seqcount);
 	__vtime_account_system(tsk);
-	write_seqcount_end(&tsk->vtime_seqcount);
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	if (vtime_delta(vtime))
 		__vtime_account_system(tsk);
-	tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	vtime->state = VTIME_USER;
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_user_exit(struct task_struct *tsk)
 {
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		account_user_time(tsk, get_vtime_delta(tsk));
-	tsk->vtime_snap_whence = VTIME_SYS;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	if (vtime_delta(vtime))
+		account_user_time(tsk, get_vtime_delta(vtime));
+	vtime->state = VTIME_SYS;
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_guest_enter(struct task_struct *tsk)
 {
+	struct vtime *vtime = &tsk->vtime;
 	/*
 	 * The flags must be updated under the lock with
 	 * the vtime_starttime flush and update.
@@ -751,54 +758,62 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * synchronization against the reader (task_gtime())
 	 * that can thus safely catch up with a tickless delta.
 	 */
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
+	write_seqcount_begin(&vtime->seqcount);
+	if (vtime_delta(vtime))
 		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_enter);
 
 void vtime_guest_exit(struct task_struct *tsk)
 {
-	write_seqcount_begin(&tsk->vtime_seqcount);
+	struct vtime *vtime = &tsk->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
 	__vtime_account_system(tsk);
 	current->flags &= ~PF_VCPU;
-	write_seqcount_end(&tsk->vtime_seqcount);
+	write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	account_idle_time(get_vtime_delta(tsk));
+	account_idle_time(get_vtime_delta(&tsk->vtime));
 }
 
 void arch_vtime_task_switch(struct task_struct *prev)
 {
-	write_seqcount_begin(&prev->vtime_seqcount);
-	prev->vtime_state = VTIME_INACTIVE;
-	write_seqcount_end(&prev->vtime_seqcount);
+	struct vtime *vtime = &prev->vtime;
 
-	write_seqcount_begin(&current->vtime_seqcount);
-	current->vtime_state = VTIME_SYS;
-	current->vtime_starttime = jiffies;
-	write_seqcount_end(&current->vtime_seqcount);
+	write_seqcount_begin(&vtime->seqcount);
+	vtime->state = VTIME_INACTIVE;
+	write_seqcount_end(&vtime->seqcount);
+
+	vtime = &current->vtime;
+
+	write_seqcount_begin(&vtime->seqcount);
+	vtime->state = VTIME_SYS;
+	vtime->starttime = jiffies;
+	write_seqcount_end(&vtime->seqcount);
 }
 
 void vtime_init_idle(struct task_struct *t, int cpu)
 {
+	struct vtime *vtime = &t->vtime;
 	unsigned long flags;
 
 	local_irq_save(flags);
-	write_seqcount_begin(&t->vtime_seqcount);
-	t->vtime_state = VTIME_SYS;
-	t->vtime_starttime = jiffies;
-	write_seqcount_end(&t->vtime_seqcount);
+	write_seqcount_begin(&vtime->seqcount);
+	vtime->state = VTIME_SYS;
+	vtime->starttime = jiffies;
+	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
 }
 
 u64 task_gtime(struct task_struct *t)
 {
+	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 gtime;
 
@@ -806,13 +821,13 @@ u64 task_gtime(struct task_struct *t)
 		return t->gtime;
 
 	do {
-		seq = read_seqcount_begin(&t->vtime_seqcount);
+		seq = read_seqcount_begin(&vtime->seqcount);
 
 		gtime = t->gtime;
-		if (t->vtime_state == VTIME_SYS && t->flags & PF_VCPU)
-			gtime += vtime_delta(t);
+		if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
+			gtime += vtime_delta(vtime);
 
-	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
+	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
 	return gtime;
 }
@@ -824,8 +839,9 @@ u64 task_gtime(struct task_struct *t)
  */
 void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
-	u64 delta;
+	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
+	u64 delta;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
@@ -834,25 +850,25 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 	}
 
 	do {
-		seq = read_seqcount_begin(&t->vtime_seqcount);
+		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
 		*stime = t->stime;
 
 		/* Task is sleeping, nothing to add */
-		if (t->vtime_state == VTIME_INACTIVE || is_idle_task(t))
+		if (vtime->state == VTIME_INACTIVE || is_idle_task(t))
 			continue;
 
-		delta = vtime_delta(t);
+		delta = vtime_delta(vtime);
 
 		/*
 		 * Task runs either in user or kernel space, add pending nohz time to
 		 * the right place.
 		 */
-		if (t->vtime_state == VTIME_USER || t->flags & PF_VCPU)
+		if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
 			*utime += delta;
-		else if (t->vtime_state == VTIME_SYS)
+		else if (vtime->state == VTIME_SYS)
 			*stime += delta;
-	} while (read_seqcount_retry(&t->vtime_seqcount, seq));
+	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip:sched/urgent] sched/cputime: Accumulate vtime on top of nsec clocksource
  2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
  2017-06-29 23:27   ` Rik van Riel
  2017-06-30  1:52   ` Wanpeng Li
@ 2017-07-05 10:29   ` tip-bot for Wanpeng Li
  2017-07-15  3:37   ` [PATCH 5/5] sched: " Levin, Alexander (Sasha Levin)
  3 siblings, 0 replies; 25+ messages in thread
From: tip-bot for Wanpeng Li @ 2017-07-05 10:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, linux-kernel, torvalds, mingo, fweisbec, peterz,
	lcapitulino, kernellwp, wanpeng.li, tglx, riel

Commit-ID:  2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Gitweb:     http://git.kernel.org/tip/2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Author:     Wanpeng Li <wanpeng.li@hotmail.com>
AuthorDate: Thu, 29 Jun 2017 19:15:11 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 5 Jul 2017 09:54:15 +0200

sched/cputime: Accumulate vtime on top of nsec clocksource

Currently the cputime source used by vtime is jiffies. When we cross
a context boundary and jiffies have changed since the last snapshot, the
pending cputime is accounted to the switching out context.

This system works OK if the ticks are not aligned across CPUs. If they
instead are aligned (i.e. they all fire at the same time) and the CPUs run in
userspace, the jiffies change is only observed on tick exit and therefore
the user cputime is accounted as system cputime. This is because the
CPU that maintains timekeeping fires its tick at the same time as the
others. It updates jiffies in the middle of the tick and the other CPUs
see that update on IRQ exit:

    CPU 0 (timekeeper)                  CPU 1
    -------------------              -------------
                      jiffies = N
    ...                              run in userspace for a jiffy
    tick entry                       tick entry (sees jiffies = N)
    set jiffies = N + 1
    tick exit                        tick exit (sees jiffies = N + 1)
                                                account 1 jiffy as stime

Fix this by using a nanosecond clock source instead of jiffies. The
cputime is then accumulated and flushed every time the pending delta
reaches a jiffy, in order to mitigate the accounting overhead.
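
Expressed outside the diff below, the accumulate-and-flush idea is simply
the following (an illustrative sketch, not the exact kernel hunks; the
flush callback stands in for account_user_time()/account_system_time()):

    /*
     * Illustrative only: deltas are measured with a nanosecond clock
     * and only handed to the accounting core once a full tick's worth
     * of time is pending, which keeps the overhead close to the old
     * jiffies-based scheme.
     */
    static void vtime_accumulate(u64 *acc, u64 delta, void (*flush)(u64 nsecs))
    {
            *acc += delta;                  /* pending nanoseconds */
            if (*acc >= TICK_NSEC) {        /* at least one jiffy pending */
                    flush(*acc);
                    *acc = 0;
            }
    }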

[ fweisbec: changelog, rebase on struct vtime, field renames, add delta
  on cputime readers, keep idle vtime as-is (low overhead accounting),
  harmonize clock sources. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1498756511-11714-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h  |  3 +++
 kernel/sched/cputime.c | 64 +++++++++++++++++++++++++++++++++-----------------
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eeff8a0..4818126 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -236,6 +236,9 @@ struct vtime {
 	seqcount_t		seqcount;
 	unsigned long long	starttime;
 	enum vtime_state	state;
+	u64			utime;
+	u64			stime;
+	u64			gtime;
 };
 
 struct sched_info {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9ee725e..6e3ea4a 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -681,18 +681,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long long clock;
 
-	if (time_before(now, (unsigned long)vtime->starttime))
+	clock = sched_clock_cpu(smp_processor_id());
+	if (clock < vtime->starttime)
 		return 0;
 
-	return jiffies_to_nsecs(now - vtime->starttime);
+	return clock - vtime->starttime;
 }
 
 static u64 get_vtime_delta(struct vtime *vtime)
 {
-	unsigned long now = READ_ONCE(jiffies);
-	u64 delta, other;
+	u64 delta = vtime_delta(vtime);
+	u64 other;
 
 	/*
 	 * Unlike tick based timing, vtime based timing never has lost
@@ -701,17 +702,31 @@ static u64 get_vtime_delta(struct vtime *vtime)
 	 * elapsed time. Limit account_other_time to prevent rounding
 	 * errors from causing elapsed vtime to go negative.
 	 */
-	delta = jiffies_to_nsecs(now - vtime->starttime);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);
-	vtime->starttime = now;
+	vtime->starttime += delta;
 
 	return delta - other;
 }
 
-static void __vtime_account_system(struct task_struct *tsk)
+static void __vtime_account_system(struct task_struct *tsk,
+				   struct vtime *vtime)
 {
-	account_system_time(tsk, irq_count(), get_vtime_delta(&tsk->vtime));
+	vtime->stime += get_vtime_delta(vtime);
+	if (vtime->stime >= TICK_NSEC) {
+		account_system_time(tsk, irq_count(), vtime->stime);
+		vtime->stime = 0;
+	}
+}
+
+static void vtime_account_guest(struct task_struct *tsk,
+				struct vtime *vtime)
+{
+	vtime->gtime += get_vtime_delta(vtime);
+	if (vtime->gtime >= TICK_NSEC) {
+		account_guest_time(tsk, vtime->gtime);
+		vtime->gtime = 0;
+	}
 }
 
 void vtime_account_system(struct task_struct *tsk)
@@ -722,7 +737,11 @@ void vtime_account_system(struct task_struct *tsk)
 		return;
 
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk);
+	/* We might have scheduled out from guest path */
+	if (current->flags & PF_VCPU)
+		vtime_account_guest(tsk, vtime);
+	else
+		__vtime_account_system(tsk, vtime);
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -731,8 +750,7 @@ void vtime_user_enter(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		__vtime_account_system(tsk);
+	__vtime_account_system(tsk, vtime);
 	vtime->state = VTIME_USER;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -742,8 +760,11 @@ void vtime_user_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		account_user_time(tsk, get_vtime_delta(vtime));
+	vtime->utime += get_vtime_delta(vtime);
+	if (vtime->utime >= TICK_NSEC) {
+		account_user_time(tsk, vtime->utime);
+		vtime->utime = 0;
+	}
 	vtime->state = VTIME_SYS;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -759,8 +780,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&vtime->seqcount);
-	if (vtime_delta(vtime))
-		__vtime_account_system(tsk);
+	__vtime_account_system(tsk, vtime);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -771,7 +791,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 	struct vtime *vtime = &tsk->vtime;
 
 	write_seqcount_begin(&vtime->seqcount);
-	__vtime_account_system(tsk);
+	vtime_account_guest(tsk, vtime);
 	current->flags &= ~PF_VCPU;
 	write_seqcount_end(&vtime->seqcount);
 }
@@ -794,7 +814,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = jiffies;
+	vtime->starttime = sched_clock_cpu(smp_processor_id());
 	write_seqcount_end(&vtime->seqcount);
 }
 
@@ -806,7 +826,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_save(flags);
 	write_seqcount_begin(&vtime->seqcount);
 	vtime->state = VTIME_SYS;
-	vtime->starttime = jiffies;
+	vtime->starttime = sched_clock_cpu(cpu);
 	write_seqcount_end(&vtime->seqcount);
 	local_irq_restore(flags);
 }
@@ -825,7 +845,7 @@ u64 task_gtime(struct task_struct *t)
 
 		gtime = t->gtime;
 		if (vtime->state == VTIME_SYS && t->flags & PF_VCPU)
-			gtime += vtime_delta(vtime);
+			gtime += vtime->gtime + vtime_delta(vtime);
 
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 
@@ -866,9 +886,9 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		 * the right place.
 		 */
 		if (vtime->state == VTIME_USER || t->flags & PF_VCPU)
-			*utime += delta;
+			*utime += vtime->utime + delta;
 		else if (vtime->state == VTIME_SYS)
-			*stime += delta;
+			*stime += vtime->stime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
  2017-07-04 16:52 ` Luiz Capitulino
@ 2017-07-05 13:16   ` Frederic Weisbecker
  0 siblings, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-07-05 13:16 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: LKML, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Wanpeng Li,
	Rik van Riel

On Tue, Jul 04, 2017 at 12:52:00PM -0400, Luiz Capitulino wrote:
> On Thu, 29 Jun 2017 19:15:06 +0200
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > Hi,
> > 
> > This is a proposition to fix
> > "[BUG nohz]: wrong user and system time accounting":
> >         http://lkml.kernel.org/r/20170323165512.60945ac6@redhat.com
> 
> Amazing series Frederic! This fixes all instances of the issue
> for me on bare-metal and KVM guests, even acct-bug[1] is fixed.
> 
> Tested-by: Luiz Capitulino <lcapitulino@redhat.com>
> 
> [1] http://people.redhat.com/~lcapitul/real-time/acct-bug.c

I'm glad it worked out! Thanks a lot for your testing. Now let's see
if someone stumbles upon a new bug introduced by this series, given how
sensitive this code can be.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource
  2017-06-29 23:27   ` Rik van Riel
@ 2017-07-05 13:20     ` Frederic Weisbecker
  0 siblings, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2017-07-05 13:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, Wanpeng Li, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar

On Thu, Jun 29, 2017 at 07:27:27PM -0400, Rik van Riel wrote:
> On Thu, 2017-06-29 at 19:15 +0200, Frederic Weisbecker wrote:
> > From: Wanpeng Li <kernellwp@gmail.com>
> > 
> > Currently the cputime source used by vtime is jiffies. When we cross
> > a context boundary and jiffies have changed since the last snapshot,
> > the
> > pending cputime is accounted to the switching out context.
> > 
> > This system works ok if the ticks are not aligned across CPUs. If
> > they
> > instead are aligned (ie: all fire at the same time) and the CPUs run
> > in
> > userspace, the jiffies change is only observed on tick exit and
> > therefore
> > the user cputime is accounted as system cputime. This is because the
> > CPU that maintains timekeeping fires its tick at the same time as the
> > others. It updates jiffies in the middle of the tick and the other
> > CPUs
> > see that update on IRQ exit:
> > 
> >     CPU 0 (timekeeper)                  CPU 1
> >     -------------------              -------------
> >                       jiffies = N
> >     ...                              run in userspace for a jiffy
> >     tick entry                       tick entry (sees jiffies = N)
> >     set jiffies = N + 1
> >     tick exit                        tick exit (sees jiffies = N + 1)
> >                                                 account 1 jiffy as
> > stime
> > 
> > Fix this with using a nanosec clock source instead of jiffies. The
> > cputime is then accumulated and flushed everytime the pending delta
> > reaches a jiffy in order to mitigate the accounting overhead.
> 
> Glad to hear this could be done without dramatically
> increasing the accounting overhead!

Let's hope so, I actually haven't yet measured whether there is a
performance delta :-s

If there is one, I don't expect it to be big.

Thanks for your reviews!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource
  2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
                     ` (2 preceding siblings ...)
  2017-07-05 10:29   ` [tip:sched/urgent] sched/cputime: " tip-bot for Wanpeng Li
@ 2017-07-15  3:37   ` Levin, Alexander (Sasha Levin)
  2017-07-15  5:26     ` Wanpeng Li
  3 siblings, 1 reply; 25+ messages in thread
From: Levin, Alexander (Sasha Levin) @ 2017-07-15  3:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Wanpeng Li, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar, Rik van Riel

On Thu, Jun 29, 2017 at 07:15:11PM +0200, Frederic Weisbecker wrote:
>From: Wanpeng Li <kernellwp@gmail.com>
>
>Currently the cputime source used by vtime is jiffies. When we cross
>a context boundary and jiffies have changed since the last snapshot, the
>pending cputime is accounted to the switching out context.
>
>This system works ok if the ticks are not aligned across CPUs. If they
>instead are aligned (ie: all fire at the same time) and the CPUs run in
>userspace, the jiffies change is only observed on tick exit and therefore
>the user cputime is accounted as system cputime. This is because the
>CPU that maintains timekeeping fires its tick at the same time as the
>others. It updates jiffies in the middle of the tick and the other CPUs
>see that update on IRQ exit:
>
>    CPU 0 (timekeeper)                  CPU 1
>    -------------------              -------------
>                      jiffies = N
>    ...                              run in userspace for a jiffy
>    tick entry                       tick entry (sees jiffies = N)
>    set jiffies = N + 1
>    tick exit                        tick exit (sees jiffies = N + 1)
>                                                account 1 jiffy as stime
>
>Fix this with using a nanosec clock source instead of jiffies. The
>cputime is then accumulated and flushed everytime the pending delta
>reaches a jiffy in order to mitigate the accounting overhead.
>
>[fweisbec: changelog, rebase on struct vtime, field renames, add delta
>on cputime readers, keep idle vtime as-is (low overhead accounting),
>harmonize clock sources]
>
>Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>Suggested-by: Thomas Gleixner <tglx@linutronix.de>
>Not-Yet-Signed-off-by: Wanpeng Li <kernellwp@gmail.com>
>Cc: Rik van Riel <riel@redhat.com>
>Cc: Peter Zijlstra <peterz@infradead.org>
>Cc: Thomas Gleixner <tglx@linutronix.de>
>Cc: Wanpeng Li <kernellwp@gmail.com>
>Cc: Ingo Molnar <mingo@kernel.org>
>Cc: Luiz Capitulino <lcapitulino@redhat.com>
>Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

Hi all,

This patch seems to be causing this:

BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u9:0/6
caller is debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
CPU: 1 PID: 6 Comm: kworker/u9:0 Not tainted 4.12.0-next-20170714 #187
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Workqueue: events_unbound call_usermodehelper_exec_work
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x11d/0x1ef lib/dump_stack.c:52
 check_preemption_disabled+0x1f4/0x200 lib/smp_processor_id.c:46
 debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
 vtime_delta.isra.6+0x11/0x60 kernel/sched/cputime.c:686
 task_cputime+0x3ca/0x790 kernel/sched/cputime.c:882
 thread_group_cputime+0x51a/0xaa0 kernel/sched/cputime.c:327
 thread_group_cputime_adjusted+0x73/0xf0 kernel/sched/cputime.c:676
 wait_task_zombie kernel/exit.c:1114 [inline]
 wait_consider_task+0x1c82/0x37f0 kernel/exit.c:1389
 do_wait_thread kernel/exit.c:1452 [inline]
 do_wait+0x457/0xb00 kernel/exit.c:1523
 kernel_wait4+0x1fd/0x380 kernel/exit.c:1665
 SYSC_wait4+0x145/0x160 kernel/exit.c:1677
 SyS_wait4+0x2c/0x40 kernel/exit.c:1673
 call_usermodehelper_exec_sync kernel/kmod.c:286 [inline]
 call_usermodehelper_exec_work+0x1fc/0x2c0 kernel/kmod.c:323
 process_one_work+0xae7/0x1a00 kernel/workqueue.c:2097
 worker_thread+0x221/0x1860 kernel/workqueue.c:2231
 kthread+0x35f/0x430 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:425
capability: warning: `syz-executor5' uses 32-bit capabilities (legacy support in use)
BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor6/7013
caller is debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
CPU: 3 PID: 7013 Comm: syz-executor6 Not tainted 4.12.0-next-20170714 #187
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x11d/0x1ef lib/dump_stack.c:52
 check_preemption_disabled+0x1f4/0x200 lib/smp_processor_id.c:46
 debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
 vtime_delta.isra.6+0x11/0x60 kernel/sched/cputime.c:686
 task_cputime+0x3ca/0x790 kernel/sched/cputime.c:882
 thread_group_cputime+0x51a/0xaa0 kernel/sched/cputime.c:327
 thread_group_cputime_adjusted+0x73/0xf0 kernel/sched/cputime.c:676
 wait_task_zombie kernel/exit.c:1114 [inline]
 wait_consider_task+0x1c82/0x37f0 kernel/exit.c:1389
 do_wait_thread kernel/exit.c:1452 [inline]
 do_wait+0x457/0xb00 kernel/exit.c:1523
 kernel_wait4+0x1fd/0x380 kernel/exit.c:1665
 SYSC_wait4+0x145/0x160 kernel/exit.c:1677
 SyS_wait4+0x2c/0x40 kernel/exit.c:1673
 do_syscall_64+0x267/0x740 arch/x86/entry/common.c:284
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x40bd8a
RSP: 002b:00007ffdbdf67b08 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 0000000000b22914 RCX: 000000000040bd8a
RDX: 0000000040000001 RSI: 00007ffdbdf67b4c RDI: ffffffffffffffff
RBP: 0000000000002243 R08: 0000000000001b65 R09: 0000000000b22940
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffdbdf67b4c R14: 0000000000016ee4 R15: 0000000000000016
BUG: using smp_processor_id() in preemptible [00000000] code: init/1
caller is debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
CPU: 3 PID: 1 Comm: init Not tainted 4.12.0-next-20170714 #187
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x11d/0x1ef lib/dump_stack.c:52
 check_preemption_disabled+0x1f4/0x200 lib/smp_processor_id.c:46
 debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
 vtime_delta.isra.6+0x11/0x60 kernel/sched/cputime.c:686
 task_cputime+0x3ca/0x790 kernel/sched/cputime.c:882
 thread_group_cputime+0x51a/0xaa0 kernel/sched/cputime.c:327
 thread_group_cputime_adjusted+0x73/0xf0 kernel/sched/cputime.c:676
 wait_task_zombie kernel/exit.c:1114 [inline]
 wait_consider_task+0x1c82/0x37f0 kernel/exit.c:1389
 do_wait_thread kernel/exit.c:1452 [inline]
 do_wait+0x457/0xb00 kernel/exit.c:1523
 kernel_wait4+0x1fd/0x380 kernel/exit.c:1665
 SYSC_wait4+0x145/0x160 kernel/exit.c:1677
 SyS_wait4+0x2c/0x40 kernel/exit.c:1673
 do_syscall_64+0x267/0x740 arch/x86/entry/common.c:284
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f61952dca3e
RSP: 002b:00007fff93bafea0 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 00007f6195c326a0 RCX: 00007f61952dca3e
RDX: 0000000000000001 RSI: 00007fff93bafedc RDI: ffffffffffffffff
RBP: 00007fff93bafedc R08: 00007fff93bb0870 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
R13: 00007fff93bb0bd0 R14: 0000000000000000 R15: 0000000000000000

-- 

Thanks,
Sasha

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource
  2017-07-15  3:37   ` [PATCH 5/5] sched: " Levin, Alexander (Sasha Levin)
@ 2017-07-15  5:26     ` Wanpeng Li
  0 siblings, 0 replies; 25+ messages in thread
From: Wanpeng Li @ 2017-07-15  5:26 UTC (permalink / raw)
  To: Levin, Alexander (Sasha Levin)
  Cc: Frederic Weisbecker, LKML, Peter Zijlstra, Thomas Gleixner,
	Luiz Capitulino, Ingo Molnar, Rik van Riel

2017-07-15 11:37 GMT+08:00 Levin, Alexander (Sasha Levin)
<alexander.levin@verizon.com>:
> On Thu, Jun 29, 2017 at 07:15:11PM +0200, Frederic Weisbecker wrote:
>>From: Wanpeng Li <kernellwp@gmail.com>
>>
>>Currently the cputime source used by vtime is jiffies. When we cross
>>a context boundary and jiffies have changed since the last snapshot, the
>>pending cputime is accounted to the switching out context.
>>
>>This system works ok if the ticks are not aligned across CPUs. If they
>>instead are aligned (ie: all fire at the same time) and the CPUs run in
>>userspace, the jiffies change is only observed on tick exit and therefore
>>the user cputime is accounted as system cputime. This is because the
>>CPU that maintains timekeeping fires its tick at the same time as the
>>others. It updates jiffies in the middle of the tick and the other CPUs
>>see that update on IRQ exit:
>>
>>    CPU 0 (timekeeper)                  CPU 1
>>    -------------------              -------------
>>                      jiffies = N
>>    ...                              run in userspace for a jiffy
>>    tick entry                       tick entry (sees jiffies = N)
>>    set jiffies = N + 1
>>    tick exit                        tick exit (sees jiffies = N + 1)
>>                                                account 1 jiffy as stime
>>
>>Fix this with using a nanosec clock source instead of jiffies. The
>>cputime is then accumulated and flushed everytime the pending delta
>>reaches a jiffy in order to mitigate the accounting overhead.
>>
>>[fweisbec: changelog, rebase on struct vtime, field renames, add delta
>>on cputime readers, keep idle vtime as-is (low overhead accounting),
>>harmonize clock sources]
>>
>>Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
>>Suggested-by: Thomas Gleixner <tglx@linutronix.de>
>>Not-Yet-Signed-off-by: Wanpeng Li <kernellwp@gmail.com>
>>Cc: Rik van Riel <riel@redhat.com>
>>Cc: Peter Zijlstra <peterz@infradead.org>
>>Cc: Thomas Gleixner <tglx@linutronix.de>
>>Cc: Wanpeng Li <kernellwp@gmail.com>
>>Cc: Ingo Molnar <mingo@kernel.org>
>>Cc: Luiz Capitulino <lcapitulino@redhat.com>
>>Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
>
> Hi all,
>
> This patch seems to be causing this:

Yeah, there is a patch to fix it.
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=0e4097c3354e2f5a5ad8affd9dc7f7f7d00bb6b9
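
Assuming that commit does what its title and the trace suggest, the gist
is to stop deriving the clock from smp_processor_id() in paths that can
run preemptible, roughly along these lines (a sketch, not the actual hunk):

    /*
     * Sketch of the direction of the fix: sched_clock() is safe to
     * call from preemptible context, unlike sched_clock_cpu() keyed
     * on smp_processor_id().
     */
    static u64 vtime_delta(struct vtime *vtime)
    {
            u64 clock = sched_clock();      /* no smp_processor_id() */

            if (clock < vtime->starttime)
                    return 0;

            return clock - vtime->starttime;
    }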

Regards,
Wanpeng Li

>
> BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u9:0/6
> caller is debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
> CPU: 1 PID: 6 Comm: kworker/u9:0 Not tainted 4.12.0-next-20170714 #187
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
> Workqueue: events_unbound call_usermodehelper_exec_work
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x11d/0x1ef lib/dump_stack.c:52
>  check_preemption_disabled+0x1f4/0x200 lib/smp_processor_id.c:46
>  debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
>  vtime_delta.isra.6+0x11/0x60 kernel/sched/cputime.c:686
>  task_cputime+0x3ca/0x790 kernel/sched/cputime.c:882
>  thread_group_cputime+0x51a/0xaa0 kernel/sched/cputime.c:327
>  thread_group_cputime_adjusted+0x73/0xf0 kernel/sched/cputime.c:676
>  wait_task_zombie kernel/exit.c:1114 [inline]
>  wait_consider_task+0x1c82/0x37f0 kernel/exit.c:1389
>  do_wait_thread kernel/exit.c:1452 [inline]
>  do_wait+0x457/0xb00 kernel/exit.c:1523
>  kernel_wait4+0x1fd/0x380 kernel/exit.c:1665
>  SYSC_wait4+0x145/0x160 kernel/exit.c:1677
>  SyS_wait4+0x2c/0x40 kernel/exit.c:1673
>  call_usermodehelper_exec_sync kernel/kmod.c:286 [inline]
>  call_usermodehelper_exec_work+0x1fc/0x2c0 kernel/kmod.c:323
>  process_one_work+0xae7/0x1a00 kernel/workqueue.c:2097
>  worker_thread+0x221/0x1860 kernel/workqueue.c:2231
>  kthread+0x35f/0x430 kernel/kthread.c:231
>  ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:425
> capability: warning: `syz-executor5' uses 32-bit capabilities (legacy support in use)
> BUG: using smp_processor_id() in preemptible [00000000] code: syz-executor6/7013
> caller is debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
> CPU: 3 PID: 7013 Comm: syz-executor6 Not tainted 4.12.0-next-20170714 #187
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x11d/0x1ef lib/dump_stack.c:52
>  check_preemption_disabled+0x1f4/0x200 lib/smp_processor_id.c:46
>  debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
>  vtime_delta.isra.6+0x11/0x60 kernel/sched/cputime.c:686
>  task_cputime+0x3ca/0x790 kernel/sched/cputime.c:882
>  thread_group_cputime+0x51a/0xaa0 kernel/sched/cputime.c:327
>  thread_group_cputime_adjusted+0x73/0xf0 kernel/sched/cputime.c:676
>  wait_task_zombie kernel/exit.c:1114 [inline]
>  wait_consider_task+0x1c82/0x37f0 kernel/exit.c:1389
>  do_wait_thread kernel/exit.c:1452 [inline]
>  do_wait+0x457/0xb00 kernel/exit.c:1523
>  kernel_wait4+0x1fd/0x380 kernel/exit.c:1665
>  SYSC_wait4+0x145/0x160 kernel/exit.c:1677
>  SyS_wait4+0x2c/0x40 kernel/exit.c:1673
>  do_syscall_64+0x267/0x740 arch/x86/entry/common.c:284
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x40bd8a
> RSP: 002b:00007ffdbdf67b08 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
> RAX: ffffffffffffffda RBX: 0000000000b22914 RCX: 000000000040bd8a
> RDX: 0000000040000001 RSI: 00007ffdbdf67b4c RDI: ffffffffffffffff
> RBP: 0000000000002243 R08: 0000000000001b65 R09: 0000000000b22940
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007ffdbdf67b4c R14: 0000000000016ee4 R15: 0000000000000016
> BUG: using smp_processor_id() in preemptible [00000000] code: init/1
> caller is debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
> CPU: 3 PID: 1 Comm: init Not tainted 4.12.0-next-20170714 #187
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x11d/0x1ef lib/dump_stack.c:52
>  check_preemption_disabled+0x1f4/0x200 lib/smp_processor_id.c:46
>  debug_smp_processor_id+0x1c/0x20 lib/smp_processor_id.c:56
>  vtime_delta.isra.6+0x11/0x60 kernel/sched/cputime.c:686
>  task_cputime+0x3ca/0x790 kernel/sched/cputime.c:882
>  thread_group_cputime+0x51a/0xaa0 kernel/sched/cputime.c:327
>  thread_group_cputime_adjusted+0x73/0xf0 kernel/sched/cputime.c:676
>  wait_task_zombie kernel/exit.c:1114 [inline]
>  wait_consider_task+0x1c82/0x37f0 kernel/exit.c:1389
>  do_wait_thread kernel/exit.c:1452 [inline]
>  do_wait+0x457/0xb00 kernel/exit.c:1523
>  kernel_wait4+0x1fd/0x380 kernel/exit.c:1665
>  SYSC_wait4+0x145/0x160 kernel/exit.c:1677
>  SyS_wait4+0x2c/0x40 kernel/exit.c:1673
>  do_syscall_64+0x267/0x740 arch/x86/entry/common.c:284
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 0033:0x7f61952dca3e
> RSP: 002b:00007fff93bafea0 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
> RAX: ffffffffffffffda RBX: 00007f6195c326a0 RCX: 00007f61952dca3e
> RDX: 0000000000000001 RSI: 00007fff93bafedc RDI: ffffffffffffffff
> RBP: 00007fff93bafedc R08: 00007fff93bb0870 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
> R13: 00007fff93bb0bd0 R14: 0000000000000000 R15: 0000000000000000
>
> --
>
> Thanks,
> Sasha

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-07-15  5:26 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-29 17:15 [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 1/5] vtime: Remove vtime_account_user() Frederic Weisbecker
2017-06-29 23:01   ` Rik van Riel
2017-07-05 10:28   ` [tip:sched/urgent] vtime, sched/cputime: " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 2/5] sched: Always set vtime_snap_whence after accounting vtime Frederic Weisbecker
2017-06-29 23:01   ` Rik van Riel
2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: Always set tsk->vtime_snap_whence " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 3/5] sched: Rename vtime fields Frederic Weisbecker
2017-06-29 23:02   ` Rik van Riel
2017-07-05 10:28   ` [tip:sched/urgent] sched/cputime: " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 4/5] sched: Move vtime task fields to their own struct Frederic Weisbecker
2017-06-29 23:05   ` Rik van Riel
2017-07-05 10:29   ` [tip:sched/urgent] sched/cputime: Move the " tip-bot for Frederic Weisbecker
2017-06-29 17:15 ` [PATCH 5/5] sched: Accumulate vtime on top of nsec clocksource Frederic Weisbecker
2017-06-29 23:27   ` Rik van Riel
2017-07-05 13:20     ` Frederic Weisbecker
2017-06-30  1:52   ` Wanpeng Li
2017-07-05 10:29   ` [tip:sched/urgent] sched/cputime: " tip-bot for Wanpeng Li
2017-07-15  3:37   ` [PATCH 5/5] sched: " Levin, Alexander (Sasha Levin)
2017-07-15  5:26     ` Wanpeng Li
2017-06-30  1:41 ` [RFC PATCH 0/5] vtime: Fix wrong user and system time accounting Wanpeng Li
2017-06-30 17:32   ` Luiz Capitulino
2017-07-03 10:28 ` Thomas Gleixner
2017-07-04 16:52 ` Luiz Capitulino
2017-07-05 13:16   ` Frederic Weisbecker
