linux-kernel.vger.kernel.org archive mirror
* [GIT PULL] cputime fixes and cleanups
@ 2016-07-13 14:50 Frederic Weisbecker
  2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
                   ` (4 more replies)
  0 siblings, 5 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Paolo Bonzini, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith,
	Rik van Riel

Ingo,

Please pull the sched/cputime branch that can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	sched/cputime

It is based on top of tip:sched/core because it conflicts with commits
in that branch.

HEAD: 363e4ca0466a3db318fce18e53f6e492bffc0202

---
Summary of changes:

* Fix system/idle cputime leaks in cputime accounting (all nohz configs)
* Remove the messy ad-hoc nohz full irqtime accounting and make nohz full
  compatible with CONFIG_IRQ_TIME_ACCOUNTING instead.
* Cleanups
* Remove unnecessary irq disablement in the irqtime code.

Thanks,
	Frederic
---

Rik van Riel (3):
      sched,time: Count actually elapsed irq & softirq time
      nohz,cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
      time: Drop local_irq_save/restore from irqtime_account_irq

Frederic Weisbecker (2):
      sched: Complete cleanup of old vtime gen irqtime accounting
      sched: Reorganize vtime native irqtime accounting headers


 include/asm-generic/cputime_nsecs.h |   2 +
 include/linux/vtime.h               |  50 ++++-------
 init/Kconfig                        |   6 +-
 kernel/sched/cputime.c              | 171 ++++++++++++++++++------------------
 4 files changed, 109 insertions(+), 120 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
@ 2016-07-13 14:50 ` Frederic Weisbecker
  2016-07-14 10:37   ` [tip:timers/nohz] sched/cputime: " tip-bot for Rik van Riel
  2016-08-09  3:59   ` [PATCH 1/5] sched,time: " Wanpeng Li
  2016-07-13 14:50 ` [PATCH 2/5] nohz,cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code Frederic Weisbecker
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Frederic Weisbecker,
	Mike Galbraith

From: Rik van Riel <riel@redhat.com>

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
and steal_account_process_tick to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_tick to steal_account_process_time, to
reflect that time is now accounted in cputime_t instead of ticks.

Additionally, have irqtime_account_process_tick take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.
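
To make the effect concrete, here is a sketch in C-style pseudocode with
made-up numbers (not part of the patch), contrasting the old and new
logic from the diff below:

	/*
	 * Hypothetical scenario: 10 ticks get accounted at once on
	 * return from idle, during which 1 jiffy of hardirq time and
	 * 1 jiffy of softirq time actually elapsed.
	 */
	u64 cputime = (__force u64) cputime_one_jiffy * 10;

	/* Old code: boolean check, so all 10 ticks go to irq time */
	if (irqtime_account_hi_update())		/* returns 1 */
		cpustat[CPUTIME_IRQ] += cputime;	/* 10 jiffies! */

	/* New code: the helpers return the clamped elapsed time */
	other = account_other_time(cputime);	/* 2 jiffies */
	cputime -= other;	/* 8 jiffies left for idle/user/system */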

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Radim Krcmar <rkrcmar@redhat.com>
[fweisbec: Make nsecs_to_cputime64() actually return cputime64_t]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/asm-generic/cputime_nsecs.h |   2 +
 kernel/sched/cputime.c              | 124 ++++++++++++++++++++++--------------
 2 files changed, 79 insertions(+), 47 deletions(-)

diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h
index 0f1c6f3..a84e28e 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -50,6 +50,8 @@ typedef u64 __nocast cputime64_t;
 	(__force u64)(__ct)
 #define nsecs_to_cputime(__nsecs)	\
 	(__force cputime_t)(__nsecs)
+#define nsecs_to_cputime64(__nsecs)	\
+	(__force cputime64_t)(__nsecs)
 
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d..db82ae1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static cputime_t irqtime_account_hi_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t irq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) -
+		      cpustat[CPUTIME_IRQ];
+	irq_cputime = min(irq_cputime, maxtime);
+	cpustat[CPUTIME_IRQ] += irq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return irq_cputime;
 }
 
-static int irqtime_account_si_update(void)
+static cputime_t irqtime_account_si_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t softirq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)) -
+			  cpustat[CPUTIME_SOFTIRQ];
+	softirq_cputime = min(softirq_cputime, maxtime);
+	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return softirq_cputime;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static cputime_t irqtime_account_hi_update(cputime_t dummy)
+{
+	return 0;
+}
+
+static cputime_t irqtime_account_si_update(cputime_t dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies)
+static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
+		cputime_t steal_cputime;
 		u64 steal;
-		unsigned long steal_jiffies;
 
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
 
-		/*
-		 * steal is in nsecs but our caller is expecting steal
-		 * time in jiffies. Lets cast the result to jiffies
-		 * granularity and account the rest on the next rounds.
-		 */
-		steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
-		this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
+		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		account_steal_time(steal_cputime);
+		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
 
-		account_steal_time(jiffies_to_cputime(steal_jiffies));
-		return steal_jiffies;
+		return steal_cputime;
 	}
 #endif
 	return 0;
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ */
+static inline cputime_t account_other_time(cputime_t max)
+{
+	cputime_t accounted;
+
+	accounted = steal_account_process_time(max);
+
+	if (accounted < max)
+		accounted += irqtime_account_hi_update(max - accounted);
+
+	if (accounted < max)
+		accounted += irqtime_account_si_update(max - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -342,21 +365,23 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 					 struct rq *rq, int ticks)
 {
-	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
-	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	u64 cputime = (__force u64) cputime_one_jiffy * ticks;
+	cputime_t scaled, other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_time(cputime);
+	if (other >= cputime)
 		return;
+	cputime -= other;
+	scaled = cputime_to_scaled(cputime);
 
-	cputime *= ticks;
-	scaled *= ticks;
-
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
@@ -466,7 +491,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
  */
 void account_process_tick(struct task_struct *p, int user_tick)
 {
-	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+	cputime_t cputime, scaled, steal;
 	struct rq *rq = this_rq();
 
 	if (vtime_accounting_cpu_enabled())
@@ -477,16 +502,21 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		return;
 	}
 
-	if (steal_account_process_tick(ULONG_MAX))
+	cputime = cputime_one_jiffy;
+	steal = steal_account_process_time(cputime);
+
+	if (steal >= cputime)
 		return;
 
+	cputime -= steal;
+	scaled = cputime_to_scaled(cputime);
+
 	if (user_tick)
-		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+		account_user_time(p, cputime, scaled);
 	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
-		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
-				    one_jiffy_scaled);
+		account_system_time(p, HARDIRQ_OFFSET, cputime, scaled);
 	else
-		account_idle_time(cputime_one_jiffy);
+		account_idle_time(cputime);
 }
 
 /*
@@ -681,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	cputime_t delta, steal;
 
-	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	delta = jiffies_to_cputime(now - tsk->vtime_snap);
+	steal = steal_account_process_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return delta - steal;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/5] nohz,cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
  2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
  2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
@ 2016-07-13 14:50 ` Frederic Weisbecker
  2016-07-14 10:37   ` [tip:timers/nohz] sched/cputime: " tip-bot for Rik van Riel
  2016-07-13 14:50 ` [PATCH 3/5] sched: Complete cleanup of old vtime gen irqtime accounting Frederic Weisbecker
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Frederic Weisbecker,
	Mike Galbraith

From: Rik van Riel <riel@redhat.com>

The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
currently appear to work correctly.

On CPUs without nohz_full=, only tick-based irq time sampling is
done, which breaks down when dealing with a nohz_idle CPU.

On firewalls and similar systems, no ticks may happen on a CPU for a
while, and the irq time spent may never get accounted properly. This
can cause issues with capacity planning and power saving, which use
the CPU statistics as inputs in decision making.

Remove the VTIME_GEN vtime irq time code and replace it with the
IRQ_TIME_ACCOUNTING code when the user selects that config option.
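
For reference, the IRQ_TIME_ACCOUNTING path measures irq time directly,
reading a sched_clock() timestamp on each irq entry/exit instead of
sampling from the tick. Condensed from irqtime_account_irq() in
kernel/sched/cputime.c as of this series (curr is the interrupted task;
see also patch 5):

	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
	__this_cpu_add(irq_start_time, delta);

	/* Charge the delta to hardirq or softirq time, as appropriate */
	if (hardirq_count())
		__this_cpu_add(cpu_hardirq_time, delta);
	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
		__this_cpu_add(cpu_softirq_time, delta);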

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/vtime.h  | 32 ++++++++++++++------------------
 init/Kconfig           |  6 +++---
 kernel/sched/cputime.c | 16 +++-------------
 3 files changed, 20 insertions(+), 34 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index fa21969..d1977d84 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,6 +14,18 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+
+#ifdef __ARCH_HAS_VTIME_ACCOUNT
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+#else
+extern void vtime_common_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_common_account_irq_enter(tsk);
+}
+#endif /* __ARCH_HAS_VTIME_ACCOUNT */
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -64,17 +76,6 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
@@ -85,13 +86,8 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_gen_account_irq_exit(struct task_struct *tsk);
-
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_gen_account_irq_exit(tsk);
-}
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
 
 extern void vtime_user_enter(struct task_struct *tsk);
 
diff --git a/init/Kconfig b/init/Kconfig
index f755a60..397c2a63 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,9 +375,11 @@ config VIRT_CPU_ACCOUNTING_GEN
 
 	  If unsure, say N.
 
+endchoice
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -386,8 +388,6 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
-endchoice
-
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index db82ae1..ca7e33c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -711,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	cputime_t delta, steal;
+	cputime_t delta, other;
 
 	delta = jiffies_to_cputime(now - tsk->vtime_snap);
-	steal = steal_account_process_time(delta);
+	other = account_other_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return delta - steal;
+	return delta - other;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -738,16 +738,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_gen_account_irq_exit(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	if (context_tracking_in_user())
-		tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_account_user(struct task_struct *tsk)
 {
 	cputime_t delta_cpu;
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/5] sched: Complete cleanup of old vtime gen irqtime accounting
  2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
  2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
  2016-07-13 14:50 ` [PATCH 2/5] nohz,cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code Frederic Weisbecker
@ 2016-07-13 14:50 ` Frederic Weisbecker
  2016-07-14 10:38   ` [tip:timers/nohz] sched/cputime: Clean up the old vtime gen irqtime accounting completely tip-bot for Frederic Weisbecker
  2016-07-13 14:50 ` [PATCH 4/5] sched: Reorganize vtime native irqtime accounting headers Frederic Weisbecker
  2016-07-13 14:50 ` [PATCH 5/5] time: Drop local_irq_save/restore from irqtime_account_irq Frederic Weisbecker
  4 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Paolo Bonzini, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith,
	Rik van Riel

Vtime generic irqtime accounting has been removed but there are a few
remnants to clean up:

* The vtime_accounting_cpu_enabled() check in irq entry was only used
  by CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can safely remove it.

* Without the vtime_accounting_cpu_enabled() check, we no longer need
  the vtime_common_account_irq_enter() indirection.

* Move the vtime_account_irq_enter() implementation under
  CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, which is its last user.

* The vtime_account_user() call was only used on irq entry for
  CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can remove that too.

Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/vtime.h  | 11 -----------
 kernel/sched/cputime.c | 33 ++++++++++-----------------------
 2 files changed, 10 insertions(+), 34 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84..65aef5e 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,18 +14,7 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
-
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
 extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index ca7e33c..16a873c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -431,6 +431,10 @@ void vtime_common_task_switch(struct task_struct *prev)
 }
 #endif
 
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Archs that account the whole time spent in the idle task
  * (outside irq) as idle time can rely on this and just implement
@@ -440,33 +444,16 @@ void vtime_common_task_switch(struct task_struct *prev)
  * vtime_account().
  */
 #ifndef __ARCH_HAS_VTIME_ACCOUNT
-void vtime_common_account_irq_enter(struct task_struct *tsk)
+void vtime_account_irq_enter(struct task_struct *tsk)
 {
-	if (!in_interrupt()) {
-		/*
-		 * If we interrupted user, context_tracking_in_user()
-		 * is 1 because the context tracking don't hook
-		 * on irq entry/exit. This way we know if
-		 * we need to flush user time on kernel entry.
-		 */
-		if (context_tracking_in_user()) {
-			vtime_account_user(tsk);
-			return;
-		}
-
-		if (is_idle_task(tsk)) {
-			vtime_account_idle(tsk);
-			return;
-		}
-	}
-	vtime_account_system(tsk);
+	if (!in_interrupt() && is_idle_task(tsk))
+		vtime_account_idle(tsk);
+	else
+		vtime_account_system(tsk);
 }
-EXPORT_SYMBOL_GPL(vtime_common_account_irq_enter);
+EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	*ut = p->utime;
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/5] sched: Reorganize vtime native irqtime accounting headers
  2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2016-07-13 14:50 ` [PATCH 3/5] sched: Complete cleanup of old vtime gen irqtime accounting Frederic Weisbecker
@ 2016-07-13 14:50 ` Frederic Weisbecker
  2016-07-14 10:38   ` [tip:timers/nohz] sched/cputime: " tip-bot for Frederic Weisbecker
  2016-07-13 14:50 ` [PATCH 5/5] time: Drop local_irq_save/restore from irqtime_account_irq Frederic Weisbecker
  4 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Paolo Bonzini, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith,
	Rik van Riel

The vtime irqtime accounting headers are very scattered and convoluted
right now. Reorganize them so that it is obvious that only
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE uses them.

Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/vtime.h | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 65aef5e..aa9bfea 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -12,12 +12,9 @@ struct task_struct;
 /*
  * vtime_accounting_cpu_enabled() definitions/declarations
  */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+#elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
 /*
  * Checks if vtime is enabled on some CPU. Cputime readers want to be careful
  * in that case and compute the tickless cputime.
@@ -38,11 +35,9 @@ static inline bool vtime_accounting_cpu_enabled(void)
 
 	return false;
 }
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 static inline bool vtime_accounting_cpu_enabled(void) { return false; }
-#endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
+#endif
 
 
 /*
@@ -70,14 +65,10 @@ extern void vtime_account_user(struct task_struct *tsk);
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
 static inline void vtime_account_user(struct task_struct *tsk) { }
-static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
-static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
-
 extern void vtime_user_enter(struct task_struct *tsk);
 
 static inline void vtime_user_exit(struct task_struct *tsk)
@@ -88,11 +79,6 @@ extern void vtime_guest_enter(struct task_struct *tsk);
 extern void vtime_guest_exit(struct task_struct *tsk);
 extern void vtime_init_idle(struct task_struct *tsk, int cpu);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN  */
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	/* On hard|softirq exit we always account to hard|softirq cputime */
-	vtime_account_system(tsk);
-}
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
 static inline void vtime_guest_enter(struct task_struct *tsk) { }
@@ -100,6 +86,19 @@ static inline void vtime_guest_exit(struct task_struct *tsk) { }
 static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_exit(struct task_struct *tsk)
+{
+	/* On hard|softirq exit we always account to hard|softirq cputime */
+	vtime_account_system(tsk);
+}
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
+#endif
+
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 extern void irqtime_account_irq(struct task_struct *tsk);
 #else
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 5/5] time: Drop local_irq_save/restore from irqtime_account_irq
  2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2016-07-13 14:50 ` [PATCH 4/5] sched: Reorganize vtime native irqtime accounting headers Frederic Weisbecker
@ 2016-07-13 14:50 ` Frederic Weisbecker
  2016-07-14 10:38   ` [tip:timers/nohz] sched/cputime: Drop local_irq_save/restore from irqtime_account_irq() tip-bot for Rik van Riel
  4 siblings, 1 reply; 42+ messages in thread
From: Frederic Weisbecker @ 2016-07-13 14:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Frederic Weisbecker,
	Mike Galbraith

From: Rik van Riel <riel@redhat.com>

Paolo pointed out that irqs are already blocked when irqtime_account_irq
is called. That means there is no reason to call local_irq_save/restore
again.
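
For context, its callers in include/linux/hardirq.h (as of this series,
shown slightly condensed) run from irq_enter()/irq_exit(), where
interrupts are already disabled:

	static inline void account_irq_enter_time(struct task_struct *tsk)
	{
		vtime_account_irq_enter(tsk);
		irqtime_account_irq(tsk);
	}

	static inline void account_irq_exit_time(struct task_struct *tsk)
	{
		vtime_account_irq_exit(tsk);
		irqtime_account_irq(tsk);
	}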

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/cputime.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 16a873c..ea0f6f3 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -49,15 +49,12 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  */
 void irqtime_account_irq(struct task_struct *curr)
 {
-	unsigned long flags;
 	s64 delta;
 	int cpu;
 
 	if (!sched_clock_irqtime)
 		return;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
 	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
 	__this_cpu_add(irq_start_time, delta);
@@ -75,7 +72,6 @@ void irqtime_account_irq(struct task_struct *curr)
 		__this_cpu_add(cpu_softirq_time, delta);
 
 	irq_time_write_end();
-	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:timers/nohz] sched/cputime: Count actually elapsed irq & softirq time
  2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
@ 2016-07-14 10:37   ` tip-bot for Rik van Riel
  2016-08-09  3:59   ` [PATCH 1/5] sched,time: " Wanpeng Li
  1 sibling, 0 replies; 42+ messages in thread
From: tip-bot for Rik van Riel @ 2016-07-14 10:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: riel, torvalds, rkrcmar, pbonzini, mingo, peterz, wanpeng.li,
	fweisbec, linux-kernel, efault, hpa, tglx

Commit-ID:  57430218317e5b280a80582a139b26029c25de6c
Gitweb:     http://git.kernel.org/tip/57430218317e5b280a80582a139b26029c25de6c
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 13 Jul 2016 16:50:01 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 14 Jul 2016 10:42:34 +0200

sched/cputime: Count actually elapsed irq & softirq time

Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update(), irqtime_account_si_update(),
and steal_account_process_tick() to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_tick() to steal_account_process_time(), to
reflect that time is now accounted in cputime_t instead of ticks.

Additionally, have irqtime_account_process_tick() take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.

Signed-off-by: Rik van Riel <riel@redhat.com>
[ Make nsecs_to_cputime64() actually return cputime64_t. ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/cputime_nsecs.h |   2 +
 kernel/sched/cputime.c              | 124 ++++++++++++++++++++++--------------
 2 files changed, 79 insertions(+), 47 deletions(-)

diff --git a/include/asm-generic/cputime_nsecs.h b/include/asm-generic/cputime_nsecs.h
index 0f1c6f3..a84e28e 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -50,6 +50,8 @@ typedef u64 __nocast cputime64_t;
 	(__force u64)(__ct)
 #define nsecs_to_cputime(__nsecs)	\
 	(__force cputime_t)(__nsecs)
+#define nsecs_to_cputime64(__nsecs)	\
+	(__force cputime64_t)(__nsecs)
 
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3d60e5d..db82ae1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -79,40 +79,50 @@ void irqtime_account_irq(struct task_struct *curr)
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
-static int irqtime_account_hi_update(void)
+static cputime_t irqtime_account_hi_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t irq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
-		ret = 1;
+	irq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_hardirq_time)) -
+		      cpustat[CPUTIME_IRQ];
+	irq_cputime = min(irq_cputime, maxtime);
+	cpustat[CPUTIME_IRQ] += irq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return irq_cputime;
 }
 
-static int irqtime_account_si_update(void)
+static cputime_t irqtime_account_si_update(cputime_t maxtime)
 {
 	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
-	u64 latest_ns;
-	int ret = 0;
+	cputime_t softirq_cputime;
 
 	local_irq_save(flags);
-	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
-		ret = 1;
+	softirq_cputime = nsecs_to_cputime64(this_cpu_read(cpu_softirq_time)) -
+			  cpustat[CPUTIME_SOFTIRQ];
+	softirq_cputime = min(softirq_cputime, maxtime);
+	cpustat[CPUTIME_SOFTIRQ] += softirq_cputime;
 	local_irq_restore(flags);
-	return ret;
+	return softirq_cputime;
 }
 
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 
 #define sched_clock_irqtime	(0)
 
+static cputime_t irqtime_account_hi_update(cputime_t dummy)
+{
+	return 0;
+}
+
+static cputime_t irqtime_account_si_update(cputime_t dummy)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
 
 static inline void task_group_account_field(struct task_struct *p, int index,
@@ -257,32 +267,45 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline unsigned long steal_account_process_tick(unsigned long max_jiffies)
+static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
+		cputime_t steal_cputime;
 		u64 steal;
-		unsigned long steal_jiffies;
 
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
 
-		/*
-		 * steal is in nsecs but our caller is expecting steal
-		 * time in jiffies. Lets cast the result to jiffies
-		 * granularity and account the rest on the next rounds.
-		 */
-		steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
-		this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);
+		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		account_steal_time(steal_cputime);
+		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
 
-		account_steal_time(jiffies_to_cputime(steal_jiffies));
-		return steal_jiffies;
+		return steal_cputime;
 	}
 #endif
 	return 0;
 }
 
 /*
+ * Account how much elapsed time was spent in steal, irq, or softirq time.
+ */
+static inline cputime_t account_other_time(cputime_t max)
+{
+	cputime_t accounted;
+
+	accounted = steal_account_process_time(max);
+
+	if (accounted < max)
+		accounted += irqtime_account_hi_update(max - accounted);
+
+	if (accounted < max)
+		accounted += irqtime_account_si_update(max - accounted);
+
+	return accounted;
+}
+
+/*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
  */
@@ -342,21 +365,23 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 					 struct rq *rq, int ticks)
 {
-	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
-	u64 cputime = (__force u64) cputime_one_jiffy;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	u64 cputime = (__force u64) cputime_one_jiffy * ticks;
+	cputime_t scaled, other;
 
-	if (steal_account_process_tick(ULONG_MAX))
+	/*
+	 * When returning from idle, many ticks can get accounted at
+	 * once, including some ticks of steal, irq, and softirq time.
+	 * Subtract those ticks from the amount of time accounted to
+	 * idle, or potentially user or system time. Due to rounding,
+	 * other time can exceed ticks occasionally.
+	 */
+	other = account_other_time(cputime);
+	if (other >= cputime)
 		return;
+	cputime -= other;
+	scaled = cputime_to_scaled(cputime);
 
-	cputime *= ticks;
-	scaled *= ticks;
-
-	if (irqtime_account_hi_update()) {
-		cpustat[CPUTIME_IRQ] += cputime;
-	} else if (irqtime_account_si_update()) {
-		cpustat[CPUTIME_SOFTIRQ] += cputime;
-	} else if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
@@ -466,7 +491,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
  */
 void account_process_tick(struct task_struct *p, int user_tick)
 {
-	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
+	cputime_t cputime, scaled, steal;
 	struct rq *rq = this_rq();
 
 	if (vtime_accounting_cpu_enabled())
@@ -477,16 +502,21 @@ void account_process_tick(struct task_struct *p, int user_tick)
 		return;
 	}
 
-	if (steal_account_process_tick(ULONG_MAX))
+	cputime = cputime_one_jiffy;
+	steal = steal_account_process_time(cputime);
+
+	if (steal >= cputime)
 		return;
 
+	cputime -= steal;
+	scaled = cputime_to_scaled(cputime);
+
 	if (user_tick)
-		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
+		account_user_time(p, cputime, scaled);
 	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
-		account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
-				    one_jiffy_scaled);
+		account_system_time(p, HARDIRQ_OFFSET, cputime, scaled);
 	else
-		account_idle_time(cputime_one_jiffy);
+		account_idle_time(cputime);
 }
 
 /*
@@ -681,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta_jiffies, steal_jiffies;
+	cputime_t delta, steal;
 
-	delta_jiffies = now - tsk->vtime_snap;
-	steal_jiffies = steal_account_process_tick(delta_jiffies);
+	delta = jiffies_to_cputime(now - tsk->vtime_snap);
+	steal = steal_account_process_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta_jiffies - steal_jiffies);
+	return delta - steal;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:timers/nohz] sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
  2016-07-13 14:50 ` [PATCH 2/5] nohz,cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code Frederic Weisbecker
@ 2016-07-14 10:37   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Rik van Riel @ 2016-07-14 10:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: wanpeng.li, fweisbec, torvalds, tglx, linux-kernel, peterz, riel,
	efault, hpa, pbonzini, rkrcmar, mingo

Commit-ID:  b58c35840521bb02b150e1d0d34ca9197f8b7145
Gitweb:     http://git.kernel.org/tip/b58c35840521bb02b150e1d0d34ca9197f8b7145
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 13 Jul 2016 16:50:02 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 14 Jul 2016 10:42:34 +0200

sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code

The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
currently appear to work correctly.

On CPUs without nohz_full=, only tick-based irq time sampling is
done, which breaks down when dealing with a nohz_idle CPU.

On firewalls and similar systems, no ticks may happen on a CPU for a
while, and the irq time spent may never get accounted properly. This
can cause issues with capacity planning and power saving, which use
the CPU statistics as inputs in decision making.

Remove the VTIME_GEN vtime irq time code and replace it with the
IRQ_TIME_ACCOUNTING code when the user selects that config option.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/vtime.h  | 32 ++++++++++++++------------------
 init/Kconfig           |  6 +++---
 kernel/sched/cputime.c | 16 +++-------------
 3 files changed, 20 insertions(+), 34 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index fa21969..d1977d84 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,6 +14,18 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+
+#ifdef __ARCH_HAS_VTIME_ACCOUNT
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+#else
+extern void vtime_common_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_enter(struct task_struct *tsk)
+{
+	if (vtime_accounting_cpu_enabled())
+		vtime_common_account_irq_enter(tsk);
+}
+#endif /* __ARCH_HAS_VTIME_ACCOUNT */
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -64,17 +76,6 @@ extern void vtime_account_system(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
 extern void vtime_account_user(struct task_struct *tsk);
 
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 static inline void vtime_task_switch(struct task_struct *prev) { }
@@ -85,13 +86,8 @@ static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-extern void vtime_gen_account_irq_exit(struct task_struct *tsk);
-
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_gen_account_irq_exit(tsk);
-}
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
 
 extern void vtime_user_enter(struct task_struct *tsk);
 
diff --git a/init/Kconfig b/init/Kconfig
index c02d897..787dd76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -375,9 +375,11 @@ config VIRT_CPU_ACCOUNTING_GEN
 
 	  If unsure, say N.
 
+endchoice
+
 config IRQ_TIME_ACCOUNTING
 	bool "Fine granularity task level IRQ time accounting"
-	depends on HAVE_IRQ_TIME_ACCOUNTING && !NO_HZ_FULL
+	depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
 	help
 	  Select this option to enable fine granularity task irq time
 	  accounting. This is done by reading a timestamp on each
@@ -386,8 +388,6 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
-endchoice
-
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index db82ae1..ca7e33c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -711,14 +711,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	cputime_t delta, steal;
+	cputime_t delta, other;
 
 	delta = jiffies_to_cputime(now - tsk->vtime_snap);
-	steal = steal_account_process_time(delta);
+	other = account_other_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return delta - steal;
+	return delta - other;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -738,16 +738,6 @@ void vtime_account_system(struct task_struct *tsk)
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
-void vtime_gen_account_irq_exit(struct task_struct *tsk)
-{
-	write_seqcount_begin(&tsk->vtime_seqcount);
-	if (vtime_delta(tsk))
-		__vtime_account_system(tsk);
-	if (context_tracking_in_user())
-		tsk->vtime_snap_whence = VTIME_USER;
-	write_seqcount_end(&tsk->vtime_seqcount);
-}
-
 void vtime_account_user(struct task_struct *tsk)
 {
 	cputime_t delta_cpu;

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:timers/nohz] sched/cputime: Clean up the old vtime gen irqtime accounting completely
  2016-07-13 14:50 ` [PATCH 3/5] sched: Complete cleanup of old vtime gen irqtime accounting Frederic Weisbecker
@ 2016-07-14 10:38   ` tip-bot for Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Frederic Weisbecker @ 2016-07-14 10:38 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, efault, pbonzini, peterz, riel, torvalds,
	wanpeng.li, fweisbec, mingo, hpa, tglx, rkrcmar

Commit-ID:  0cfdf9a198b0d4f5ad6c87d894db7830b796b2cc
Gitweb:     http://git.kernel.org/tip/0cfdf9a198b0d4f5ad6c87d894db7830b796b2cc
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Wed, 13 Jul 2016 16:50:03 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 14 Jul 2016 10:42:35 +0200

sched/cputime: Clean up the old vtime gen irqtime accounting completely

Vtime generic irqtime accounting has been removed but there are a few
remnants to clean up:

* The vtime_accounting_cpu_enabled() check in irq entry was only used
  by CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can safely remove it.

* Without the vtime_accounting_cpu_enabled() check, we no longer need
  the vtime_common_account_irq_enter() indirection.

* Move the vtime_account_irq_enter() implementation under
  CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, which is its last user.

* The vtime_account_user() call was only used on irq entry for
  CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can remove that too.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/vtime.h  | 11 -----------
 kernel/sched/cputime.c | 33 ++++++++++-----------------------
 2 files changed, 10 insertions(+), 34 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index d1977d84..65aef5e 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,18 +14,7 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
-
-#ifdef __ARCH_HAS_VTIME_ACCOUNT
 extern void vtime_account_irq_enter(struct task_struct *tsk);
-#else
-extern void vtime_common_account_irq_enter(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk)
-{
-	if (vtime_accounting_cpu_enabled())
-		vtime_common_account_irq_enter(tsk);
-}
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
-
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index ca7e33c..16a873c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -431,6 +431,10 @@ void vtime_common_task_switch(struct task_struct *prev)
 }
 #endif
 
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
+
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Archs that account the whole time spent in the idle task
  * (outside irq) as idle time can rely on this and just implement
@@ -440,33 +444,16 @@ void vtime_common_task_switch(struct task_struct *prev)
  * vtime_account().
  */
 #ifndef __ARCH_HAS_VTIME_ACCOUNT
-void vtime_common_account_irq_enter(struct task_struct *tsk)
+void vtime_account_irq_enter(struct task_struct *tsk)
 {
-	if (!in_interrupt()) {
-		/*
-		 * If we interrupted user, context_tracking_in_user()
-		 * is 1 because the context tracking don't hook
-		 * on irq entry/exit. This way we know if
-		 * we need to flush user time on kernel entry.
-		 */
-		if (context_tracking_in_user()) {
-			vtime_account_user(tsk);
-			return;
-		}
-
-		if (is_idle_task(tsk)) {
-			vtime_account_idle(tsk);
-			return;
-		}
-	}
-	vtime_account_system(tsk);
+	if (!in_interrupt() && is_idle_task(tsk))
+		vtime_account_idle(tsk);
+	else
+		vtime_account_system(tsk);
 }
-EXPORT_SYMBOL_GPL(vtime_common_account_irq_enter);
+EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
-
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	*ut = p->utime;

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:timers/nohz] sched/cputime: Reorganize vtime native irqtime accounting headers
  2016-07-13 14:50 ` [PATCH 4/5] sched: Reorganize vtime native irqtime accounting headers Frederic Weisbecker
@ 2016-07-14 10:38   ` tip-bot for Frederic Weisbecker
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Frederic Weisbecker @ 2016-07-14 10:38 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, wanpeng.li, tglx, linux-kernel, efault, hpa, pbonzini,
	riel, mingo, torvalds, rkrcmar, fweisbec

Commit-ID:  8612f17ab99c1f0770792bc875f5f039212a2a85
Gitweb:     http://git.kernel.org/tip/8612f17ab99c1f0770792bc875f5f039212a2a85
Author:     Frederic Weisbecker <fweisbec@gmail.com>
AuthorDate: Wed, 13 Jul 2016 16:50:04 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 14 Jul 2016 10:42:35 +0200

sched/cputime: Reorganize vtime native irqtime accounting headers

The vtime irqtime accounting headers are very scattered and convoluted
right now. Reorganize them so that it is obvious that only
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE uses them.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/vtime.h | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 65aef5e..aa9bfea 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -12,12 +12,9 @@ struct task_struct;
 /*
  * vtime_accounting_cpu_enabled() definitions/declarations
  */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
-extern void vtime_account_irq_enter(struct task_struct *tsk);
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+#elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
 /*
  * Checks if vtime is enabled on some CPU. Cputime readers want to be careful
  * in that case and compute the tickless cputime.
@@ -38,11 +35,9 @@ static inline bool vtime_accounting_cpu_enabled(void)
 
 	return false;
 }
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 static inline bool vtime_accounting_cpu_enabled(void) { return false; }
-#endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
+#endif
 
 
 /*
@@ -70,14 +65,10 @@ extern void vtime_account_user(struct task_struct *tsk);
 static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_system(struct task_struct *tsk) { }
 static inline void vtime_account_user(struct task_struct *tsk) { }
-static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
-static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
-
 extern void vtime_user_enter(struct task_struct *tsk);
 
 static inline void vtime_user_exit(struct task_struct *tsk)
@@ -88,11 +79,6 @@ extern void vtime_guest_enter(struct task_struct *tsk);
 extern void vtime_guest_exit(struct task_struct *tsk);
 extern void vtime_init_idle(struct task_struct *tsk, int cpu);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_GEN  */
-static inline void vtime_account_irq_exit(struct task_struct *tsk)
-{
-	/* On hard|softirq exit we always account to hard|softirq cputime */
-	vtime_account_system(tsk);
-}
 static inline void vtime_user_enter(struct task_struct *tsk) { }
 static inline void vtime_user_exit(struct task_struct *tsk) { }
 static inline void vtime_guest_enter(struct task_struct *tsk) { }
@@ -100,6 +86,19 @@ static inline void vtime_guest_exit(struct task_struct *tsk) { }
 static inline void vtime_init_idle(struct task_struct *tsk, int cpu) { }
 #endif
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+extern void vtime_account_irq_enter(struct task_struct *tsk);
+static inline void vtime_account_irq_exit(struct task_struct *tsk)
+{
+	/* On hard|softirq exit we always account to hard|softirq cputime */
+	vtime_account_system(tsk);
+}
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
+static inline void vtime_account_irq_enter(struct task_struct *tsk) { }
+static inline void vtime_account_irq_exit(struct task_struct *tsk) { }
+#endif
+
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 extern void irqtime_account_irq(struct task_struct *tsk);
 #else

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [tip:timers/nohz] sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
  2016-07-13 14:50 ` [PATCH 5/5] time: Drop local_irq_save/restore from irqtime_account_irq Frederic Weisbecker
@ 2016-07-14 10:38   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 42+ messages in thread
From: tip-bot for Rik van Riel @ 2016-07-14 10:38 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: fweisbec, rkrcmar, efault, torvalds, peterz, linux-kernel, tglx,
	riel, pbonzini, mingo, wanpeng.li, hpa

Commit-ID:  553bf6bbfd8a540c70aee28eb50e24caff456a03
Gitweb:     http://git.kernel.org/tip/553bf6bbfd8a540c70aee28eb50e24caff456a03
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 13 Jul 2016 16:50:05 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 14 Jul 2016 10:42:35 +0200

sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()

Paolo pointed out that irqs are already blocked when irqtime_account_irq()
is called. That means there is no reason to call local_irq_save/restore()
again.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 16a873c..ea0f6f3 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -49,15 +49,12 @@ DEFINE_PER_CPU(seqcount_t, irq_time_seq);
  */
 void irqtime_account_irq(struct task_struct *curr)
 {
-	unsigned long flags;
 	s64 delta;
 	int cpu;
 
 	if (!sched_clock_irqtime)
 		return;
 
-	local_irq_save(flags);
-
 	cpu = smp_processor_id();
 	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
 	__this_cpu_add(irq_start_time, delta);
@@ -75,7 +72,6 @@ void irqtime_account_irq(struct task_struct *curr)
 		__this_cpu_add(cpu_softirq_time, delta);
 
 	irq_time_write_end();
-	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
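
With the save/restore pair gone, the function now depends on every
caller keeping interrupts disabled. A hypothetical debug guard (not
part of this commit) that would make the precondition explicit at the
top of irqtime_account_irq():

	/*
	 * Hypothetical assertion: callers on the irq_enter()/irq_exit()
	 * paths already run with interrupts off, which is what makes
	 * the local_irq_save/restore pair removable.
	 */
	WARN_ON_ONCE(!irqs_disabled());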
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
  2016-07-14 10:37   ` [tip:timers/nohz] sched/cputime: " tip-bot for Rik van Riel
@ 2016-08-09  3:59   ` Wanpeng Li
  2016-08-09 14:06     ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Wanpeng Li @ 2016-08-09  3:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Rik van Riel, Paolo Bonzini, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith

Hi Rik,
2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> From: Rik van Riel <riel@redhat.com>
>
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
>
> This is inaccurate if only a subset of the time was actually spent
> handling irqs, and could conceivably mis-count all of the ticks during
> a period as irq time, when there was some irq and some softirq time.
>
> This can actually happen when irqtime_account_process_tick is called
> from account_idle_ticks, which can pass a larger number of ticks down
> all at once.
>
> Fix this by changing irqtime_account_hi_update, irqtime_account_si_update,
> and steal_account_process_ticks to work with cputime_t time units, and
> return the amount of time spent in each mode.

Do we need to subtract st cputime from idle cputime in
account_idle_ticks() when noirqtime is true? I tried to add this logic
w/ the noirqtime and idle=poll boot parameters for a full dynticks
guest; however, there is no difference. What am I missing?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09  3:59   ` [PATCH 1/5] sched,time: " Wanpeng Li
@ 2016-08-09 14:06     ` Rik van Riel
  2016-08-09 23:07       ` Wanpeng Li
  2016-08-09 23:25       ` Wanpeng Li
  0 siblings, 2 replies; 42+ messages in thread
From: Rik van Riel @ 2016-08-09 14:06 UTC (permalink / raw)
  To: Wanpeng Li, Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Paolo Bonzini, Peter Zijlstra, Wanpeng Li,
	Thomas Gleixner, Radim Krcmar, Mike Galbraith

On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
> Hi Rik,
> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> > From: Rik van Riel <riel@redhat.com>
> > 
> > Currently, if there was any irq or softirq time during 'ticks'
> > jiffies, the entire period will be accounted as irq or softirq
> > time.
> > 
> > This is inaccurate if only a subset of the time was actually spent
> > handling irqs, and could conceivably mis-count all of the ticks
> > during
> > a period as irq time, when there was some irq and some softirq
> > time.
> > 
> > This can actually happen when irqtime_account_process_tick is
> > called
> > from account_idle_ticks, which can pass a larger number of ticks
> > down
> > all at once.
> > 
> > Fix this by changing irqtime_account_hi_update,
> > irqtime_account_si_update,
> > and steal_account_process_ticks to work with cputime_t time units,
> > and
> > return the amount of time spent in each mode.
> 
> Do we need to minus st cputime from idle cputime in
> account_idle_ticks() when noirqtime is true? I try to add this logic
> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
> however, there is no difference, where I miss?

Yes, you are right. The code in account_idle_ticks()
could use the same treatment.

I am not sure why it would not work, though...

-- 

All Rights Reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 14:06     ` Rik van Riel
@ 2016-08-09 23:07       ` Wanpeng Li
  2016-08-10  7:51         ` Wanpeng Li
  2016-08-09 23:25       ` Wanpeng Li
  1 sibling, 1 reply; 42+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> Hi Rik,
>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>> > From: Rik van Riel <riel@redhat.com>
>> >
>> > Currently, if there was any irq or softirq time during 'ticks'
>> > jiffies, the entire period will be accounted as irq or softirq
>> > time.
>> >
>> > This is inaccurate if only a subset of the time was actually spent
>> > handling irqs, and could conceivably mis-count all of the ticks
>> > during
>> > a period as irq time, when there was some irq and some softirq
>> > time.
>> >
>> > This can actually happen when irqtime_account_process_tick is
>> > called
>> > from account_idle_ticks, which can pass a larger number of ticks
>> > down
>> > all at once.
>> >
>> > Fix this by changing irqtime_account_hi_update,
>> > irqtime_account_si_update,
>> > and steal_account_process_ticks to work with cputime_t time units,
>> > and
>> > return the amount of time spent in each mode.
>>
>> Do we need to minus st cputime from idle cputime in
>> account_idle_ticks() when noirqtime is true? I try to add this logic
>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>> however, there is no difference, where I miss?
>
> Yes, you are right. The code in account_idle_ticks()
> could use the same treatment.
>
> I am not sure why it would not work, though...

I will try a nohz idle kvm guest and some more tests; a patch will be
sent out once successful.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 14:06     ` Rik van Riel
  2016-08-09 23:07       ` Wanpeng Li
@ 2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
                           ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> Hi Rik,
>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>> > From: Rik van Riel <riel@redhat.com>
>> >
>> > Currently, if there was any irq or softirq time during 'ticks'
>> > jiffies, the entire period will be accounted as irq or softirq
>> > time.
>> >
>> > This is inaccurate if only a subset of the time was actually spent
>> > handling irqs, and could conceivably mis-count all of the ticks
>> > during
>> > a period as irq time, when there was some irq and some softirq
>> > time.
>> >
>> > This can actually happen when irqtime_account_process_tick is
>> > called
>> > from account_idle_ticks, which can pass a larger number of ticks
>> > down
>> > all at once.
>> >
>> > Fix this by changing irqtime_account_hi_update,
>> > irqtime_account_si_update,
>> > and steal_account_process_ticks to work with cputime_t time units,
>> > and
>> > return the amount of time spent in each mode.
>>
>> Do we need to minus st cputime from idle cputime in
>> account_idle_ticks() when noirqtime is true? I try to add this logic
>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>> however, there is no difference, where I miss?
>
> Yes, you are right. The code in account_idle_ticks()
> could use the same treatment.
>
> I am not sure why it would not work, though...

Actually I observed a regression caused by this patch. I use an i5
laptop: 4 pCPUs, 4 vCPUs for one full dynticks guest, with four cpu
hog processes (for loops) running in the guest. I hot-unplug the
pCPUs on the host one by one until there is only one left, then
observe top in the guest: there is 100% st for cpu0 (housekeeping)
and 75% st for the other cpus (nohz full). However, w/o this patch,
it is 75% for all four cpus.

I have been trying to figure this out recently; any tip is greatly
appreciated. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
@ 2016-08-09 23:31         ` Wanpeng Li
  2016-08-09 23:35         ` Wanpeng Li
  2016-08-09 23:39         ` Wanpeng Li
  2 siblings, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use a i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
> cpu hog processes(for loop) running in the guest, I hot-unplug the
> pCPUs on host one by one until there is only one left, then observe
> the top in guest, there are 100% st for cpu0(housekeeping), and 75% st
> for other cpus(nohz full). However, w/o this patch, 75% for all the
> four cpus.
>
> I try to figure out this recently, any tip is a great appreciated. :)
>
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
@ 2016-08-09 23:35         ` Wanpeng Li
  2016-08-09 23:39         ` Wanpeng Li
  2 siblings, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use a i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
> cpu hog processes(for loop) running in the guest, I hot-unplug the
> pCPUs on host one by one until there is only one left, then observe
> the top in guest, there are 100% st for cpu0(housekeeping), and 75% st
> for other cpus(nohz full). However, w/o this patch, 75% for all the
> four cpus.
>
> I try to figure out this recently, any tip is a great appreciated. :)
>
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:25       ` Wanpeng Li
  2016-08-09 23:31         ` Wanpeng Li
  2016-08-09 23:35         ` Wanpeng Li
@ 2016-08-09 23:39         ` Wanpeng Li
  2016-08-10  5:07           ` Rik van Riel
  2016-08-10 16:52           ` [PATCH] time,virt: resync steal time when guest & host lose sync Rik van Riel
  2 siblings, 2 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-09 23:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> Actually I observed a regression caused by this patch. I use a i5

The regression is caused by your commit "sched,time: Count actually
elapsed irq & softirq time".

> laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
> cpu hog processes(for loop) running in the guest, I hot-unplug the
> pCPUs on host one by one until there is only one left, then observe
> the top in guest, there are 100% st for cpu0(housekeeping), and 75% st
> for other cpus(nohz full). However, w/o this patch, 75% for all the
> four cpus.
>
> I try to figure out this recently, any tip is a great appreciated. :)
>
> Regards,
> Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:39         ` Wanpeng Li
@ 2016-08-10  5:07           ` Rik van Riel
  2016-08-10  6:33             ` Wanpeng Li
  2016-08-10 16:52           ` [PATCH] time,virt: resync steal time when guest & host lose sync Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2016-08-10  5:07 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

On Wed, 2016-08-10 at 07:39 +0800, Wanpeng Li wrote:
> 2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> > 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > > On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
> > > > Hi Rik,
> > > > 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.
> > > > com>:
> > > > > From: Rik van Riel <riel@redhat.com>
> > > > > 
> > > > > Currently, if there was any irq or softirq time during
> > > > > 'ticks'
> > > > > jiffies, the entire period will be accounted as irq or
> > > > > softirq
> > > > > time.
> > > > > 
> > > > > This is inaccurate if only a subset of the time was actually
> > > > > spent
> > > > > handling irqs, and could conceivably mis-count all of the
> > > > > ticks
> > > > > during
> > > > > a period as irq time, when there was some irq and some
> > > > > softirq
> > > > > time.
> > > > > 
> > > > > This can actually happen when irqtime_account_process_tick is
> > > > > called
> > > > > from account_idle_ticks, which can pass a larger number of
> > > > > ticks
> > > > > down
> > > > > all at once.
> > > > > 
> > > > > Fix this by changing irqtime_account_hi_update,
> > > > > irqtime_account_si_update,
> > > > > and steal_account_process_ticks to work with cputime_t time
> > > > > units,
> > > > > and
> > > > > return the amount of time spent in each mode.
> > > > 
> > > > Do we need to minus st cputime from idle cputime in
> > > > account_idle_ticks() when noirqtime is true? I try to add this
> > > > logic
> > > > w/ noirqtime and idle=poll boot parameter for a full dynticks
> > > > guest,
> > > > however, there is no difference, where I miss?
> > > 
> > > Yes, you are right. The code in account_idle_ticks()
> > > could use the same treatment.
> > > 
> > > I am not sure why it would not work, though...
> > 
> > Actually I observed a regression caused by this patch. I use a i5
> 
> The regression is caused by your commit "sched,time: Count actually
> elapsed irq & softirq time".

Wanpeng and I discussed this issue, and discovered
that this bug is triggered by my patch, specifically
this bit:

-       if (steal_account_process_tick(ULONG_MAX))
+       other = account_other_time(cputime);
+       if (other >= cputime)
                return;

Replacing "cputime" with "ULONG_MAX" as the argument
to account_other_time makes the bug disappear.

However, this is not the cause of the bug.

The cause of the bug appears to be that the values
used to figure out how much steal time has passed
are never initialized.

                steal = paravirt_steal_clock(smp_processor_id());
                steal -= this_rq()->prev_steal_time;

The first of the two may be initialized by the host
(I did not verify that), but the second one does not
have any explicit initializers anywhere in the kernel
tree.

This can lead to an arbitrarily large difference between
paravirt_steal_clock(smp_processor_id()) and
this_rq()->prev_steal_time, which results in nothing but
steal time getting accounted for a potentially very
long time.
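
Sketched with simplified locals (illustrative, not the exact kernel
code), the per-tick effect of an out-of-sync prev_steal_time is:

	/* in account_process_tick(), cputime == one jiffy: */
	u64 steal = paravirt_steal_clock(cpu) - this_rq()->prev_steal_time;
						/* arbitrarily large */
	cputime_t other = min(nsecs_to_cputime(steal), cputime);
						/* clamped to one tick */
	if (other >= cputime)	/* true on every tick, so user/system/idle */
		return;		/* see nothing until the backlog drains  */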

Previously we carried this patch to initialize the
various rq->prev_* values at CPU hotplug time:

https://patchwork.codeaurora.org/patch/27699/

Which got reverted by Paolo here:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sche
d/core&id=3d89e5478bf550a50c99e93adf659369798263b0

Which leads me to this question:

Paolo, how would you like us to fix this bug?

It seems like the host and guest steal time CAN get out
of sync, sometimes by a ridiculous amount, and we need
some way to get the excessive difference out of the way,
without it getting accounted as steal time (not immediately,
and not over the next 17 hours, or months).

-- 

All Rights Reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-10  5:07           ` Rik van Riel
@ 2016-08-10  6:33             ` Wanpeng Li
  0 siblings, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-10  6:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 13:07 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Wed, 2016-08-10 at 07:39 +0800, Wanpeng Li wrote:
>> 2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>> > 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>> > > > Hi Rik,
>> > > > 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.
>> > > > com>:
>> > > > > From: Rik van Riel <riel@redhat.com>
>> > > > >
>> > > > > Currently, if there was any irq or softirq time during
>> > > > > 'ticks'
>> > > > > jiffies, the entire period will be accounted as irq or
>> > > > > softirq
>> > > > > time.
>> > > > >
>> > > > > This is inaccurate if only a subset of the time was actually
>> > > > > spent
>> > > > > handling irqs, and could conceivably mis-count all of the
>> > > > > ticks
>> > > > > during
>> > > > > a period as irq time, when there was some irq and some
>> > > > > softirq
>> > > > > time.
>> > > > >
>> > > > > This can actually happen when irqtime_account_process_tick is
>> > > > > called
>> > > > > from account_idle_ticks, which can pass a larger number of
>> > > > > ticks
>> > > > > down
>> > > > > all at once.
>> > > > >
>> > > > > Fix this by changing irqtime_account_hi_update,
>> > > > > irqtime_account_si_update,
>> > > > > and steal_account_process_ticks to work with cputime_t time
>> > > > > units,
>> > > > > and
>> > > > > return the amount of time spent in each mode.
>> > > >
>> > > > Do we need to minus st cputime from idle cputime in
>> > > > account_idle_ticks() when noirqtime is true? I try to add this
>> > > > logic
>> > > > w/ noirqtime and idle=poll boot parameter for a full dynticks
>> > > > guest,
>> > > > however, there is no difference, where I miss?
>> > >
>> > > Yes, you are right. The code in account_idle_ticks()
>> > > could use the same treatment.
>> > >
>> > > I am not sure why it would not work, though...
>> >
>> > Actually I observed a regression caused by this patch. I use a i5
>>
>> The regression is caused by your commit "sched,time: Count actually
>> elapsed irq & softirq time".
>
> Wanpeng and I discussed this issue, and discovered
> that this bug is triggered by my patch, specifically
> this bit:
>
> -       if (steal_account_process_tick(ULONG_MAX))
> +       other = account_other_time(cputime);
> +       if (other >= cputime)
>                 return;
>
> Replacing "cputime" with "ULONG_MAX" as the argument
> to account_other_time makes the bug disappear.
>
> However, this is not the cause of the bug.
>
> The cause of the bug appears to be that the values
> used to figure out how much steal time has passed
> are never initialized.
>
>                 steal = paravirt_steal_clock(smp_processor_id());
>                 steal -= this_rq()->prev_steal_time;
>
> The first of the two may be initialized by the host
> (I did not verify that), but the second one does not
> have any explicit initializers anywhere in the kernel
> tree.
>
> This can lead to an arbitrarily large difference between
> paravirt_steal_clock(smp_processor_id()) and
> this_rq()->prev_steal_time, which results in nothing but
> steal time getting accounted for a potentially a very
> long amount of time.
>
> Previously we carried this patch to initialize the
> various rq->prev_* values at CPU hotplug time:
>
> https://patchwork.codeaurora.org/patch/27699/
>
> Which got reverted by Paolo here:
>
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sche
> d/core&id=3d89e5478bf550a50c99e93adf659369798263b0
>
> Which leads me to this question:
>
> Paolo, how would you like us to fix this bug?
>
> It seems like the host and guest steal time CAN get out
> of sync, sometimes by a ridiculous amount, and we need
> some way to get the excessive difference out of the way,
> without it getting accounted as steal time (not immediately,
> and not over the next 17 hours, or months).

I can be the volunteer to fix it if you guys have an idea. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time
  2016-08-09 23:07       ` Wanpeng Li
@ 2016-08-10  7:51         ` Wanpeng Li
  0 siblings, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-10  7:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-10 7:07 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
>>> Hi Rik,
>>> 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
>>> > From: Rik van Riel <riel@redhat.com>
>>> >
>>> > Currently, if there was any irq or softirq time during 'ticks'
>>> > jiffies, the entire period will be accounted as irq or softirq
>>> > time.
>>> >
>>> > This is inaccurate if only a subset of the time was actually spent
>>> > handling irqs, and could conceivably mis-count all of the ticks
>>> > during
>>> > a period as irq time, when there was some irq and some softirq
>>> > time.
>>> >
>>> > This can actually happen when irqtime_account_process_tick is
>>> > called
>>> > from account_idle_ticks, which can pass a larger number of ticks
>>> > down
>>> > all at once.
>>> >
>>> > Fix this by changing irqtime_account_hi_update,
>>> > irqtime_account_si_update,
>>> > and steal_account_process_ticks to work with cputime_t time units,
>>> > and
>>> > return the amount of time spent in each mode.
>>>
>>> Do we need to minus st cputime from idle cputime in
>>> account_idle_ticks() when noirqtime is true? I try to add this logic
>>> w/ noirqtime and idle=poll boot parameter for a full dynticks guest,
>>> however, there is no difference, where I miss?
>>
>> Yes, you are right. The code in account_idle_ticks()
>> could use the same treatment.
>>
>> I am not sure why it would not work, though...
>
> I will try nohz idle kvm guest and other more tests, a patch will be
> sent out once successful.

After applying the same logic to account_idle_ticks() for a nohz idle
kvm guest (noirqtime, idle=poll, one pCPU and four vCPUs), the average
idle drops from 56.8% to 54.75%; I think it makes sense to make a
formal patch.
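
For reference, a sketch of what that treatment could look like,
mirroring account_process_tick() (an assumption of what the formal
patch would do, not the patch itself):

	void account_idle_ticks(unsigned long ticks)
	{
		cputime_t cputime, steal;

		if (sched_clock_irqtime) {
			irqtime_account_idle_ticks(ticks);
			return;
		}

		cputime = jiffies_to_cputime(ticks);
		steal = steal_account_process_time(cputime);

		if (steal >= cputime)
			return;

		cputime -= steal;
		account_idle_time(cputime);
	}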

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-09 23:39         ` Wanpeng Li
  2016-08-10  5:07           ` Rik van Riel
@ 2016-08-10 16:52           ` Rik van Riel
  2016-08-11 10:11             ` Wanpeng Li
                               ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Rik van Riel @ 2016-08-10 16:52 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

On Wed, 10 Aug 2016 07:39:08 +0800
Wanpeng Li <kernellwp@gmail.com> wrote:

> The regression is caused by your commit "sched,time: Count actually
> elapsed irq & softirq time".

Wanpeng, does this patch fix your issue?

Paolo, what is your opinion on this issue?

I can think of all kinds of ways in which guest and host might lose
sync with steal time, from uninitialized values at boot, to guest
pause, followed by save to disk, and reload, to live migration, to...

---8<---

Subject: time,virt: resync steal time when guest & host lose sync

When guest and host wildly disagree on steal time, a guest can
do several things:
1) Quickly account all the steal time at once (the kernel did this before
   57430218317e ("sched/cputime: Count actually elapsed irq & softirq time"),
   when steal_account_process_ticks got ULONG_MAX as its maximum value).
2) Stay out of sync for an indeterminate amount of time. This is what the
   system does today.
3) Sync up the guest value to the host-provided value, without accounting
   an absurdly large value in the cpu time statistics.

This patch makes the kernel do (3), which seems like the right thing
to do.

The exact value of the threshold used probably does not matter too much,
as long as it is long enough to cover all the timer ticks that passed
during an idle period, because (irqtime_)account_idle_ticks can process
a large amount of time all at once.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Wanpeng Li <kernellwp@gmail.com>
---
 kernel/sched/cputime.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 1934f658c036..c18f9e717af6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -273,7 +273,17 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
 
-		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		steal_cputime = nsecs_to_cputime(steal);
+		if (steal_cputime > 32 * maxtime) {
+			/*
+			 * Guest and host steal time values are way out of
+			 * sync. Sync up the guest steal time with the host.
+			 */
+			this_rq()->prev_steal_time +=
+					cputime_to_nsecs(steal_cputime);
+			return 0;
+		}
+		steal_cputime = min(steal_cputime, maxtime);
 		account_steal_time(steal_cputime);
 		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-10 16:52           ` [PATCH] time,virt: resync steal time when guest & host lose sync Rik van Riel
@ 2016-08-11 10:11             ` Wanpeng Li
  2016-08-12  2:44               ` Rik van Riel
  2016-08-12 16:33             ` Paolo Bonzini
  2016-08-13  8:42             ` Ingo Molnar
  2 siblings, 1 reply; 42+ messages in thread
From: Wanpeng Li @ 2016-08-11 10:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-11 0:52 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Wed, 10 Aug 2016 07:39:08 +0800
> Wanpeng Li <kernellwp@gmail.com> wrote:
>
>> The regression is caused by your commit "sched,time: Count actually
>> elapsed irq & softirq time".
>
> Wanpeng, does this patch fix your issue?

I tested this against a kvm guest (nohz_full, four vCPUs running on one
pCPU, four cpuhog processes running on the four vCPUs).
Before this fix patch:
vCPU0's st is 100%, the other vCPUs' st are ~75%.
After this fix patch:
all vCPUs' st are ~85%.
However, w/o commit "sched,time: Count actually elapsed irq & softirq
time", all vCPUs' st are ~75%.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-11 10:11             ` Wanpeng Li
@ 2016-08-12  2:44               ` Rik van Riel
  2016-08-12  7:09                 ` Wanpeng Li
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2016-08-12  2:44 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

On Thu, 2016-08-11 at 18:11 +0800, Wanpeng Li wrote:
> 2016-08-11 0:52 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > On Wed, 10 Aug 2016 07:39:08 +0800
> > Wanpeng Li <kernellwp@gmail.com> wrote:
> > 
> > > The regression is caused by your commit "sched,time: Count
> > > actually
> > > elapsed irq & softirq time".
> > 
> > Wanpeng, does this patch fix your issue?
> 
> I test this against kvm guest (nohz_full, four vCPUs running on one
> pCPU, four cpuhog processes running on four vCPUs).
> before this fix patch:
> vCPU0's st is 100%, other vCPUs' st are ~75%.
> after this fix patch:
> all vCPUs' st are ~85%.
> However, w/o commit "sched,time: Count actually elapsed irq & softirq
> time", all vCPUs' st are ~75%.

If you pass ULONG_MAX as the maxtime argument to
steal_account_process_time(), does the steal time
get accounted properly at 75%?

If that is the case, I have a hypothesis:
1) The guest is running so much slower when sharing
   a CPU 4 ways, that it is accounting only ~90% of
   wall clock time as CPU time, due to missing the
   other 10% or so of clock ticks.
2) account_process_tick() only ever processes one tick
   at a time - if it gets called only 90x a second for
   a 100Hz guest, but all the steal time recorded by
   the host is fully accounted (ULONG_MAX limit), then
   that could make up for lost/skipped timer ticks.
3) not accounting "extra" steal time (beyond the amount
   of time accounted by account_process_tick) would reduce
   the total amount of time that gets accounted if there
   are missed ticks, taking time away from user/system/etc

Does the above make sense?

Am I overlooking some mechanism through which lost/skipped
ticks are made up for in the kernel?  I looked through the
code in kernel/time/ briefly, but did not spot it...
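
A back-of-the-envelope version of the above, using the numbers from
this thread (assumed, not precisely measured): with a 100Hz guest
sharing one pCPU four ways, the host measures roughly 0.75s of steal
per wall-clock second while the guest runs for roughly 0.25s. If only
~90 of the 100 ticks are delivered, tick-driven accounting covers only
~0.9s of wall time. Letting steal_account_process_time() account the
full host-reported steal (the old ULONG_MAX behavior) tops the total
back up toward 1.0s, whereas clamping steal to the ticks actually seen
leaves the missing ~0.1s unaccounted and skews the st percentage.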

-- 

All Rights Reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-12  2:44               ` Rik van Riel
@ 2016-08-12  7:09                 ` Wanpeng Li
  2016-08-12 15:58                   ` Rik van Riel
  0 siblings, 1 reply; 42+ messages in thread
From: Wanpeng Li @ 2016-08-12  7:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-12 10:44 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Thu, 2016-08-11 at 18:11 +0800, Wanpeng Li wrote:
>> 2016-08-11 0:52 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > On Wed, 10 Aug 2016 07:39:08 +0800
>> > Wanpeng Li <kernellwp@gmail.com> wrote:
>> >
>> > > The regression is caused by your commit "sched,time: Count
>> > > actually
>> > > elapsed irq & softirq time".
>> >
>> > Wanpeng, does this patch fix your issue?
>>
>> I test this against kvm guest (nohz_full, four vCPUs running on one
>> pCPU, four cpuhog processes running on four vCPUs).
>> before this fix patch:
>> vCPU0's st is 100%, other vCPUs' st are ~75%.
>> after this fix patch:
>> all vCPUs' st are ~85%.
>> However, w/o commit "sched,time: Count actually elapsed irq & softirq
>> time", all vCPUs' st are ~75%.
>
> If you pass ULONG_MAX as the maxtime argument to
> steal_account_process_time(), does the steal time
> get accounted properly at 75%?

Yes.

>
> If that is the case, I have a hypothesis:
> 1) The guest is running so much slower when sharing
>    a CPU 4 ways, that it is accounting only ~90% of
>    wall clock time as CPU time, due to missing the
>    other 10% or so of clock ticks.
> 2) account_process_tick() only ever processes one tick
>    at a time - if it gets called only 90x a second for
>    a 100Hz guest, but all the steal time recorded by
>    the host is fully accounted (ULONG_MAX limit), then
>    that could make up for lost/skipped timer ticks.
> 3) not accounting "extra" steal time (beyond the amount
>    of time accounted by account_process_tick) would reduce
>    the total amount of time that gets accounted if there
>    are missed ticks, taking time away from user/system/etc
>
> Does the above make sense?
>
> Am I overlooking some mechanism through which lost/skipped
> ticks are made up for in the kernel?  I looked through the
> code in kernel/time/ briefly, but did not spot it...
>
> --
>
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-12  7:09                 ` Wanpeng Li
@ 2016-08-12 15:58                   ` Rik van Riel
  2016-08-13 15:36                     ` Frederic Weisbecker
  2016-08-15  8:53                     ` Wanpeng Li
  0 siblings, 2 replies; 42+ messages in thread
From: Rik van Riel @ 2016-08-12 15:58 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

On Fri, 12 Aug 2016 15:09:00 +0800
Wanpeng Li <kernellwp@gmail.com> wrote:
> 2016-08-12 10:44 GMT+08:00 Rik van Riel <riel@redhat.com>:

> > If you pass ULONG_MAX as the maxtime argument to
> > steal_account_process_time(), does the steal time
> > get accounted properly at 75%?  
> 
> Yes.

I talked with Paolo this morning, and it turns out that if a guest
misses several timer ticks in a row, they will simply get lost.

That means the functions calling steal_account_process_time may not
know how much CPU time has passed since the last time they were called,
but steal_account_process_time will get a good idea of how much time
the host spent running something else.

Removing the limit, and documenting why, seems like the right way to
fix this bug.

Wanpeng, does the patch below work for you?

Everybody else, does this patch look acceptable?

---8<---
Subject: time,virt: do not limit steal_account_process_time

When a guest is interrupted for a longer amount of time, missed clock
ticks are not redelivered later. Because of that, we should not limit
the amount of steal time accounted to the amount of time that the
calling functions think has passed.

Instead, simply let steal_account_process_time account however much
steal time the host told us elapsed. This can make up timer ticks
that were missed when the host scheduled somebody else.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Wanpeng Li <kernellwp@gmail.com>
---
 kernel/sched/cputime.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 1934f658c036..6f15274940fb 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -263,7 +263,12 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
+/*
+ * When a guest is interrupted for a longer amount of time, missed clock
+ * ticks are not redelivered later. Due to that, this function may on
+ * occasion account more time than the calling functions think elapsed.
+ */
+static __always_inline cputime_t steal_account_process_time(void)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
@@ -273,7 +278,7 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 		steal = paravirt_steal_clock(smp_processor_id());
 		steal -= this_rq()->prev_steal_time;
 
-		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		steal_cputime = nsecs_to_cputime(steal);
 		account_steal_time(steal_cputime);
 		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
 
@@ -290,7 +295,7 @@ static inline cputime_t account_other_time(cputime_t max)
 {
 	cputime_t accounted;
 
-	accounted = steal_account_process_time(max);
+	accounted = steal_account_process_time();
 
 	if (accounted < max)
 		accounted += irqtime_account_hi_update(max - accounted);
@@ -486,7 +491,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	}
 
 	cputime = cputime_one_jiffy;
-	steal = steal_account_process_time(cputime);
+	steal = steal_account_process_time();
 
 	if (steal >= cputime)
 		return;

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-10 16:52           ` [PATCH] time,virt: resync steal time when guest & host lose sync Rik van Riel
  2016-08-11 10:11             ` Wanpeng Li
@ 2016-08-12 16:33             ` Paolo Bonzini
  2016-08-12 17:23               ` Rik van Riel
  2016-08-13  8:42             ` Ingo Molnar
  2 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2016-08-12 16:33 UTC (permalink / raw)
  To: Rik van Riel, Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith



On 10/08/2016 18:52, Rik van Riel wrote:
> Paolo, what is your opinion on this issue?
> 
> I can think of all kinds of ways in which guest and host might lose
> sync with steal time, from uninitialized values at boot, to guest
> pause, followed by save to disk, and reload, to live migration, to...

Guest and host _cannot_ lose sync because there is only one copy of the
values.  When the host wants to update the steal time value it just
reads the old value and writes the new value.  There cannot be a guest
pause, save to disk, live migration or whatever between these two steps
(and uninitialized values at boot are not how percpu values work).

Your hypothesis of lost ticks makes the most sense to me, and then
changing the argument to ULONG_MAX is the right thing to do.

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-12 16:33             ` Paolo Bonzini
@ 2016-08-12 17:23               ` Rik van Riel
  2016-08-13  7:18                 ` Paolo Bonzini
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2016-08-12 17:23 UTC (permalink / raw)
  To: Paolo Bonzini, Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith

On Fri, 2016-08-12 at 18:33 +0200, Paolo Bonzini wrote:
> 
> On 10/08/2016 18:52, Rik van Riel wrote:
> > Paolo, what is your opinion on this issue?
> > 
> > I can think of all kinds of ways in which guest and host might lose
> > sync with steal time, from uninitialized values at boot, to guest
> > pause, followed by save to disk, and reload, to live migration,
> > to...
> 
> Guest and host _cannot_ lose sync because there is only one copy of
> the
> values.  When the host wants to update the steal time value it just
> reads the old value and writes the new value.  There cannot be a
> guest
> pause, save to disk, live migration or whatever between these two
> steps
> (and uninitialized values at boot are not how percpu values work).

There is one copy of paravirt_steal_clock(smp_processor_id()),
but what keeps it in sync with this_rq()->prev_steal_time?

Is it something simple like them both being zeroed out when
the structures are first allocated at boot time?

> Your hypothesis of lost ticks makes the most sense to me, and then
> changing the argument to ULONG_MAX is the right thing to do.

I sent out a patch that just removes the parameter instead,
and documents why steal_account_process_time can encounter
more elapsed time than the calling functions expected.

-- 

All Rights Reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-12 17:23               ` Rik van Riel
@ 2016-08-13  7:18                 ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2016-08-13  7:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wanpeng Li, Frederic Weisbecker, Ingo Molnar, LKML,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

> There is one copy of paravirt_steal_clock(smp_processor_id()),
> but what keeps it in sync with this_rq()->prev_steal_time?
> 
> Is it something simple like them both being zeroed out when
> the structures are first allocated at boot time?

Yes, more precisely both of them being equal when the MSR is
written to.  They are just memory locations so they remain in
sync across pause, migration and the like, and prev_steal_time
is only ever updated with a previous value of paravirt_steal_clock().
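
Expressed as a sketch (illustrative, not the exact KVM code), the
invariant is:

	/*
	 * When the guest writes the steal-time MSR for a vCPU, both
	 * sides observe the same base value, so the delta below only
	 * measures steal accumulated since then; it cannot jump
	 * arbitrarily across pause, save/restore or migration.
	 */
	u64 now   = paravirt_steal_clock(cpu);	 /* host-maintained */
	u64 prev  = this_rq()->prev_steal_time;	 /* same base at MSR write */
	u64 delta = now - prev;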

> > Your hypothesis of lost ticks makes the most sense to me, and then
> > changing the argument to ULONG_MAX is the right thing to do.
> 
> I sent out a patch that just removes the parameter instead,
> and documents why steal_account_process_time can encounter
> more elapsed time than the calling functions expected.

Good, thanks!

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-10 16:52           ` [PATCH] time,virt: resync steal time when guest & host lose sync Rik van Riel
  2016-08-11 10:11             ` Wanpeng Li
  2016-08-12 16:33             ` Paolo Bonzini
@ 2016-08-13  8:42             ` Ingo Molnar
  2016-08-14  1:50               ` Rik van Riel
  2016-08-18  8:23               ` Wanpeng Li
  2 siblings, 2 replies; 42+ messages in thread
From: Ingo Molnar @ 2016-08-13  8:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wanpeng Li, Frederic Weisbecker, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith


* Rik van Riel <riel@redhat.com> wrote:

> On Wed, 10 Aug 2016 07:39:08 +0800
> Wanpeng Li <kernellwp@gmail.com> wrote:
> 
> > The regression is caused by your commit "sched,time: Count actually
> > elapsed irq & softirq time".
> 
> Wanpeng, does this patch fix your issue?
> 
> Paolo, what is your opinion on this issue?
> 
> I can think of all kinds of ways in which guest and host might lose
> sync with steal time, from uninitialized values at boot, to guest
> pause, followed by save to disk, and reload, to live migration, to...
> 
> ---8<---
> 
> Subject: time,virt: resync steal time when guest & host lose sync
> 
> When guest and host wildly disagree on steal time, a guest can
> do several things:
> 1) Quickly account all the steal time at once (the kernel did this before
>    57430218317e ("sched/cputime: Count actually elapsed irq & softirq time"),
>    when steal_account_process_ticks got ULONG_MAX as its maximum value.
> 2) Stay out of sync for an indeterminate amount of time. This is what the
>    system does today.
> 3) Sync up the guest value to the host-provided value, without accounting
>    an absurdly large value in the cpu time statistics.
> 
> This patch makes the kernel do (3), which seems like the right thing
> to do.
> 
> The exact value of the threshold use probably does not matter too much,
> as long as it is long enough to cover all the timer ticks that passed
> during an idle period, because (irqtime_)account_idle_ticks can process
> a large amount of time all at once.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Wanpeng Li <kernellwp@gmail.com>
> ---
>  kernel/sched/cputime.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)

fails to build on x86 allnoconfig:

  kernel/sched/cputime.c:524:10: error: too many arguments to function ‘steal_account_process_time’

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-12 15:58                   ` Rik van Riel
@ 2016-08-13 15:36                     ` Frederic Weisbecker
  2016-08-15  8:53                     ` Wanpeng Li
  1 sibling, 0 replies; 42+ messages in thread
From: Frederic Weisbecker @ 2016-08-13 15:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Wanpeng Li, Ingo Molnar, LKML, Paolo Bonzini, Peter Zijlstra,
	Wanpeng Li, Thomas Gleixner, Radim Krcmar, Mike Galbraith

On Fri, Aug 12, 2016 at 11:58:03AM -0400, Rik van Riel wrote:
> On Fri, 12 Aug 2016 15:09:00 +0800
> Wanpeng Li <kernellwp@gmail.com> wrote:
> > 2016-08-12 10:44 GMT+08:00 Rik van Riel <riel@redhat.com>:
> 
> > > If you pass ULONG_MAX as the maxtime argument to
> > > steal_account_process_time(), does the steal time
> > > get accounted properly at 75%?  
> > 
> > Yes.
> 
> I talked with Paolo this morning, and it turns out that if a guest
> misses several timer ticks in a row, they will simply get lost.
> 
> That means the functions calling steal_account_process_time may not
> know how much CPU time has passed since the last time it was called,
> but steal_account_process_time will get a good idea on how much time
> the host spent running something else.
> 
> Removing the limit, and documenting why, seems like the right way to
> fix this bug.
> 
> Wanpeng, does the patch below work for you?
> 
> Everybody else, does this patch look acceptable?
> 
> ---8<---
> Subject: time,virt: do not limit steal_account_process_time
> 
> When a guest is interrupted for a longer amount of time, missed clock
> ticks are not redelivered later. Because of that, we should not limit
> the amount of steal time accounted to the amount of time that the
> calling functions think have passed.
> 
> Instead, simply let steal_account_process_time account however much
> steal time the host told us elapsed. This can make up timer ticks
> that were missed when the host scheduled somebody else.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Wanpeng Li <kernellwp@gmail.com>

I much prefer this version. After all, even if the time spent in the
host is very large, it's still stolen time and we want to account it.

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-13  8:42             ` Ingo Molnar
@ 2016-08-14  1:50               ` Rik van Riel
  2016-08-18  8:23               ` Wanpeng Li
  1 sibling, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2016-08-14  1:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wanpeng Li, Frederic Weisbecker, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

On Sat, 2016-08-13 at 10:42 +0200, Ingo Molnar wrote:
> * Rik van Riel <riel@redhat.com> wrote:
> 
> > On Wed, 10 Aug 2016 07:39:08 +0800
> > Wanpeng Li <kernellwp@gmail.com> wrote:
> > 
> > > The regression is caused by your commit "sched,time: Count
> > > actually
> > > elapsed irq & softirq time".
> > 
> > Wanpeng, does this patch fix your issue?
> > 
> > Paolo, what is your opinion on this issue?
> > 
> > I can think of all kinds of ways in which guest and host might lose
> > sync with steal time, from uninitialized values at boot, to guest
> > pause, followed by save to disk, and reload, to live migration,
> > to...
> > 
> > ---8<---
> > 
> > Subject: time,virt: resync steal time when guest & host lose sync
> > 
> > When guest and host wildly disagree on steal time, a guest can
> > do several things:
> > 1) Quickly account all the steal time at once (the kernel did this
> > before
> >    57430218317e ("sched/cputime: Count actually elapsed irq &
> > softirq time"),
> >    when steal_account_process_ticks got ULONG_MAX as its maximum
> > value.
> > 2) Stay out of sync for an indeterminate amount of time. This is
> > what the
> >    system does today.
> > 3) Sync up the guest value to the host-provided value, without
> > accounting
> >    an absurdly large value in the cpu time statistics.
> > 
> > This patch makes the kernel do (3), which seems like the right
> > thing
> > to do.
> > 
> > The exact value of the threshold use probably does not matter too
> > much,
> > as long as it is long enough to cover all the timer ticks that
> > passed
> > during an idle period, because (irqtime_)account_idle_ticks can
> > process
> > a large amount of time all at once.
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > Reported-by: Wanpeng Li <kernellwp@gmail.com>
> > ---
> >  kernel/sched/cputime.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> fails to build on x86 allnoconfig:
> 
>   kernel/sched/cputime.c:524:10: error: too many arguments to
> function ‘steal_account_process_time’

Which patch did you apply?  The subject and comment
of the email suggest you tried applying the one
Paolo and Frederic objected to.

The compile error suggests you applied the patch with the
subject "time,virt: do not limit steal_account_process_time"

In that case, did you apply Wanpeng's patch that adds an
additional call site for steal_account_process_time?

I do not have that patch in my tree yet, and one additional
line of change will be needed.

-- 

All Rights Reversed.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-12 15:58                   ` Rik van Riel
  2016-08-13 15:36                     ` Frederic Weisbecker
@ 2016-08-15  8:53                     ` Wanpeng Li
  2016-08-15 11:38                       ` Wanpeng Li
  2016-08-15 15:00                       ` Rik van Riel
  1 sibling, 2 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-15  8:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
[...]
> Wanpeng, does the patch below work for you?

It will break steal time for a full dynticks guest, and there is a
calltrace in the thread_group_cputime_adjusted() call path; the RIP is
cputime_adjust+0xff/0x130.

>
> Everybody else, does this patch look acceptable?
>
> ---8<---
> Subject: time,virt: do not limit steal_account_process_time
>
> When a guest is interrupted for a longer amount of time, missed clock
> ticks are not redelivered later. Because of that, we should not limit

Interesting, so do we need to add a feature like lapic timer
interrupt coalescing in kvm? There is RTC interrupt coalescing in
qemu to handle the scenario you mentioned for windows guests; do we
need a similar feature for linux guests? Paolo, Radim? I can try it if
you think it's valuable. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-15  8:53                     ` Wanpeng Li
@ 2016-08-15 11:38                       ` Wanpeng Li
  2016-08-15 15:00                       ` Rik van Riel
  1 sibling, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-15 11:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-15 16:53 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
> [...]
>> Wanpeng, does the patch below work for you?
>
> It will break steal time for a full dynticks guest, and there is a
> call trace through thread_group_cputime_adjusted; the RIP is
> cputime_adjust+0xff/0x130.

I have a patch on hand and will send it out soon. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-15  8:53                     ` Wanpeng Li
  2016-08-15 11:38                       ` Wanpeng Li
@ 2016-08-15 15:00                       ` Rik van Riel
  2016-08-15 22:19                         ` Wanpeng Li
  2016-08-16  1:31                         ` Wanpeng Li
  1 sibling, 2 replies; 42+ messages in thread
From: Rik van Riel @ 2016-08-15 15:00 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 1111 bytes --]

On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
> 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
> [...]
> > Wanpeng, does the patch below work for you?
> 
> It will break steal time for a full dynticks guest, and there is a
> call trace through thread_group_cputime_adjusted; the RIP is
> cputime_adjust+0xff/0x130.

How?  This patch is equivalent to passing ULONG_MAX to
steal_account_process_time, which you tried to no ill
effect before.
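
(For reference, a condensed sketch of what steal_account_process_time()
looks like after 57430218317e, paraphrased from that kernel's
kernel/sched/cputime.c; details such as the CONFIG_PARAVIRT guard are
omitted, so treat it as illustrative rather than the exact upstream
code.)

        static cputime_t steal_account_process_time(cputime_t maxtime)
        {
                cputime_t steal_cputime;
                u64 steal;

                if (!static_key_false(&paravirt_steal_enabled))
                        return 0;

                /*
                 * Steal time reported by the host, minus what was
                 * already accounted on this runqueue.
                 */
                steal = paravirt_steal_clock(smp_processor_id());
                steal -= this_rq()->prev_steal_time;

                /*
                 * The clamp under discussion: never account more than
                 * 'maxtime' worth of steal time in a single call.
                 */
                steal_cputime = min(nsecs_to_cputime(steal), maxtime);
                account_steal_time(steal_cputime);
                this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);

                return steal_cputime;
        }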

Do you have the full call trace?

> > Subject: time,virt: do not limit steal_account_process_time
> > 
> > When a guest is interrupted for a longer amount of time, missed
> > clock
> > ticks are not redelivered later. Because of that, we should not
> > limit
> 
> Interesting, so do we need to add a feature like lapic timer
> interrupt coalescing in kvm? There is RTC interrupt coalescing in
> qemu to handle the scenario you mentioned for windows guests; do we
> need a similar feature for linux guests? Paolo, Radim? I can try it
> if you think it's valuable. :)
> 
> Regards,
> Wanpeng Li
-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-15 15:00                       ` Rik van Riel
@ 2016-08-15 22:19                         ` Wanpeng Li
  2016-08-16  1:31                         ` Wanpeng Li
  1 sibling, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-15 22:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
>> 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> [...]
>> > Wanpeng, does the patch below work for you?
>>
>> It will break steal time for a full dynticks guest, and there is a
>> call trace through thread_group_cputime_adjusted; the RIP is
>> cputime_adjust+0xff/0x130.
>
> How?  This patch is equivalent to passing ULONG_MAX to
> steal_account_process_time, which you tried to no ill
> effect before.

I sent out a patch yesterday which fixes the regression:
https://lkml.org/lkml/2016/8/15/217. Your patch breaks full dynticks
guests: vtime doesn't depend on clock ticks, so we should keep the max
cputime limit for it, as my patch description mentions; removing the
limit for vtime results in the call trace. My patch does what Paolo
suggested in https://lkml.org/lkml/2016/8/12/380 for the lost-ticks
scenario.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-15 15:00                       ` Rik van Riel
  2016-08-15 22:19                         ` Wanpeng Li
@ 2016-08-16  1:31                         ` Wanpeng Li
  2016-08-16  2:11                           ` Rik van Riel
  1 sibling, 1 reply; 42+ messages in thread
From: Wanpeng Li @ 2016-08-16  1:31 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
>> 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> [...]
>> > Wanpeng, does the patch below work for you?
>>
>> It will break steal time for a full dynticks guest, and there is a
>> call trace through thread_group_cputime_adjusted; the RIP is
>> cputime_adjust+0xff/0x130.
>
> How?  This patch is equivalent to passing ULONG_MAX to
> steal_account_process_time, which you tried to no ill
> effect before.

Paolo originally suggested adding the max cputime limit to vtime:
https://lkml.org/lkml/2016/6/8/404/. When the cpu is running in nohz
full mode and stops the tick, jiffies in the guest are updated from
the clock source instead of the clock event device (the
tick_nohz_update_jiffies() callsite, via ktime_get()), so vtime is
not affected by lost clock ticks. My patch keeps the limit for vtime
and removes the limit for non-vtime. However, your patch removes the
limit for both scenarios and results in the below call trace for
vtime.
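
To make the split concrete, here is a sketch of the intent; the
function names are real, but the hunks are illustrative rather than
the exact patch:

        /*
         * Tick-based accounting paths: clock ticks can be lost while
         * the guest is not running, so do not cap the steal time that
         * gets folded in.
         */
        steal = steal_account_process_time(ULONG_MAX);

        /*
         * vtime path (CONFIG_VIRT_CPU_ACCOUNTING_GEN): deltas come
         * from jiffies, which nohz full keeps updated via ktime_get(),
         * so keep the cap to avoid underflow in get_vtime_delta().
         */
        other = account_other_time(delta);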

>
> Do you have the full call trace?

[    6.929856] divide error: 0000 [#1] SMP
[    6.934217] Modules linked in:
[    6.937759] CPU: 3 PID: 57 Comm: kworker/u8:1 Not tainted 4.7.0+ #36
[    6.946105] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[    6.953951] Workqueue: events_unbound call_usermodehelper_exec_work
[    6.965726] task: ffff8e22b9785040 task.stack: ffff8e22b8b64000
[    6.970820] RIP: 0010:[<ffffffff870c7b4f>]  [<ffffffff870c7b4f>]
cputime_adjust+0xff/0x130
[    6.981841] RSP: 0000:ffff8e22b8b67b78  EFLAGS: 00010887
[    6.985946] RAX: a528afff5ad75000 RBX: ffff8e222e243c18 RCX: ffff8e22b8b67c28
[    7.001166] RDX: 0000000000000000 RSI: 0000000000000296 RDI: 0000000000000000
[    7.008758] RBP: ffff8e22b8b67ba8 R08: 00000000ffffffff R09: 00000000a528b000
[    7.015653] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000014a516
[    7.021376] R13: ffff8e22b8b67bb8 R14: ffff8e222e243c28 R15: ffff8e22b8b67c20
[    7.035498] FS:  0000000000000000(0000) GS:ffff8e22bac00000(0000)
knlGS:0000000000000000
[    7.054809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.066571] CR2: 00000000ffffffff CR3: 000000007ae06000 CR4: 00000000001406e0
[    7.075162] Stack:
[    7.090141]  ffff8e22b8b67c28 ffff8e222e371ac0 ffff8e22b8b67c20
ffff8e22b8b67c28
[    7.108512]  ffff8e222e371ac0 ffff8e22b8b67cc0 ffff8e22b8b67be8
ffffffff870c8c01
[    7.123025]  00000000000e0471 fffffffffffdcf32 000000000014a516
ffff8e22b9785040
[    7.140622] Call Trace:
[    7.153076]  [<ffffffff870c8c01>] thread_group_cputime_adjusted+0x41/0x50
[    7.160807]  [<ffffffff870913bf>] wait_consider_task+0xa4f/0xff0
[    7.176449]  [<ffffffff87090fc1>] ? wait_consider_task+0x651/0xff0
[    7.186281]  [<ffffffff87091a3f>] ? do_wait+0xdf/0x320
[    7.226606]  [<ffffffff87091a7b>] do_wait+0x11b/0x320
[    7.239670]  [<ffffffff87093014>] SyS_wait4+0x64/0xc0
[    7.245385]  [<ffffffff87090180>] ? task_stopped_code+0x50/0x50
[    7.255924]  [<ffffffff870a8470>] call_usermodehelper_exec_work+0x70/0xb0
[    7.263011]  [<ffffffff870acbd0>] process_one_work+0x1e0/0x670
[    7.273051]  [<ffffffff870acb51>] ? process_one_work+0x161/0x670
[    7.277991]  [<ffffffff870ad18b>] worker_thread+0x12b/0x4a0
[    7.286920]  [<ffffffff870ad060>] ? process_one_work+0x670/0x670
[    7.291745]  [<ffffffff870b4011>] kthread+0x101/0x120
[    7.296878]  [<ffffffff878c94cf>] ret_from_fork+0x1f/0x40
[    7.306511]  [<ffffffff870b3f10>] ? kthread_create_on_node+0x250/0x250
[    7.311985] Code: 4d 39 c8 76 c1 4c 89 d0 48 c1 e8 20 48 85 c0 74
ca 4c 89 c0 49 d1 ea 4d 89 c8 48 d1 e8 49 89 c1 eb 9f 44 89 c8 31 d2
49 0f af c0 <49> f7 f2 4d 89 e2 48 39 f8 48 0f 42 c7 49 29 c2 4d 39 d3
76 0b
[    7.357565] RIP  [<ffffffff870c7b4f>] cputime_adjust+0xff/0x130
[    7.364633]  RSP <ffff8e22b8b67b78>
[    7.373247] ---[ end trace 76ca7475a22c5d43 ]---

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-16  1:31                         ` Wanpeng Li
@ 2016-08-16  2:11                           ` Rik van Riel
  2016-08-16  6:54                             ` Wanpeng Li
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2016-08-16  2:11 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 6964 bytes --]

On Tue, 2016-08-16 at 09:31 +0800, Wanpeng Li wrote:
> 2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
> > > 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > > [...]
> > > > Wanpeng, does the patch below work for you?
> > > 
> > > It will break steal time for a full dynticks guest, and there
> > > is a call trace through thread_group_cputime_adjusted; the RIP
> > > is cputime_adjust+0xff/0x130.
> > 
> > How?  This patch is equivalent to passing ULONG_MAX to
> > steal_account_process_time, which you tried to no ill
> > effect before.
> 
> Paolo originally suggested adding the max cputime limit to vtime:
> https://lkml.org/lkml/2016/6/8/404/. When the cpu is running in nohz
> full mode and stops the tick, jiffies in the guest are updated from
> the clock source instead of the clock event device (the
> tick_nohz_update_jiffies() callsite, via ktime_get()), so vtime is
> not affected by lost clock ticks. My patch keeps the limit for vtime
> and removes the limit for non-vtime. However, your patch removes the
> limit for both scenarios and results in the below call trace for
> vtime.

I understand what it does.

What I would like to understand is WHY enforcing the limit
is the right thing when using vtime, and the wrong thing
in all other scenarios.

Can you explain why you change the limit to ULONG_MAX in
three call sites, but not in the last one?

What is different about the first three, versus the last
one?

Are you sure it should be changed in three places, and
not in eg. two?

This seems like something we should try to understand,
rather than patch up haphazardly.

The changelog of your patch could use an explanation of
why the change is the correct way to go.

> > 
> > Do you have the full call trace?

OK, so you are seeing a divide by zero in cputime_adjust.

Specifically, this would be scale_stime getting passed
a (utime + stime) that adds up to 0. Stranger still, that
only gets called if neither utime nor stime is 0, meaning
that one of utime or stime is negative, at the exact same
magnitude as the other.
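
(For reference, a condensed paraphrase of the relevant part of
cputime_adjust() in this kernel; trimmed down, so illustrative rather
than the exact code:)

        /* rtime is derived from sum_exec_runtime (sched clock based) */
        if (utime == 0) {
                stime = rtime;                  /* no division */
        } else if (stime == 0) {
                utime = rtime;                  /* no division */
        } else {
                /*
                 * Divides by (stime + utime); if that u64 sum wraps
                 * around to 0, we get the divide error in the trace
                 * quoted below.
                 */
                stime = scale_stime(stime, rtime, stime + utime);
                utime = rtime - stime;
        }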

Looking at thread_group_cputime(), I see some room for
rounding errors.

        do {
                seq = nextseq;
                flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
                times->utime = sig->utime;
                times->stime = sig->stime;
                times->sum_exec_runtime = sig->sum_sched_runtime;

                for_each_thread(tsk, t) {
                        task_cputime(t, &utime, &stime);
                        times->utime += utime;
                        times->stime += stime;
                        times->sum_exec_runtime += task_sched_runtime(t);
                }
                /* If lockless access failed, take the lock. */
                nextseq = 1;
        } while (need_seqretry(&sig->stats_lock, seq));

Specifically, task_cputime calls vtime_delta, which works
off jiffies, while task_sched_runtime works straight off
the sched clock.

I can see how this would lead to a non-zero ->sum_exec_runtime,
while ->utime and/or ->stime are still at zero. This is fine.

What I do not see is how ->utime or ->stime could end up negative,
which is what would be needed to hit that divide by zero.

Unless I am overlooking something...

This would be a good thing to debug.

> [    6.929856] divide error: 0000 [#1] SMP
> [    6.934217] Modules linked in:
> [    6.937759] CPU: 3 PID: 57 Comm: kworker/u8:1 Not tainted 4.7.0+
> #36
> [    6.946105] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [    6.953951] Workqueue: events_unbound
> call_usermodehelper_exec_work
> [    6.965726] task: ffff8e22b9785040 task.stack: ffff8e22b8b64000
> [    6.970820] RIP: 0010:[<ffffffff870c7b4f>]  [<ffffffff870c7b4f>]
> cputime_adjust+0xff/0x130
> [    6.981841] RSP: 0000:ffff8e22b8b67b78  EFLAGS: 00010887
> [    6.985946] RAX: a528afff5ad75000 RBX: ffff8e222e243c18 RCX:
> ffff8e22b8b67c28
> [    7.001166] RDX: 0000000000000000 RSI: 0000000000000296 RDI:
> 0000000000000000
> [    7.008758] RBP: ffff8e22b8b67ba8 R08: 00000000ffffffff R09:
> 00000000a528b000
> [    7.015653] R10: 0000000000000000 R11: 0000000000000000 R12:
> 000000000014a516
> [    7.021376] R13: ffff8e22b8b67bb8 R14: ffff8e222e243c28 R15:
> ffff8e22b8b67c20
> [    7.035498] FS:  0000000000000000(0000) GS:ffff8e22bac00000(0000)
> knlGS:0000000000000000
> [    7.054809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    7.066571] CR2: 00000000ffffffff CR3: 000000007ae06000 CR4:
> 00000000001406e0
> [    7.075162] Stack:
> [    7.090141]  ffff8e22b8b67c28 ffff8e222e371ac0 ffff8e22b8b67c20
> ffff8e22b8b67c28
> [    7.108512]  ffff8e222e371ac0 ffff8e22b8b67cc0 ffff8e22b8b67be8
> ffffffff870c8c01
> [    7.123025]  00000000000e0471 fffffffffffdcf32 000000000014a516
> ffff8e22b9785040
> [    7.140622] Call Trace:
> [    7.153076]  [<ffffffff870c8c01>]
> thread_group_cputime_adjusted+0x41/0x50
> [    7.160807]  [<ffffffff870913bf>] wait_consider_task+0xa4f/0xff0
> [    7.176449]  [<ffffffff87090fc1>] ? wait_consider_task+0x651/0xff0
> [    7.186281]  [<ffffffff87091a3f>] ? do_wait+0xdf/0x320
> [    7.226606]  [<ffffffff87091a7b>] do_wait+0x11b/0x320
> [    7.239670]  [<ffffffff87093014>] SyS_wait4+0x64/0xc0
> [    7.245385]  [<ffffffff87090180>] ? task_stopped_code+0x50/0x50
> [    7.255924]  [<ffffffff870a8470>]
> call_usermodehelper_exec_work+0x70/0xb0
> [    7.263011]  [<ffffffff870acbd0>] process_one_work+0x1e0/0x670
> [    7.273051]  [<ffffffff870acb51>] ? process_one_work+0x161/0x670
> [    7.277991]  [<ffffffff870ad18b>] worker_thread+0x12b/0x4a0
> [    7.286920]  [<ffffffff870ad060>] ? process_one_work+0x670/0x670
> [    7.291745]  [<ffffffff870b4011>] kthread+0x101/0x120
> [    7.296878]  [<ffffffff878c94cf>] ret_from_fork+0x1f/0x40
> [    7.306511]  [<ffffffff870b3f10>] ?
> kthread_create_on_node+0x250/0x250
> [    7.311985] Code: 4d 39 c8 76 c1 4c 89 d0 48 c1 e8 20 48 85 c0 74
> ca 4c 89 c0 49 d1 ea 4d 89 c8 48 d1 e8 49 89 c1 eb 9f 44 89 c8 31 d2
> 49 0f af c0 <49> f7 f2 4d 89 e2 48 39 f8 48 0f 42 c7 49 29 c2 4d 39
> d3
> 76 0b
> [    7.357565] RIP  [<ffffffff870c7b4f>] cputime_adjust+0xff/0x130
> [    7.364633]  RSP <ffff8e22b8b67b78>
> [    7.373247] ---[ end trace 76ca7475a22c5d43 ]---
-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-16  2:11                           ` Rik van Riel
@ 2016-08-16  6:54                             ` Wanpeng Li
  2016-08-16 14:01                               ` Rik van Riel
  0 siblings, 1 reply; 42+ messages in thread
From: Wanpeng Li @ 2016-08-16  6:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-16 10:11 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-16 at 09:31 +0800, Wanpeng Li wrote:
>> 2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
>> > > 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > [...]
>> > > > Wanpeng, does the patch below work for you?
>> > >
>> > > It will break steal time for a full dynticks guest, and there
>> > > is a call trace through thread_group_cputime_adjusted; the RIP
>> > > is cputime_adjust+0xff/0x130.
>> >
>> > How?  This patch is equivalent to passing ULONG_MAX to
>> > steal_account_process_time, which you tried to no ill
>> > effect before.
>>
>> Paolo originally suggested adding the max cputime limit to vtime:
>> https://lkml.org/lkml/2016/6/8/404/. When the cpu is running in nohz
>> full mode and stops the tick, jiffies in the guest are updated from
>> the clock source instead of the clock event device (the
>> tick_nohz_update_jiffies() callsite, via ktime_get()), so vtime is
>> not affected by lost clock ticks. My patch keeps the limit for vtime
>> and removes the limit for non-vtime. However, your patch removes the
>> limit for both scenarios and results in the below call trace for
>> vtime.
>
> I understand what it does.
>
> What I would like to understand is WHY enforcing the limit
> is the right thing when using vtime, and the wrong thing
> in all other scenarios.

While debugging your bugfix patch I observed that get_vtime_delta()
underflows, which means that delta < other. I believe that is why
Paolo suggested adding the max cputime limit to vtime; he also
pointed out the potential underflow before:
https://lkml.org/lkml/2016/6/8/404/

>
> Can you explain why you change the limit to ULONG_MAX in
> three call sites, but not in the last one?
>
> What is different about the first three, versus the last
> one?
>
> Are you sure it should be changed in three places, and
> not in eg. two?
>
> This seems like something we should try to understand,
> rather than patch up haphazardly.
>
> The changelog of your patch could use an explanation of
> why the change is the correct way to go.
>
>> >
>> > Do you have the full call trace?
>
> OK, so you are seeing a divide by zero in cputime_adjust.
>
> Specifically, this would be scale_stime getting passed
> a (utime + stime) that adds up to 0. Stranger still, that
> only gets called if neither utime nor stime is 0, meaning
> that one of utime or stime is negative, at the exact same
> magnitude as the other.
>
> Looking at thread_group_cputime(), I see some room for
> rounding errors.
>
>         do {
>                 seq = nextseq;
>                 flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
>                 times->utime = sig->utime;
>                 times->stime = sig->stime;
>                 times->sum_exec_runtime = sig->sum_sched_runtime;
>
>                 for_each_thread(tsk, t) {
>                         task_cputime(t, &utime, &stime);
>                         times->utime += utime;
>                         times->stime += stime;
>                 times->sum_exec_runtime += task_sched_runtime(t);
>                 }
>                 /* If lockless access failed, take the lock. */
>                 nextseq = 1;
>         } while (need_seqretry(&sig->stats_lock, seq));
>
> Specifically, task_cputime calls vtime_delta, which works
> off jiffies, while task_sched_runtime works straight off
> the sched clock.

I tried replacing task_sched_runtime() with t->se.sum_exec_runtime just
for testing (refer to commit d670ec13178d0, "posix-cpu-timers: Cure SMP
wobbles"); the divide by zero still appears.

>
> I can see how this would lead to a non-zero ->sum_exec_runtime,
> while ->utime and/or ->stime are still at zero. This is fine.
>
> What I do not see is how ->utime or ->stime could end up negative,
> which is what would be needed to hit that divide by zero.

stime is negative since an underflow occurred in get_vtime_delta();
however, it is a u64, so it shows up as a very big number, and the
total becomes 0 after the for loop in scale_stime(), though it is not
0 before that.
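
A tiny userspace demonstration of that wrap (a standalone sketch with
made-up numbers, not kernel code):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                /*
                 * delta is rounded down to whole jiffies, other is
                 * not, so other can end up larger than delta.
                 */
                uint64_t delta = 100;
                uint64_t other = 150;

                /* "negative" result wraps to a huge u64 value */
                uint64_t stime = delta - other;

                printf("stime = %llu\n", (unsigned long long)stime);
                return 0;
        }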

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-16  6:54                             ` Wanpeng Li
@ 2016-08-16 14:01                               ` Rik van Riel
  2016-08-16 23:08                                 ` Wanpeng Li
  0 siblings, 1 reply; 42+ messages in thread
From: Rik van Riel @ 2016-08-16 14:01 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 2730 bytes --]

On Tue, 2016-08-16 at 14:54 +0800, Wanpeng Li wrote:
> 2016-08-16 10:11 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > On Tue, 2016-08-16 at 09:31 +0800, Wanpeng Li wrote:
> > > 2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > > > On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
> > > > > 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > > > > [...]
> > > > > > Wanpeng, does the patch below work for you?
> > > > > 
> > > > > It will break steal time for a full dynticks guest, and
> > > > > there is a call trace through
> > > > > thread_group_cputime_adjusted; the RIP is
> > > > > cputime_adjust+0xff/0x130.
> > > > 
> > > > How?  This patch is equivalent to passing ULONG_MAX to
> > > > steal_account_process_time, which you tried to no ill
> > > > effect before.
> > > 
> > > Paolo originally suggested adding the max cputime limit to
> > > vtime: https://lkml.org/lkml/2016/6/8/404/. When the cpu is
> > > running in nohz full mode and stops the tick, jiffies in the
> > > guest are updated from the clock source instead of the clock
> > > event device (the tick_nohz_update_jiffies() callsite, via
> > > ktime_get()), so vtime is not affected by lost clock ticks. My
> > > patch keeps the limit for vtime and removes the limit for
> > > non-vtime. However, your patch removes the limit for both
> > > scenarios and results in the below call trace for vtime.
> > 
> > I understand what it does.
> > 
> > What I would like to understand is WHY enforcing the limit
> > is the right thing when using vtime, and the wrong thing
> > in all other scenarios.
> 
> While debugging your bugfix patch I observed that get_vtime_delta()
> underflows, which means that delta < other. I believe that is why
> Paolo suggested adding the max cputime limit to vtime; he also
> pointed out the potential underflow before:
> https://lkml.org/lkml/2016/6/8/404/

Looking at get_vtime_delta() I can see exactly how the underflow
can happen.  The interval returned by account_other_time() is NOT
rounded down to the nearest jiffy, while the base interval it is
subtracted from is.

Furthermore, even if we did not have that rounding issue, a guest
could get preempted in between determining delta and calling
account_other_time(), which could also cause the issue.
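
To illustrate, a sketch of the shape of get_vtime_delta() in this
kernel (paraphrased and trimmed, so treat the details as approximate):

        static cputime_t get_vtime_delta(struct task_struct *tsk)
        {
                unsigned long now = READ_ONCE(jiffies);
                cputime_t delta, other;

                /* rounded down to whole jiffies... */
                delta = jiffies_to_cputime(now - tsk->vtime_snap);
                /*
                 * ...while the steal/irq time is not, so it must be
                 * capped at 'delta' here; without the cap, 'other'
                 * can exceed 'delta' and the unsigned subtraction
                 * below wraps around.
                 */
                other = account_other_time(delta);
                tsk->vtime_snap = now;

                return delta - other;
        }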

Could you re-send your patch with a comment in get_vtime_delta(),
as well as the changelog, explaining exactly why account_other_time()
should be limited in get_vtime_delta(), but not at the other
three call sites?

Documentation could save future developers a bunch of debugging
time on this code.

thanks,

Rik
-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-16 14:01                               ` Rik van Riel
@ 2016-08-16 23:08                                 ` Wanpeng Li
  0 siblings, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-16 23:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Frederic Weisbecker, Ingo Molnar, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-16 22:01 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Tue, 2016-08-16 at 14:54 +0800, Wanpeng Li wrote:
>> 2016-08-16 10:11 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > On Tue, 2016-08-16 at 09:31 +0800, Wanpeng Li wrote:
>> > > 2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > > On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
>> > > > > 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@redhat.com>:
>> > > > > [...]
>> > > > > > Wanpeng, does the patch below work for you?
>> > > > >
>> > > > > It will break steal time for a full dynticks guest, and
>> > > > > there is a call trace through
>> > > > > thread_group_cputime_adjusted; the RIP is
>> > > > > cputime_adjust+0xff/0x130.
>> > > >
>> > > > How?  This patch is equivalent to passing ULONG_MAX to
>> > > > steal_account_process_time, which you tried to no ill
>> > > > effect before.
>> > >
>> > > Paolo originally suggested adding the max cputime limit to
>> > > vtime: https://lkml.org/lkml/2016/6/8/404/. When the cpu is
>> > > running in nohz full mode and stops the tick, jiffies in the
>> > > guest are updated from the clock source instead of the clock
>> > > event device (the tick_nohz_update_jiffies() callsite, via
>> > > ktime_get()), so vtime is not affected by lost clock ticks. My
>> > > patch keeps the limit for vtime and removes the limit for
>> > > non-vtime. However, your patch removes the limit for both
>> > > scenarios and results in the below call trace for vtime.
>> >
>> > I understand what it does.
>> >
>> > What I would like to understand is WHY enforcing the limit
>> > is the right thing when using vtime, and the wrong thing
>> > in all other scenarios.
>>
>> While debugging your bugfix patch I observed that get_vtime_delta()
>> underflows, which means that delta < other. I believe that is why
>> Paolo suggested adding the max cputime limit to vtime; he also
>> pointed out the potential underflow before:
>> https://lkml.org/lkml/2016/6/8/404/
>
> Looking at get_vtime_delta() I can see exactly how the underflow
> can happen.  The interval returned by account_other_time() is NOT
> rounded down to the nearest jiffy, while the base interval it is
> subtracted from is.
>
> Furthermore, even if we did not have that rounding issue, a guest
> could get preempted in between determining delta and calling
> account_other_time(), which could also cause the issue.
>
> Could you re-send your patch with a comment in get_vtime_delta(),
> as well as the changelog, explaining exactly why account_other_time()
> should be limited in get_vtime_delta(), but not at the other
> three call sites?
>
> Documentation could save future developers a bunch of debugging
> time on this code.

Will do. Thanks for bearing with me through such a long discussion;
I'm very happy we finally came to an agreement. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] time,virt: resync steal time when guest & host lose sync
  2016-08-13  8:42             ` Ingo Molnar
  2016-08-14  1:50               ` Rik van Riel
@ 2016-08-18  8:23               ` Wanpeng Li
  1 sibling, 0 replies; 42+ messages in thread
From: Wanpeng Li @ 2016-08-18  8:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Frederic Weisbecker, LKML, Paolo Bonzini,
	Peter Zijlstra, Wanpeng Li, Thomas Gleixner, Radim Krcmar,
	Mike Galbraith

2016-08-13 16:42 GMT+08:00 Ingo Molnar <mingo@kernel.org>:
>
> * Rik van Riel <riel@redhat.com> wrote:
>
>> On Wed, 10 Aug 2016 07:39:08 +0800
>> Wanpeng Li <kernellwp@gmail.com> wrote:
>>
>> > The regression is caused by your commit "sched,time: Count actually
>> > elapsed irq & softirq time".
>>
>> Wanpeng, does this patch fix your issue?
>>
>> Paolo, what is your opinion on this issue?
>>
>> I can think of all kinds of ways in which guest and host might lose
>> sync with steal time, from uninitialized values at boot, to guest
>> pause, followed by save to disk, and reload, to live migration, to...
>>
>> ---8<---
>>
>> Subject: time,virt: resync steal time when guest & host lose sync
>>
>> When guest and host wildly disagree on steal time, a guest can
>> do several things:
>> 1) Quickly account all the steal time at once (the kernel did this before
>>    57430218317e ("sched/cputime: Count actually elapsed irq & softirq time"),
>>    when steal_account_process_ticks got ULONG_MAX as its maximum value).
>> 2) Stay out of sync for an indeterminate amount of time. This is what the
>>    system does today.
>> 3) Sync up the guest value to the host-provided value, without accounting
>>    an absurdly large value in the cpu time statistics.
>>
>> This patch makes the kernel do (3), which seems like the right thing
>> to do.
>>
>> The exact value of the threshold used probably does not matter too much,
>> as long as it is long enough to cover all the timer ticks that passed
>> during an idle period, because (irqtime_)account_idle_ticks can process
>> a large amount of time all at once.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Reported-by: Wanpeng Li <kernellwp@gmail.com>
>> ---
>>  kernel/sched/cputime.c | 12 +++++++++++-
>>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> fails to build on x86 allnoconfig:
>
>   kernel/sched/cputime.c:524:10: error: too many arguments to function ‘steal_account_process_time’

Please try this one. https://lkml.org/lkml/2016/8/16/931

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2016-08-18  8:38 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-13 14:50 [GIT PULL] cputime fixes and cleanups Frederic Weisbecker
2016-07-13 14:50 ` [PATCH 1/5] sched,time: Count actually elapsed irq & softirq time Frederic Weisbecker
2016-07-14 10:37   ` [tip:timers/nohz] sched/cputime: " tip-bot for Rik van Riel
2016-08-09  3:59   ` [PATCH 1/5] sched,time: " Wanpeng Li
2016-08-09 14:06     ` Rik van Riel
2016-08-09 23:07       ` Wanpeng Li
2016-08-10  7:51         ` Wanpeng Li
2016-08-09 23:25       ` Wanpeng Li
2016-08-09 23:31         ` Wanpeng Li
2016-08-09 23:35         ` Wanpeng Li
2016-08-09 23:39         ` Wanpeng Li
2016-08-10  5:07           ` Rik van Riel
2016-08-10  6:33             ` Wanpeng Li
2016-08-10 16:52           ` [PATCH] time,virt: resync steal time when guest & host lose sync Rik van Riel
2016-08-11 10:11             ` Wanpeng Li
2016-08-12  2:44               ` Rik van Riel
2016-08-12  7:09                 ` Wanpeng Li
2016-08-12 15:58                   ` Rik van Riel
2016-08-13 15:36                     ` Frederic Weisbecker
2016-08-15  8:53                     ` Wanpeng Li
2016-08-15 11:38                       ` Wanpeng Li
2016-08-15 15:00                       ` Rik van Riel
2016-08-15 22:19                         ` Wanpeng Li
2016-08-16  1:31                         ` Wanpeng Li
2016-08-16  2:11                           ` Rik van Riel
2016-08-16  6:54                             ` Wanpeng Li
2016-08-16 14:01                               ` Rik van Riel
2016-08-16 23:08                                 ` Wanpeng Li
2016-08-12 16:33             ` Paolo Bonzini
2016-08-12 17:23               ` Rik van Riel
2016-08-13  7:18                 ` Paolo Bonzini
2016-08-13  8:42             ` Ingo Molnar
2016-08-14  1:50               ` Rik van Riel
2016-08-18  8:23               ` Wanpeng Li
2016-07-13 14:50 ` [PATCH 2/5] nohz,cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code Frederic Weisbecker
2016-07-14 10:37   ` [tip:timers/nohz] sched/cputime: " tip-bot for Rik van Riel
2016-07-13 14:50 ` [PATCH 3/5] sched: Complete cleanup of old vtime gen irqtime accounting Frederic Weisbecker
2016-07-14 10:38   ` [tip:timers/nohz] sched/cputime: Clean up the old vtime gen irqtime accounting completely tip-bot for Frederic Weisbecker
2016-07-13 14:50 ` [PATCH 4/5] sched: Reorganize vtime native irqtime accounting headers Frederic Weisbecker
2016-07-14 10:38   ` [tip:timers/nohz] sched/cputime: " tip-bot for Frederic Weisbecker
2016-07-13 14:50 ` [PATCH 5/5] time: Drop local_irq_save/restore from irqtime_account_irq Frederic Weisbecker
2016-07-14 10:38   ` [tip:timers/nohz] sched/cputime: Drop local_irq_save/restore from irqtime_account_irq() tip-bot for Rik van Riel
