All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] sched/cputime: Resync steal time when guest & host lose sync
@ 2016-08-17  2:05 Wanpeng Li
  2016-08-17  2:15 ` Rik van Riel
  2016-08-18 10:55 ` [tip:sched/core] " tip-bot for Wanpeng Li
  0 siblings, 2 replies; 4+ messages in thread
From: Wanpeng Li @ 2016-08-17  2:05 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Wanpeng Li, Ingo Molnar, Peter Zijlstra, Rik van Riel,
	Paolo Bonzini, Radim Krcmar, Mike Galbraith, Frederic Weisbecker,
	Thomas Gleixner

From: Wanpeng Li <wanpeng.li@hotmail.com>

Commit:

  	57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

... triggered a regression:

| An i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
| cpu hog processes(for loop) running in the guest, I hot-unplug the pCPUs 
| on host one by one until there is only one left, then observe the top in 
| guest, there are 100% st for cpu0(housekeeping), and 75% st for other cpus
| (nohz full mode). However, w/o this commit, 75% for all the four cpus.

When a guest is interrupted for a longer amount of time, missed clock ticks 
are not redelivered later. Because of that, we should not limit the amount 
of steal time accounted to the amount of time that the calling functions 
think have passed.

However, the interval returned by account_other_time() is NOT rounded down 
to the nearest jiffy, while the base interval in get_vtime_delta() it is 
subtracted from is, so the max cputime limit is required to avoid underflow.

This patch fix the regression by limiting the account_other_time() from 
get_vtime_delta() to avoid underflow, and let other three call sites
(account_other_time() and steal_account_process_time()) account however 
much steal time the host told us elapsed. 

Suggested-by: Rik van Riel <riel@redhat.com> 
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v2 -> v3:
 * update code comments
v1 -> v2:
 * add code comments and update the changelog

 kernel/sched/cputime.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9858266..2b9e5e5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -263,6 +263,11 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
+/*
+ * When a guest is interrupted for a longer amount of time, missed clock
+ * ticks are not redelivered later. Due to that, this function may on
+ * occasion account more time than the calling functions think elapsed.
+ */
 static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
@@ -371,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	 * idle, or potentially user or system time. Due to rounding,
 	 * other time can exceed ticks occasionally.
 	 */
-	other = account_other_time(cputime);
+	other = account_other_time(ULONG_MAX);
 	if (other >= cputime)
 		return;
 	cputime -= other;
@@ -486,7 +491,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	}
 
 	cputime = cputime_one_jiffy;
-	steal = steal_account_process_time(cputime);
+	steal = steal_account_process_time(ULONG_MAX);
 
 	if (steal >= cputime)
 		return;
@@ -516,7 +521,7 @@ void account_idle_ticks(unsigned long ticks)
 	}
 
 	cputime = jiffies_to_cputime(ticks);
-	steal = steal_account_process_time(cputime);
+	steal = steal_account_process_time(ULONG_MAX);
 
 	if (steal >= cputime)
 		return;
@@ -694,6 +699,13 @@ static cputime_t get_vtime_delta(struct task_struct *tsk)
 	unsigned long now = READ_ONCE(jiffies);
 	cputime_t delta, other;
 
+	/*
+	 * Unlike tick based timing, vtime based timing never has lost
+	 * ticks, and no need for steal time accounting to make up for
+	 * lost ticks. Vtime accounts a rounded version of actual
+	 * elapsed time. Limit account_other_time to prevent rounding
+	 * errors from causing elapsed vtime to go negative.
+	 */
 	delta = jiffies_to_cputime(now - tsk->vtime_snap);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v3] sched/cputime: Resync steal time when guest & host lose sync
  2016-08-17  2:05 [PATCH v3] sched/cputime: Resync steal time when guest & host lose sync Wanpeng Li
@ 2016-08-17  2:15 ` Rik van Riel
  2016-08-18 10:55 ` [tip:sched/core] " tip-bot for Wanpeng Li
  1 sibling, 0 replies; 4+ messages in thread
From: Rik van Riel @ 2016-08-17  2:15 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel, kvm
  Cc: Wanpeng Li, Ingo Molnar, Peter Zijlstra, Paolo Bonzini,
	Radim Krcmar, Mike Galbraith, Frederic Weisbecker,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 911 bytes --]

On Wed, 2016-08-17 at 10:05 +0800, Wanpeng Li wrote:

> This patch fix the regression by limiting the account_other_time()
> from 
> get_vtime_delta() to avoid underflow, and let other three call sites
> (account_other_time() and steal_account_process_time()) account
> however 
> much steal time the host told us elapsed. 
> 
> Suggested-by: Rik van Riel <riel@redhat.com> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krcmar <rkrcmar@redhat.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> 

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [tip:sched/core] sched/cputime: Resync steal time when guest & host lose sync
  2016-08-17  2:05 [PATCH v3] sched/cputime: Resync steal time when guest & host lose sync Wanpeng Li
  2016-08-17  2:15 ` Rik van Riel
@ 2016-08-18 10:55 ` tip-bot for Wanpeng Li
  2016-08-18 12:42   ` Frederic Weisbecker
  1 sibling, 1 reply; 4+ messages in thread
From: tip-bot for Wanpeng Li @ 2016-08-18 10:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, efault, tglx, peterz, linux-kernel, wanpeng.li, hpa,
	torvalds, rkrcmar, fweisbec, riel, pbonzini

Commit-ID:  03cbc732639ddcad15218c4b2046d255851ff1e3
Gitweb:     http://git.kernel.org/tip/03cbc732639ddcad15218c4b2046d255851ff1e3
Author:     Wanpeng Li <wanpeng.li@hotmail.com>
AuthorDate: Wed, 17 Aug 2016 10:05:46 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:19:48 +0200

sched/cputime: Resync steal time when guest & host lose sync

Commit:

  57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")

... fixed a bug but also triggered a regression:

On an i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
CPU hog processes(for loop) running in the guest, I hot-unplug the pCPUs
on host one by one until there is only one left, then observe CPU utilization
via 'top' in the guest, it shows:

  100% st for cpu0(housekeeping)
   75% st for other CPUs (nohz full mode)

However, w/o this commit it shows the correct 75% for all four CPUs.

When a guest is interrupted for a longer amount of time, missed clock ticks
are not redelivered later. Because of that, we should not limit the amount
of steal time accounted to the amount of time that the calling functions
think have passed.

However, the interval returned by account_other_time() is NOT rounded down
to the nearest jiffy, while the base interval in get_vtime_delta() it is
subtracted from is, so the max cputime limit is required to avoid underflow.

This patch fixes the regression by limiting the account_other_time() from
get_vtime_delta() to avoid underflow, and lets the other three call sites
(in account_other_time() and steal_account_process_time()) account however
much steal time the host told us elapsed.

Suggested-by: Rik van Riel <riel@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2ee83b2..a846cf8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -263,6 +263,11 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
+/*
+ * When a guest is interrupted for a longer amount of time, missed clock
+ * ticks are not redelivered later. Due to that, this function may on
+ * occasion account more time than the calling functions think elapsed.
+ */
 static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 {
 #ifdef CONFIG_PARAVIRT
@@ -371,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	 * idle, or potentially user or system time. Due to rounding,
 	 * other time can exceed ticks occasionally.
 	 */
-	other = account_other_time(cputime);
+	other = account_other_time(ULONG_MAX);
 	if (other >= cputime)
 		return;
 	cputime -= other;
@@ -486,7 +491,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
 	}
 
 	cputime = cputime_one_jiffy;
-	steal = steal_account_process_time(cputime);
+	steal = steal_account_process_time(ULONG_MAX);
 
 	if (steal >= cputime)
 		return;
@@ -516,7 +521,7 @@ void account_idle_ticks(unsigned long ticks)
 	}
 
 	cputime = jiffies_to_cputime(ticks);
-	steal = steal_account_process_time(cputime);
+	steal = steal_account_process_time(ULONG_MAX);
 
 	if (steal >= cputime)
 		return;
@@ -699,6 +704,13 @@ static cputime_t get_vtime_delta(struct task_struct *tsk)
 	unsigned long now = READ_ONCE(jiffies);
 	cputime_t delta, other;
 
+	/*
+	 * Unlike tick based timing, vtime based timing never has lost
+	 * ticks, and no need for steal time accounting to make up for
+	 * lost ticks. Vtime accounts a rounded version of actual
+	 * elapsed time. Limit account_other_time to prevent rounding
+	 * errors from causing elapsed vtime to go negative.
+	 */
 	delta = jiffies_to_cputime(now - tsk->vtime_snap);
 	other = account_other_time(delta);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [tip:sched/core] sched/cputime: Resync steal time when guest & host lose sync
  2016-08-18 10:55 ` [tip:sched/core] " tip-bot for Wanpeng Li
@ 2016-08-18 12:42   ` Frederic Weisbecker
  0 siblings, 0 replies; 4+ messages in thread
From: Frederic Weisbecker @ 2016-08-18 12:42 UTC (permalink / raw)
  To: mingo, wanpeng.li, linux-kernel, peterz, efault, tglx, rkrcmar,
	torvalds, hpa, pbonzini, riel
  Cc: linux-tip-commits

On Thu, Aug 18, 2016 at 03:55:39AM -0700, tip-bot for Wanpeng Li wrote:
> Commit-ID:  03cbc732639ddcad15218c4b2046d255851ff1e3
> Gitweb:     http://git.kernel.org/tip/03cbc732639ddcad15218c4b2046d255851ff1e3
> Author:     Wanpeng Li <wanpeng.li@hotmail.com>
> AuthorDate: Wed, 17 Aug 2016 10:05:46 +0800
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Thu, 18 Aug 2016 11:19:48 +0200
> 
> sched/cputime: Resync steal time when guest & host lose sync
> 
> Commit:
> 
>   57430218317e ("sched/cputime: Count actually elapsed irq & softirq time")
> 
> ... fixed a bug but also triggered a regression:
> 
> On an i5 laptop, 4 pCPUs, 4vCPUs for one full dynticks guest, there are four
> CPU hog processes(for loop) running in the guest, I hot-unplug the pCPUs
> on host one by one until there is only one left, then observe CPU utilization
> via 'top' in the guest, it shows:
> 
>   100% st for cpu0(housekeeping)
>    75% st for other CPUs (nohz full mode)
> 
> However, w/o this commit it shows the correct 75% for all four CPUs.
> 
> When a guest is interrupted for a longer amount of time, missed clock ticks
> are not redelivered later. Because of that, we should not limit the amount
> of steal time accounted to the amount of time that the calling functions
> think have passed.
> 
> However, the interval returned by account_other_time() is NOT rounded down
> to the nearest jiffy, while the base interval in get_vtime_delta() it is
> subtracted from is, so the max cputime limit is required to avoid underflow.
> 
> This patch fixes the regression by limiting the account_other_time() from
> get_vtime_delta() to avoid underflow, and lets the other three call sites
> (in account_other_time() and steal_account_process_time()) account however
> much steal time the host told us elapsed.
> 
> Suggested-by: Rik van Riel <riel@redhat.com>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>

ACK, thanks Wanpeng Li!

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-08-18 12:43 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-17  2:05 [PATCH v3] sched/cputime: Resync steal time when guest & host lose sync Wanpeng Li
2016-08-17  2:15 ` Rik van Riel
2016-08-18 10:55 ` [tip:sched/core] " tip-bot for Wanpeng Li
2016-08-18 12:42   ` Frederic Weisbecker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.