linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 1/3] KVM: fix steal clock warp during guest cpu hotplug
@ 2016-06-08  3:05 Wanpeng Li
  2016-06-08  3:05 ` [PATCH v5 2/3] sched/cputime: Fix prev steal time accouting during " Wanpeng Li
  2016-06-08  3:05 ` [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting Wanpeng Li
  0 siblings, 2 replies; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08  3:05 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Wanpeng Li, Paolo Bonzini, Radim Krčmář,
	Ingo Molnar, Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker, John Stultz

From: Wanpeng Li <wanpeng.li@hotmail.com>

Sometimes, after CPU hotplug you can observe a spike in stolen time
(100%) followed by the CPU being marked as 100% idle when it's actually
busy with a CPU hog task.  The trace looks like the following:

cpuhp/1-12    [001] d.h1   167.461657: account_process_tick: steal = 1291385514, prev_steal_time = 0
cpuhp/1-12    [001] d.h1   167.461659: account_process_tick: steal_jiffies = 1291
<idle>-0     [001] d.h1   167.462663: account_process_tick: steal = 18732255, prev_steal_time = 1291000000
<idle>-0     [001] d.h1   167.462664: account_process_tick: steal_jiffies = 18446744072437

The sudden decrease of "steal" causes steal_jiffies to underflow.
The root cause is kvm_steal_time being reset to 0 after hot-plugging
back in a CPU.  Instead, the preexisting value can be used, which is
what the core scheduler code expects.

John Stultz also reported a similar issue after guest S3.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v4 -> v5:
 * improve commit message
v2 -> v3:
 * fix the root cause
v1 -> v2:
 * update patch subject, description and comments
 * deal with the case where steal time suddenly increases by a ludicrous amount

 arch/x86/kernel/kvm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index eea2a6f..1ef5e48 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -301,8 +301,6 @@ static void kvm_register_steal_time(void)
 	if (!has_steal_clock)
 		return;
 
-	memset(st, 0, sizeof(*st));
-
 	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
 	pr_info("kvm-stealtime: cpu %d, msr %llx\n",
 		cpu, (unsigned long long) slow_virt_to_phys(st));
-- 
1.9.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v5 2/3] sched/cputime: Fix prev steal time accouting during cpu hotplug
  2016-06-08  3:05 [PATCH v5 1/3] KVM: fix steal clock warp during guest cpu hotplug Wanpeng Li
@ 2016-06-08  3:05 ` Wanpeng Li
  2016-06-08 10:01   ` Paolo Bonzini
  2016-06-08  3:05 ` [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting Wanpeng Li
  1 sibling, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08  3:05 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Wanpeng Li, Ingo Molnar, Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Paolo Bonzini, Radim Krčmář

From: Wanpeng Li <wanpeng.li@hotmail.com>

Commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug") 
set rq->prev_* to 0 after a cpu hotplug comes back in order to fix the scenario:

| steal is smaller than rq->prev_steal_time we end up with an insane large
| value which then gets added to rq->prev_steal_time, resulting in a permanent
| wreckage of the accounting.

However, it is still buggy.

rq->prev_steal_time = 0:

As Rik pointed out:

| setting rq->prev_irq_time to 0 in the guest, and then getting a giant value from 
| the host, could result in a very large of steal_jiffies.

rq->prev_steal_time_rq = 0:

| steal = paravirt_steal_clock(cpu_of(rq));
| steal -= rq->prev_steal_time_rq;
|
| if (unlikely(steal > delta))
|	 steal = delta;
|
| rq->prev_steal_time_rq += steal;
| delta -= steal;
|
| rq->clock_task += delta;

steal is a giant value and rq->prev_steal_time_rq is 0, rq->prev_steal_time_rq 
grows in delta granularity, rq->clock_task can't ramp up until rq->prev_steal_time_rq 
catches up steal clock since delta value will be 0 after reducing steal time from 
normal execution time. That's why I obersved that cpuhg/1-12 continue running 
until rq->prev_steal_time_rq catches up steal clock timestamp.

I believe rq->prev_irq_time has similar issue. So this patch fix it by reverting  
commit e9532e69b8d1.

Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")'
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v4 -> v5:
 * revert commit e9532e69b8d1 

 kernel/sched/core.c  |  1 -
 kernel/sched/sched.h | 13 -------------
 2 files changed, 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f2cae4..7d45bb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7213,7 +7213,6 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 	struct rq *rq = cpu_rq(cpu);
 
 	rq->calc_load_update = calc_load_update;
-	account_reset_rq(rq);
 	update_max_interval();
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f30..de607e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1809,16 +1809,3 @@ static inline void cpufreq_trigger_update(u64 time) {}
 #else /* arch_scale_freq_capacity */
 #define arch_scale_freq_invariant()	(false)
 #endif
-
-static inline void account_reset_rq(struct rq *rq)
-{
-#ifdef CONFIG_IRQ_TIME_ACCOUNTING
-	rq->prev_irq_time = 0;
-#endif
-#ifdef CONFIG_PARAVIRT
-	rq->prev_steal_time = 0;
-#endif
-#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
-	rq->prev_steal_time_rq = 0;
-#endif
-}
-- 
1.9.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  3:05 [PATCH v5 1/3] KVM: fix steal clock warp during guest cpu hotplug Wanpeng Li
  2016-06-08  3:05 ` [PATCH v5 2/3] sched/cputime: Fix prev steal time accouting during " Wanpeng Li
@ 2016-06-08  3:05 ` Wanpeng Li
  2016-06-08  7:22   ` Ingo Molnar
                     ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08  3:05 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Wanpeng Li, Ingo Molnar, Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Paolo Bonzini, Radim Krčmář

From: Wanpeng Li <wanpeng.li@hotmail.com>

This patch adds guest steal-time support to full dynticks CPU
time accounting. After the following commit:

ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")

... time sampling became jiffy based, even if it's still listened
to ring boundaries, so steal_account_process_tick() is reused
to account how many 'ticks' are stolen-time, after the last accumulation.

Suggested-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
v4 -> v5:
 * apply same logic to account_idle_time, so change get_vtime_delta instead
v3 -> v4:
 * fix grammar errors, thanks Ingo
 * cleanup fragile codes, thanks Ingo
v2 -> v3:
 * convert steal time jiffies to cputime
v1 -> v2:
 * fix divide zero bug, thanks Rik

 kernel/sched/cputime.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..b62f9f8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
 		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline bool steal_account_process_tick(void)
+static __always_inline unsigned long steal_account_process_tick(void)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
@@ -279,7 +279,7 @@ static __always_inline bool steal_account_process_tick(void)
 		return steal_jiffies;
 	}
 #endif
-	return false;
+	return 0;
 }
 
 /*
@@ -681,12 +681,17 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
 	unsigned long now = READ_ONCE(jiffies);
-	unsigned long delta = now - tsk->vtime_snap;
+	cputime_t delta_time, steal_time;
 
+	steal_time = jiffies_to_cputime(steal_account_process_tick());
+	delta_time = jiffies_to_cputime(now - tsk->vtime_snap);
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
 	tsk->vtime_snap = now;
 
-	return jiffies_to_cputime(delta);
+	if (steal_time < delta_time)
+		delta_time -= steal_time;
+
+	return delta_time;
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
-- 
1.9.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  3:05 ` [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting Wanpeng Li
@ 2016-06-08  7:22   ` Ingo Molnar
  2016-06-08  7:27     ` Wanpeng Li
  2016-06-08 10:14   ` Paolo Bonzini
  2016-06-08 19:05   ` Rik van Riel
  2 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2016-06-08  7:22 UTC (permalink / raw)
  To: Wanpeng Li, Rik van Riel, Frédéric Weisbecker
  Cc: linux-kernel, kvm, Wanpeng Li, Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Paolo Bonzini, Radim Krčmář


* Wanpeng Li <kernellwp@gmail.com> wrote:

> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> This patch adds guest steal-time support to full dynticks CPU
> time accounting. After the following commit:
> 
> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
> 
> ... time sampling became jiffy based, even if it's still listened
> to ring boundaries, so steal_account_process_tick() is reused
> to account how many 'ticks' are stolen-time, after the last accumulation.

So the 'ring boundary' part still doesn't parse (neither grammatically nor 
logically) - please rephrase it because I have no idea what you want to say here.

Did you want to say:

> ... time sampling became jiffy based, even if it's still being context tracked, 
> so steal_account_process_tick() is reused to account how many 'ticks' are 
> stolen-time, after the last accumulation.

... which makes sense grammatically but does not make sense to me logically. :-/

Rik, Frederic, could you please help out?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  7:22   ` Ingo Molnar
@ 2016-06-08  7:27     ` Wanpeng Li
  2016-06-08  7:52       ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08  7:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Frédéric Weisbecker, linux-kernel, kvm,
	Wanpeng Li, Peter Zijlstra (Intel),
	Thomas Gleixner, Paolo Bonzini, Radim Krčmář

2016-06-08 15:22 GMT+08:00 Ingo Molnar <mingo@kernel.org>:
>
> * Wanpeng Li <kernellwp@gmail.com> wrote:
>
>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>
>> This patch adds guest steal-time support to full dynticks CPU
>> time accounting. After the following commit:
>>
>> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
>>
>> ... time sampling became jiffy based, even if it's still listened
>> to ring boundaries, so steal_account_process_tick() is reused
>> to account how many 'ticks' are stolen-time, after the last accumulation.
>
> So the 'ring boundary' part still doesn't parse (neither grammatically nor
> logically) - please rephrase it because I have no idea what you want to say here.

It is original from this slides.
http://ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf,
slide 28.

>
> Did you want to say:
>
>> ... time sampling became jiffy based, even if it's still being context tracked,
>> so steal_account_process_tick() is reused to account how many 'ticks' are
>> stolen-time, after the last accumulation.
>
> ... which makes sense grammatically but does not make sense to me logically. :-/
>
> Rik, Frederic, could you please help out?
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  7:27     ` Wanpeng Li
@ 2016-06-08  7:52       ` Ingo Molnar
  2016-06-08  8:04         ` Wanpeng Li
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2016-06-08  7:52 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Rik van Riel, Frédéric Weisbecker, linux-kernel, kvm,
	Wanpeng Li, Peter Zijlstra (Intel),
	Thomas Gleixner, Paolo Bonzini, Radim Krčmář


* Wanpeng Li <kernellwp@gmail.com> wrote:

> 2016-06-08 15:22 GMT+08:00 Ingo Molnar <mingo@kernel.org>:
> >
> > * Wanpeng Li <kernellwp@gmail.com> wrote:
> >
> >> From: Wanpeng Li <wanpeng.li@hotmail.com>
> >>
> >> This patch adds guest steal-time support to full dynticks CPU
> >> time accounting. After the following commit:
> >>
> >> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
> >>
> >> ... time sampling became jiffy based, even if it's still listened
> >> to ring boundaries, so steal_account_process_tick() is reused
> >> to account how many 'ticks' are stolen-time, after the last accumulation.
> >
> > So the 'ring boundary' part still doesn't parse (neither grammatically nor
> > logically) - please rephrase it because I have no idea what you want to say here.
> 
> It is original from this slides.
> http://ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf,
> slide 28.

Yes, I now understand that this is meant as 'context tracking is active', but I 
don't understand the way you use it in this changelog's context.

Btw., the grammatically correct way to add that phrase would have been:

 ... time sampling became jiffy based, even if it's still listening to ring 
 boundaries, so steal_account_process_tick() is reused to account how many 
 'ticks' are stolen-time, after the last accumulation.

But I still don't understand it, nor did Paolo understand it.

Nor is there any 0/3 boilerplace description that gives some context about what 
these changes are about. Exactly what do you mean by 'add steal-time support' - we 
clearly had that before. So is your patch lifting some limitation? Or was 
steal-time accounting totally inactive with certain dynticks configurations? The 
changelog does not tell us anything about that...

I'd like to quote from a mail of Andrew Morton:

 "Please update the changelog to describe the current behavior.

  Please also describe why you think that behavior should be changed.
  ie: what's the reason for this patch."

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  7:52       ` Ingo Molnar
@ 2016-06-08  8:04         ` Wanpeng Li
  2016-06-08  8:12           ` Wanpeng Li
  0 siblings, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08  8:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Frédéric Weisbecker, linux-kernel, kvm,
	Wanpeng Li, Peter Zijlstra (Intel),
	Thomas Gleixner, Paolo Bonzini, Radim Krčmář

2016-06-08 15:52 GMT+08:00 Ingo Molnar <mingo@kernel.org>:
>
> * Wanpeng Li <kernellwp@gmail.com> wrote:
>
>> 2016-06-08 15:22 GMT+08:00 Ingo Molnar <mingo@kernel.org>:
>> >
>> > * Wanpeng Li <kernellwp@gmail.com> wrote:
>> >
>> >> From: Wanpeng Li <wanpeng.li@hotmail.com>
>> >>
>> >> This patch adds guest steal-time support to full dynticks CPU
>> >> time accounting. After the following commit:
>> >>
>> >> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
>> >>
>> >> ... time sampling became jiffy based, even if it's still listened
>> >> to ring boundaries, so steal_account_process_tick() is reused
>> >> to account how many 'ticks' are stolen-time, after the last accumulation.
>> >
>> > So the 'ring boundary' part still doesn't parse (neither grammatically nor
>> > logically) - please rephrase it because I have no idea what you want to say here.
>>
>> It is original from this slides.
>> http://ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf,
>> slide 28.
>
> Yes, I now understand that this is meant as 'context tracking is active', but I
> don't understand the way you use it in this changelog's context.
>
> Btw., the grammatically correct way to add that phrase would have been:
>
>  ... time sampling became jiffy based, even if it's still listening to ring
>  boundaries, so steal_account_process_tick() is reused to account how many
>  'ticks' are stolen-time, after the last accumulation.

Thanks, Ingo!

>
> But I still don't understand it, nor did Paolo understand it.
>
> Nor is there any 0/3 boilerplace description that gives some context about what
> these changes are about. Exactly what do you mean by 'add steal-time support' - we
> clearly had that before. So is your patch lifting some limitation? Or was
> steal-time accounting totally inactive with certain dynticks configurations? The
> changelog does not tell us anything about that...

Now I understand why you said "write-only code". vtime(depends on
context tracking) which is just used in full dynamic doesn't account
steal time, however, periodic/nohz idle which not use vtime have codes
account steal time in cputime.c, this patch add the steal time
acccount support in vtime which will be used in full dynamic guest.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  8:04         ` Wanpeng Li
@ 2016-06-08  8:12           ` Wanpeng Li
  0 siblings, 0 replies; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08  8:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Frédéric Weisbecker, linux-kernel, kvm,
	Wanpeng Li, Peter Zijlstra (Intel),
	Thomas Gleixner, Paolo Bonzini, Radim Krčmář

2016-06-08 16:04 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> 2016-06-08 15:52 GMT+08:00 Ingo Molnar <mingo@kernel.org>:
>>
>> * Wanpeng Li <kernellwp@gmail.com> wrote:
>>
>>> 2016-06-08 15:22 GMT+08:00 Ingo Molnar <mingo@kernel.org>:
>>> >
>>> > * Wanpeng Li <kernellwp@gmail.com> wrote:
>>> >
>>> >> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>> >>
>>> >> This patch adds guest steal-time support to full dynticks CPU
>>> >> time accounting. After the following commit:
>>> >>
>>> >> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
>>> >>
>>> >> ... time sampling became jiffy based, even if it's still listened
>>> >> to ring boundaries, so steal_account_process_tick() is reused
>>> >> to account how many 'ticks' are stolen-time, after the last accumulation.
>>> >
>>> > So the 'ring boundary' part still doesn't parse (neither grammatically nor
>>> > logically) - please rephrase it because I have no idea what you want to say here.
>>>
>>> It is original from this slides.
>>> http://ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf,
>>> slide 28.
>>
>> Yes, I now understand that this is meant as 'context tracking is active', but I
>> don't understand the way you use it in this changelog's context.
>>
>> Btw., the grammatically correct way to add that phrase would have been:
>>
>>  ... time sampling became jiffy based, even if it's still listening to ring
>>  boundaries, so steal_account_process_tick() is reused to account how many
>>  'ticks' are stolen-time, after the last accumulation.
>
> Thanks, Ingo!
>
>>
>> But I still don't understand it, nor did Paolo understand it.
>>
>> Nor is there any 0/3 boilerplace description that gives some context about what
>> these changes are about. Exactly what do you mean by 'add steal-time support' - we
>> clearly had that before. So is your patch lifting some limitation? Or was
>> steal-time accounting totally inactive with certain dynticks configurations? The
>> changelog does not tell us anything about that...
>
> Now I understand why you said "write-only code". vtime(depends on
> context tracking) which is just used in full dynamic doesn't account

s/dynamic/dynticks

> steal time, however, periodic/nohz idle which not use vtime have codes
> account steal time in cputime.c, this patch add the steal time
> acccount support in vtime which will be used in full dynamic guest.

s/dynamic/dynticks

>
> Regards,
> Wanpeng Li



-- 
Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/3] sched/cputime: Fix prev steal time accouting during cpu hotplug
  2016-06-08  3:05 ` [PATCH v5 2/3] sched/cputime: Fix prev steal time accouting during " Wanpeng Li
@ 2016-06-08 10:01   ` Paolo Bonzini
  2016-06-08 10:59     ` Wanpeng Li
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2016-06-08 10:01 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel, kvm
  Cc: Wanpeng Li, Ingo Molnar, Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Radim Krčmář



On 08/06/2016 05:05, Wanpeng Li wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> Commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug") 
> set rq->prev_* to 0 after a cpu hotplug comes back in order to fix the scenario:
> 
> | steal is smaller than rq->prev_steal_time we end up with an insane large
> | value which then gets added to rq->prev_steal_time, resulting in a permanent
> | wreckage of the accounting.
> 
> However, it is still buggy.
> 
> rq->prev_steal_time = 0:
> 
> As Rik pointed out:
> 
> | setting rq->prev_irq_time to 0 in the guest, and then getting a giant value from 
> | the host, could result in a very large of steal_jiffies.
> 
> rq->prev_steal_time_rq = 0:
> 
> | steal = paravirt_steal_clock(cpu_of(rq));
> | steal -= rq->prev_steal_time_rq;
> |
> | if (unlikely(steal > delta))
> |	 steal = delta;
> |
> | rq->prev_steal_time_rq += steal;
> | delta -= steal;
> |
> | rq->clock_task += delta;
> 
> steal is a giant value and rq->prev_steal_time_rq is 0, rq->prev_steal_time_rq 
> grows in delta granularity, rq->clock_task can't ramp up until rq->prev_steal_time_rq 
> catches up steal clock since delta value will be 0 after reducing steal time from 
> normal execution time. That's why I obersved that cpuhg/1-12 continue running 
> until rq->prev_steal_time_rq catches up steal clock timestamp.
> 
> I believe rq->prev_irq_time has similar issue. So this patch fix it by reverting  
> commit e9532e69b8d1.

This is still very hard to read.  I think the justification for this
patch is much simpler:

-------
Commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU
hotplug") set rq->prev_* to 0 after a cpu hotplug comes back in order to
fix the case where (after CPU hotplug) steal is smaller than
rq->prev_steal_time.

However, this should never happen. steal was only smaller because of the
KVM-specific bug fixed by the previous patch.  Worse, the previous patch
triggers a bug on CPU hot-unplug/plug operation: because
rq->prev_steal_time is cleared, all of the CPU's past steal time will be
accounted again on hot-plug.

Since the root cause has been fixed, we can just revert commit e9532e69b8d1.
-------

Since things are currently pretty broken, I think it's fair to keep the
two patches separate instead of squashing them.

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  3:05 ` [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting Wanpeng Li
  2016-06-08  7:22   ` Ingo Molnar
@ 2016-06-08 10:14   ` Paolo Bonzini
  2016-06-08 11:11     ` Wanpeng Li
  2016-06-13  3:38     ` Wanpeng Li
  2016-06-08 19:05   ` Rik van Riel
  2 siblings, 2 replies; 18+ messages in thread
From: Paolo Bonzini @ 2016-06-08 10:14 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel, kvm
  Cc: Wanpeng Li, Ingo Molnar, Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Radim Krčmář



On 08/06/2016 05:05, Wanpeng Li wrote:
> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> This patch adds guest steal-time support to full dynticks CPU
> time accounting. After the following commit:
> 
> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
> 
> ... time sampling became jiffy based, even if it's still listened
> to ring boundaries, so steal_account_process_tick() is reused
> to account how many 'ticks' are stolen-time, after the last accumulation.
> 
> Suggested-by: Rik van Riel <riel@redhat.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
> ---
> v4 -> v5:
>  * apply same logic to account_idle_time, so change get_vtime_delta instead
> v3 -> v4:
>  * fix grammar errors, thanks Ingo
>  * cleanup fragile codes, thanks Ingo
> v2 -> v3:
>  * convert steal time jiffies to cputime
> v1 -> v2:
>  * fix divide zero bug, thanks Rik
> 
>  kernel/sched/cputime.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..b62f9f8 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
>  		cpustat[CPUTIME_IDLE] += (__force u64) cputime;
>  }
>  
> -static __always_inline bool steal_account_process_tick(void)
> +static __always_inline unsigned long steal_account_process_tick(void)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	if (static_key_false(&paravirt_steal_enabled)) {
> @@ -279,7 +279,7 @@ static __always_inline bool steal_account_process_tick(void)
>  		return steal_jiffies;
>  	}
>  #endif
> -	return false;
> +	return 0;
>  }
>  
>  /*
> @@ -681,12 +681,17 @@ static cputime_t vtime_delta(struct task_struct *tsk)
>  static cputime_t get_vtime_delta(struct task_struct *tsk)
>  {
>  	unsigned long now = READ_ONCE(jiffies);
> -	unsigned long delta = now - tsk->vtime_snap;
> +	cputime_t delta_time, steal_time;
>  
> +	steal_time = jiffies_to_cputime(steal_account_process_tick());
> +	delta_time = jiffies_to_cputime(now - tsk->vtime_snap);
>  	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
>  	tsk->vtime_snap = now;
>  
> -	return jiffies_to_cputime(delta);
> +	if (steal_time < delta_time)
> +		delta_time -= steal_time;
> +
> +	return delta_time;

I think this is wrong.  If you get more steal time than delta time
(which as Rik noticed can happen due to partial jiffies), you will end
up accounting things twice, once in steal_account_process_tick and once
here.  In other words you'll get the exact bug you're trying to fix.

The right thing is to add a max_jiffies argument to
steal_account_process_tick.  steal_account_process_tick will not attempt
to remove more than max_jiffies.  Here you pass delta_jiffies (i.e. now
- tsk->vtime_snap) to steal_account_process_tick, existing callers can
pass ULONG_MAX.  You can then

	return jiffies_to_cputime(delta_jiffies - steal_jiffies);

in get_vtime_delta and not worry about underflow.

Paolo

>  }
>  
>  static void __vtime_account_system(struct task_struct *tsk)
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/3] sched/cputime: Fix prev steal time accouting during cpu hotplug
  2016-06-08 10:01   ` Paolo Bonzini
@ 2016-06-08 10:59     ` Wanpeng Li
  0 siblings, 0 replies; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08 10:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Wanpeng Li, Ingo Molnar,
	Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Radim Krčmář

2016-06-08 18:01 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 08/06/2016 05:05, Wanpeng Li wrote:
>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>
>> Commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
>> set rq->prev_* to 0 after a cpu hotplug comes back in order to fix the scenario:
>>
>> | steal is smaller than rq->prev_steal_time we end up with an insane large
>> | value which then gets added to rq->prev_steal_time, resulting in a permanent
>> | wreckage of the accounting.
>>
>> However, it is still buggy.
>>
>> rq->prev_steal_time = 0:
>>
>> As Rik pointed out:
>>
>> | setting rq->prev_irq_time to 0 in the guest, and then getting a giant value from
>> | the host, could result in a very large of steal_jiffies.
>>
>> rq->prev_steal_time_rq = 0:
>>
>> | steal = paravirt_steal_clock(cpu_of(rq));
>> | steal -= rq->prev_steal_time_rq;
>> |
>> | if (unlikely(steal > delta))
>> |      steal = delta;
>> |
>> | rq->prev_steal_time_rq += steal;
>> | delta -= steal;
>> |
>> | rq->clock_task += delta;
>>
>> steal is a giant value and rq->prev_steal_time_rq is 0, rq->prev_steal_time_rq
>> grows in delta granularity, rq->clock_task can't ramp up until rq->prev_steal_time_rq
>> catches up steal clock since delta value will be 0 after reducing steal time from
>> normal execution time. That's why I obersved that cpuhg/1-12 continue running
>> until rq->prev_steal_time_rq catches up steal clock timestamp.
>>
>> I believe rq->prev_irq_time has similar issue. So this patch fix it by reverting
>> commit e9532e69b8d1.
>
> This is still very hard to read.  I think the justification for this
> patch is much simpler:
>
> -------
> Commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU
> hotplug") set rq->prev_* to 0 after a cpu hotplug comes back in order to
> fix the case where (after CPU hotplug) steal is smaller than
> rq->prev_steal_time.
>
> However, this should never happen. steal was only smaller because of the
> KVM-specific bug fixed by the previous patch.  Worse, the previous patch
> triggers a bug on CPU hot-unplug/plug operation: because
> rq->prev_steal_time is cleared, all of the CPU's past steal time will be
> accounted again on hot-plug.
>
> Since the root cause has been fixed, we can just revert commit e9532e69b8d1.
> -------

Thanks Paolo! :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08 10:14   ` Paolo Bonzini
@ 2016-06-08 11:11     ` Wanpeng Li
  2016-06-13  3:38     ` Wanpeng Li
  1 sibling, 0 replies; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08 11:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Wanpeng Li, Ingo Molnar,
	Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Radim Krčmář

2016-06-08 18:14 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 08/06/2016 05:05, Wanpeng Li wrote:
>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>
>> This patch adds guest steal-time support to full dynticks CPU
>> time accounting. After the following commit:
>>
>> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
>>
>> ... time sampling became jiffy based, even if it's still listened
>> to ring boundaries, so steal_account_process_tick() is reused
>> to account how many 'ticks' are stolen-time, after the last accumulation.
>>
>> Suggested-by: Rik van Riel <riel@redhat.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Frederic Weisbecker <fweisbec@gmail.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>> ---
>> v4 -> v5:
>>  * apply same logic to account_idle_time, so change get_vtime_delta instead
>> v3 -> v4:
>>  * fix grammar errors, thanks Ingo
>>  * cleanup fragile codes, thanks Ingo
>> v2 -> v3:
>>  * convert steal time jiffies to cputime
>> v1 -> v2:
>>  * fix divide zero bug, thanks Rik
>>
>>  kernel/sched/cputime.c | 13 +++++++++----
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> index 75f98c5..b62f9f8 100644
>> --- a/kernel/sched/cputime.c
>> +++ b/kernel/sched/cputime.c
>> @@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
>>               cpustat[CPUTIME_IDLE] += (__force u64) cputime;
>>  }
>>
>> -static __always_inline bool steal_account_process_tick(void)
>> +static __always_inline unsigned long steal_account_process_tick(void)
>>  {
>>  #ifdef CONFIG_PARAVIRT
>>       if (static_key_false(&paravirt_steal_enabled)) {
>> @@ -279,7 +279,7 @@ static __always_inline bool steal_account_process_tick(void)
>>               return steal_jiffies;
>>       }
>>  #endif
>> -     return false;
>> +     return 0;
>>  }
>>
>>  /*
>> @@ -681,12 +681,17 @@ static cputime_t vtime_delta(struct task_struct *tsk)
>>  static cputime_t get_vtime_delta(struct task_struct *tsk)
>>  {
>>       unsigned long now = READ_ONCE(jiffies);
>> -     unsigned long delta = now - tsk->vtime_snap;
>> +     cputime_t delta_time, steal_time;
>>
>> +     steal_time = jiffies_to_cputime(steal_account_process_tick());
>> +     delta_time = jiffies_to_cputime(now - tsk->vtime_snap);
>>       WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
>>       tsk->vtime_snap = now;
>>
>> -     return jiffies_to_cputime(delta);
>> +     if (steal_time < delta_time)
>> +             delta_time -= steal_time;
>> +
>> +     return delta_time;
>
> I think this is wrong.  If you get more steal time than delta time
> (which as Rik noticed can happen due to partial jiffies), you will end
> up accounting things twice, once in steal_account_process_tick and once
> here.  In other words you'll get the exact bug you're trying to fix.
>
> The right thing is to add a max_jiffies argument to
> steal_account_process_tick.  steal_account_process_tick will not attempt
> to remove more than max_jiffies.  Here you pass delta_jiffies (i.e. now
> - tsk->vtime_snap) to steal_account_process_tick, existing callers can
> pass ULONG_MAX.  You can then
>
>         return jiffies_to_cputime(delta_jiffies - steal_jiffies);
>
> in get_vtime_delta and not worry about underflow.

I see, I will do it in next version.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08  3:05 ` [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting Wanpeng Li
  2016-06-08  7:22   ` Ingo Molnar
  2016-06-08 10:14   ` Paolo Bonzini
@ 2016-06-08 19:05   ` Rik van Riel
  2016-06-08 23:57     ` Wanpeng Li
  2 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2016-06-08 19:05 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel, kvm
  Cc: Wanpeng Li, Ingo Molnar, Peter Zijlstra (Intel),
	Thomas Gleixner, Frederic Weisbecker, Paolo Bonzini,
	Radim Krčmář

[-- Attachment #1: Type: text/plain, Size: 929 bytes --]

On Wed, 2016-06-08 at 11:05 +0800, Wanpeng Li wrote:
> 
> @@ -681,12 +681,17 @@ static cputime_t vtime_delta(struct task_struct
> *tsk)
>  static cputime_t get_vtime_delta(struct task_struct *tsk)
>  {
>  	unsigned long now = READ_ONCE(jiffies);
> -	unsigned long delta = now - tsk->vtime_snap;
> +	cputime_t delta_time, steal_time;
>  
> +	steal_time =
> jiffies_to_cputime(steal_account_process_tick());
> +	delta_time = jiffies_to_cputime(now - tsk->vtime_snap);
>  	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
>  	tsk->vtime_snap = now;
>  
> -	return jiffies_to_cputime(delta);
> +	if (steal_time < delta_time)
> +		delta_time -= steal_time;
> +
> +	return delta_time;
>  }

This isn't right.

If steal_time is equal to or larger than delta_time,
get_vtime_delta needs to return 0, not delta_time.

Otherwise the same time will be counted twice.

-- 
All rights reversed

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08 19:05   ` Rik van Riel
@ 2016-06-08 23:57     ` Wanpeng Li
  2016-06-09  1:20       ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2016-06-08 23:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, kvm, Wanpeng Li, Ingo Molnar,
	Peter Zijlstra (Intel),
	Thomas Gleixner, Frederic Weisbecker, Paolo Bonzini,
	Radim Krčmář

2016-06-09 3:05 GMT+08:00 Rik van Riel <riel@redhat.com>:
> On Wed, 2016-06-08 at 11:05 +0800, Wanpeng Li wrote:
>>
>> @@ -681,12 +681,17 @@ static cputime_t vtime_delta(struct task_struct
>> *tsk)
>>  static cputime_t get_vtime_delta(struct task_struct *tsk)
>>  {
>>       unsigned long now = READ_ONCE(jiffies);
>> -     unsigned long delta = now - tsk->vtime_snap;
>> +     cputime_t delta_time, steal_time;
>>
>> +     steal_time =
>> jiffies_to_cputime(steal_account_process_tick());
>> +     delta_time = jiffies_to_cputime(now - tsk->vtime_snap);
>>       WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
>>       tsk->vtime_snap = now;
>>
>> -     return jiffies_to_cputime(delta);
>> +     if (steal_time < delta_time)
>> +             delta_time -= steal_time;
>> +
>> +     return delta_time;
>>  }
>
> This isn't right.
>
> If steal_time is equal to or larger than delta_time,
> get_vtime_delta needs to return 0, not delta_time.
>
> Otherwise the same time will be counted twice.

Paolo also pointed out this yesterday, so his proposal looks good to you, right?

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08 23:57     ` Wanpeng Li
@ 2016-06-09  1:20       ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2016-06-09  1:20 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Wanpeng Li, Ingo Molnar,
	Peter Zijlstra (Intel),
	Thomas Gleixner, Frederic Weisbecker, Paolo Bonzini,
	Radim Krčmář

[-- Attachment #1: Type: text/plain, Size: 1601 bytes --]

On Thu, 2016-06-09 at 07:57 +0800, Wanpeng Li wrote:
> 2016-06-09 3:05 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > 
> > On Wed, 2016-06-08 at 11:05 +0800, Wanpeng Li wrote:
> > > 
> > > 
> > > @@ -681,12 +681,17 @@ static cputime_t vtime_delta(struct
> > > task_struct
> > > *tsk)
> > >  static cputime_t get_vtime_delta(struct task_struct *tsk)
> > >  {
> > >       unsigned long now = READ_ONCE(jiffies);
> > > -     unsigned long delta = now - tsk->vtime_snap;
> > > +     cputime_t delta_time, steal_time;
> > > 
> > > +     steal_time =
> > > jiffies_to_cputime(steal_account_process_tick());
> > > +     delta_time = jiffies_to_cputime(now - tsk->vtime_snap);
> > >       WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
> > >       tsk->vtime_snap = now;
> > > 
> > > -     return jiffies_to_cputime(delta);
> > > +     if (steal_time < delta_time)
> > > +             delta_time -= steal_time;
> > > +
> > > +     return delta_time;
> > >  }
> > This isn't right.
> > 
> > If steal_time is equal to or larger than delta_time,
> > get_vtime_delta needs to return 0, not delta_time.
> > 
> > Otherwise the same time will be counted twice.
> Paolo also pointed out this yesterday, so his proposal looks good to
> you, right?
> 
Yes it does.

I can build the irqtime rework on top of your patches,
taking irq and softirq time out of the vtime delta as
well.

With Paolo's proposal, no time will ever be accounted
double, which is a good thing.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-08 10:14   ` Paolo Bonzini
  2016-06-08 11:11     ` Wanpeng Li
@ 2016-06-13  3:38     ` Wanpeng Li
  2016-06-13  7:55       ` Paolo Bonzini
  1 sibling, 1 reply; 18+ messages in thread
From: Wanpeng Li @ 2016-06-13  3:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Wanpeng Li, Ingo Molnar,
	Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Radim Krčmář

2016-06-08 18:14 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 08/06/2016 05:05, Wanpeng Li wrote:
>> From: Wanpeng Li <wanpeng.li@hotmail.com>
>>
>> This patch adds guest steal-time support to full dynticks CPU
>> time accounting. After the following commit:
>>
>> ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")
>>
>> ... time sampling became jiffy based, even if it's still listened
>> to ring boundaries, so steal_account_process_tick() is reused
>> to account how many 'ticks' are stolen-time, after the last accumulation.
>>
>> Suggested-by: Rik van Riel <riel@redhat.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Frederic Weisbecker <fweisbec@gmail.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
>> ---
>> v4 -> v5:
>>  * apply same logic to account_idle_time, so change get_vtime_delta instead
>> v3 -> v4:
>>  * fix grammar errors, thanks Ingo
>>  * cleanup fragile codes, thanks Ingo
>> v2 -> v3:
>>  * convert steal time jiffies to cputime
>> v1 -> v2:
>>  * fix divide zero bug, thanks Rik
>>
>>  kernel/sched/cputime.c | 13 +++++++++----
>>  1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> index 75f98c5..b62f9f8 100644
>> --- a/kernel/sched/cputime.c
>> +++ b/kernel/sched/cputime.c
>> @@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
>>               cpustat[CPUTIME_IDLE] += (__force u64) cputime;
>>  }
>>
>> -static __always_inline bool steal_account_process_tick(void)
>> +static __always_inline unsigned long steal_account_process_tick(void)
>>  {
>>  #ifdef CONFIG_PARAVIRT
>>       if (static_key_false(&paravirt_steal_enabled)) {
>> @@ -279,7 +279,7 @@ static __always_inline bool steal_account_process_tick(void)
>>               return steal_jiffies;
>>       }
>>  #endif
>> -     return false;
>> +     return 0;
>>  }
>>
>>  /*
>> @@ -681,12 +681,17 @@ static cputime_t vtime_delta(struct task_struct *tsk)
>>  static cputime_t get_vtime_delta(struct task_struct *tsk)
>>  {
>>       unsigned long now = READ_ONCE(jiffies);
>> -     unsigned long delta = now - tsk->vtime_snap;
>> +     cputime_t delta_time, steal_time;
>>
>> +     steal_time = jiffies_to_cputime(steal_account_process_tick());
>> +     delta_time = jiffies_to_cputime(now - tsk->vtime_snap);
>>       WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
>>       tsk->vtime_snap = now;
>>
>> -     return jiffies_to_cputime(delta);
>> +     if (steal_time < delta_time)
>> +             delta_time -= steal_time;
>> +
>> +     return delta_time;
>
> I think this is wrong.  If you get more steal time than delta time
> (which as Rik noticed can happen due to partial jiffies), you will end
> up accounting things twice, once in steal_account_process_tick and once
> here.  In other words you'll get the exact bug you're trying to fix.
>
> The right thing is to add a max_jiffies argument to
> steal_account_process_tick.  steal_account_process_tick will not attempt
> to remove more than max_jiffies.  Here you pass delta_jiffies (i.e. now
> - tsk->vtime_snap) to steal_account_process_tick, existing callers can
> pass ULONG_MAX.  You can then
>
>         return jiffies_to_cputime(delta_jiffies - steal_jiffies);
>
> in get_vtime_delta and not worry about underflow.

Do you mean something like below, actually I see delta_jiffies <
steal_jiffies sometimes.

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..a7606a9 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -257,7 +257,7 @@ void account_idle_time(cputime_t cputime)
  cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }

-static __always_inline bool steal_account_process_tick(void)
+static __always_inline unsigned long
steal_account_process_tick(unsigned long max_jiffies)
 {
 #ifdef CONFIG_PARAVIRT
  if (static_key_false(&paravirt_steal_enabled)) {
@@ -272,14 +272,14 @@ static __always_inline bool
steal_account_process_tick(void)
  * time in jiffies. Lets cast the result to jiffies
  * granularity and account the rest on the next rounds.
  */
- steal_jiffies = nsecs_to_jiffies(steal);
+ steal_jiffies = min(nsecs_to_jiffies(steal), max_jiffies);
  this_rq()->prev_steal_time += jiffies_to_nsecs(steal_jiffies);

  account_steal_time(jiffies_to_cputime(steal_jiffies));
  return steal_jiffies;
  }
 #endif
- return false;
+ return 0;
 }

 /*
@@ -346,7 +346,7 @@ static void irqtime_account_process_tick(struct
task_struct *p, int user_tick,
  u64 cputime = (__force u64) cputime_one_jiffy;
  u64 *cpustat = kcpustat_this_cpu->cpustat;

- if (steal_account_process_tick())
+ if (steal_account_process_tick(ULONG_MAX))
  return;

  cputime *= ticks;
@@ -477,7 +477,7 @@ void account_process_tick(struct task_struct *p,
int user_tick)
  return;
  }

- if (steal_account_process_tick())
+ if (steal_account_process_tick(ULONG_MAX))
  return;

  if (user_tick)
@@ -681,12 +681,14 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
  unsigned long now = READ_ONCE(jiffies);
- unsigned long delta = now - tsk->vtime_snap;
+ unsigned long delta_jiffies, steal_jiffies;

+ delta_jiffies = jiffies_to_cputime(now - tsk->vtime_snap);
+ steal_jiffies = jiffies_to_cputime(steal_account_process_tick(delta_jiffies));
  WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
  tsk->vtime_snap = now;

- return jiffies_to_cputime(delta);
+ return jiffies_to_cputime(delta_jiffies - steal_jiffies);
 }

 static void __vtime_account_system(struct task_struct *tsk)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-13  3:38     ` Wanpeng Li
@ 2016-06-13  7:55       ` Paolo Bonzini
  2016-06-13  8:09         ` Wanpeng Li
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2016-06-13  7:55 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Wanpeng Li, Ingo Molnar,
	Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Radim Krčmář



On 13/06/2016 05:38, Wanpeng Li wrote:
> + delta_jiffies = jiffies_to_cputime(now - tsk->vtime_snap);
> + steal_jiffies = jiffies_to_cputime(steal_account_process_tick(delta_jiffies));

Without jiffies_to_cputime here.  Apart from this, yes, this is what I
meant.

Paolo

>   WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
>   tsk->vtime_snap = now;
> 
> - return jiffies_to_cputime(delta);
> + return jiffies_to_cputime(delta_jiffies - steal_jiffies);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting
  2016-06-13  7:55       ` Paolo Bonzini
@ 2016-06-13  8:09         ` Wanpeng Li
  0 siblings, 0 replies; 18+ messages in thread
From: Wanpeng Li @ 2016-06-13  8:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Wanpeng Li, Ingo Molnar,
	Peter Zijlstra (Intel),
	Rik van Riel, Thomas Gleixner, Frederic Weisbecker,
	Radim Krčmář

2016-06-13 15:55 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>
>
> On 13/06/2016 05:38, Wanpeng Li wrote:
>> + delta_jiffies = jiffies_to_cputime(now - tsk->vtime_snap);
>> + steal_jiffies = jiffies_to_cputime(steal_account_process_tick(delta_jiffies));
>
> Without jiffies_to_cputime here.  Apart from this, yes, this is what I
> meant.

You are right, I miss that. :)

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-06-13  8:09 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-08  3:05 [PATCH v5 1/3] KVM: fix steal clock warp during guest cpu hotplug Wanpeng Li
2016-06-08  3:05 ` [PATCH v5 2/3] sched/cputime: Fix prev steal time accouting during " Wanpeng Li
2016-06-08 10:01   ` Paolo Bonzini
2016-06-08 10:59     ` Wanpeng Li
2016-06-08  3:05 ` [PATCH v5 3/3] sched/cputime: Add steal time support to full dynticks CPU time accounting Wanpeng Li
2016-06-08  7:22   ` Ingo Molnar
2016-06-08  7:27     ` Wanpeng Li
2016-06-08  7:52       ` Ingo Molnar
2016-06-08  8:04         ` Wanpeng Li
2016-06-08  8:12           ` Wanpeng Li
2016-06-08 10:14   ` Paolo Bonzini
2016-06-08 11:11     ` Wanpeng Li
2016-06-13  3:38     ` Wanpeng Li
2016-06-13  7:55       ` Paolo Bonzini
2016-06-13  8:09         ` Wanpeng Li
2016-06-08 19:05   ` Rik van Riel
2016-06-08 23:57     ` Wanpeng Li
2016-06-09  1:20       ` Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).