linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling
@ 2017-04-06  6:36 Wanpeng Li
  2017-04-06 14:36 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Wanpeng Li @ 2017-04-06  6:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Wanpeng Li, Frederic Weisbecker,
	Rik van Riel, Mike Galbraith, Luiz Capitulino, Thomas Gleixner

From: Wanpeng Li <wanpeng.li@hotmail.com>

This can be reproduced by setting up a CPU for nohz_full and pin two tasks 
that hog the CPU 100% of the time to that CPU, top reports 70% system time 
and 30% user time utilization sometimes. The trace like this:

   hog-11980 [015]   341.494491: function:    enter_from_user_mode <-- apic_timer_interrupt
<idle>-0     [000]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
   hog-11980 [015]   341.494492: function:    __context_tracking_exit <-- enter_from_user_mode
<idle>-0     [000]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
   hog-11980 [015]   341.494492: bprint:      vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339)
   hog-11980 [015]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
   hog-11980 [015]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
   hog-11980 [015]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
<idle>-0     [000]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
<idle>-0     [000]   341.494493: function:    tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
<idle>-0     [000]   341.494494: function:    do_timer <-- tick_do_update_jiffies64.part.14
   hog-11980 [015]   341.494494: function:    irq_exit <-- smp_apic_timer_interrupt
<idle>-0     [000]   341.494494: bprint:      do_timer: updated jiffies_64=4295008340 ticks=1
   hog-11980 [015]   341.494494: function:    __context_tracking_enter <-- prepare_exit_to_usermode
   hog-11980 [015]   341.494494: function:    vtime_user_enter <-- __context_tracking_enter
   hog-11980 [015]   341.494495: bprint:      vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339)
   hog-11980 [015]   341.494495: function:    __vtime_account_system <-- vtime_user_enter
   hog-11980 [015]   341.494495: bprint:      get_vtime_delta: vtime_snap=4295008339 now=4295008340
   hog-11980 [015]   341.494495: function:    account_system_time <-- __vtime_account_system
   hog-11980 [015]   341.494495: bprint:      account_system_time: cputime=995488
<idle>-0     [000]   341.494497: function:    irq_exit <-- smp_apic_timer_interrupt

In this trace, we see the following:

 1. On CPU15, we transition from user-space to kernel-space because
    of a timer interrupt (it's the tick)

 2. vtimer_delta() returns 0, because jiffies didn't change since the
    last accounting

 3. While CPU15 is executing in kernel-space, jiffies is updated
    by CPU0

 4. When going back to user-space, vtime_delta() returns non-zero
    and the whole time is accounted for system time (observe how
    the cputime parameter in account_system_time() is less than 1ms)


After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN 
to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to 
jiffy based sampling. So the issue can happen if all CPUs fire their ticks 
at the same time:


               CPU 0                  CPU 1
               -----                  -----
                                      exit_user() // no cputime update 
tick X         update_jiffies
                                      enter_user() // cputime update


                                      exit_user() // no cputime update
tick X+1       update_jiffies
                                      enter_user() // cputime update

The nohz_full CPU receives an interrupt at the same time the timer 
interrupt fires on the housekeeping CPU.

This patch offsets the tick to avert all ticks alignment in order 
that the vtime sampling does not end up "in phase" with the jiffies 
incrementing.

Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
Suggested-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
---
 kernel/time/tick-sched.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7fe53be..35d79f7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
 	/* Get the next period (per-CPU) */
 	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
 
-	/* Offset the tick to avert jiffies_lock contention. */
-	if (sched_skew_tick) {
+	/*
+	 * Offset the tick to avert jiffies_lock contention, and all ticks
+	 * alignment in order that the vtime sampling does not end up "in
+	 * phase" with the jiffies incrementing.
+	 */
+	if (sched_skew_tick || tick_nohz_full_enabled()) {
 		u64 offset = ktime_to_ns(tick_period) >> 1;
 		do_div(offset, num_possible_cpus());
 		offset *= smp_processor_id();
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling
  2017-04-06  6:36 [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling Wanpeng Li
@ 2017-04-06 14:36 ` Rik van Riel
  2017-04-07 16:58 ` Luiz Capitulino
  2017-04-10 15:45 ` Thomas Gleixner
  2 siblings, 0 replies; 6+ messages in thread
From: Rik van Riel @ 2017-04-06 14:36 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Wanpeng Li, Frederic Weisbecker,
	Mike Galbraith, Luiz Capitulino, Thomas Gleixner

On Wed, 2017-04-05 at 23:36 -0700, Wanpeng Li wrote:

> This patch offsets the tick to avert all ticks alignment in order 
> that the vtime sampling does not end up "in phase" with the jiffies 
> incrementing.
> 
> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> Suggested-by: Rik van Riel <riel@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

> +++ b/kernel/time/tick-sched.c
> @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
>  	/* Get the next period (per-CPU) */
>  	hrtimer_set_expires(&ts->sched_timer,
> tick_init_jiffy_update());
>  
> -	/* Offset the tick to avert jiffies_lock contention. */
> -	if (sched_skew_tick) {
> +	/*
> +	 * Offset the tick to avert jiffies_lock contention, and all
> ticks
> +	 * alignment in order that the vtime sampling does not end
> up "in
> +	 * phase" with the jiffies incrementing.
> +	 */

I feel like part of the explanation is missing from this
comment, but I am not sure how to make it better without
making it way too long :)

> +	if (sched_skew_tick || tick_nohz_full_enabled()) {
>  		u64 offset = ktime_to_ns(tick_period) >> 1;
>  		do_div(offset, num_possible_cpus());
>  		offset *= smp_processor_id();

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling
  2017-04-06  6:36 [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling Wanpeng Li
  2017-04-06 14:36 ` Rik van Riel
@ 2017-04-07 16:58 ` Luiz Capitulino
  2017-04-10 15:45 ` Thomas Gleixner
  2 siblings, 0 replies; 6+ messages in thread
From: Luiz Capitulino @ 2017-04-07 16:58 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Wanpeng Li,
	Frederic Weisbecker, Rik van Riel, Mike Galbraith,
	Thomas Gleixner

On Wed,  5 Apr 2017 23:36:24 -0700
Wanpeng Li <kernellwp@gmail.com> wrote:

> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> This can be reproduced by setting up a CPU for nohz_full and pin two tasks 
> that hog the CPU 100% of the time to that CPU, top reports 70% system time 
> and 30% user time utilization sometimes. The trace like this:
> 
>    hog-11980 [015]   341.494491: function:    enter_from_user_mode <-- apic_timer_interrupt
> <idle>-0     [000]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    __context_tracking_exit <-- enter_from_user_mode
> <idle>-0     [000]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
>    hog-11980 [015]   341.494492: bprint:      vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339)
>    hog-11980 [015]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
>    hog-11980 [015]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
> <idle>-0     [000]   341.494494: function:    do_timer <-- tick_do_update_jiffies64.part.14
>    hog-11980 [015]   341.494494: function:    irq_exit <-- smp_apic_timer_interrupt
> <idle>-0     [000]   341.494494: bprint:      do_timer: updated jiffies_64=4295008340 ticks=1
>    hog-11980 [015]   341.494494: function:    __context_tracking_enter <-- prepare_exit_to_usermode
>    hog-11980 [015]   341.494494: function:    vtime_user_enter <-- __context_tracking_enter
>    hog-11980 [015]   341.494495: bprint:      vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339)
>    hog-11980 [015]   341.494495: function:    __vtime_account_system <-- vtime_user_enter
>    hog-11980 [015]   341.494495: bprint:      get_vtime_delta: vtime_snap=4295008339 now=4295008340
>    hog-11980 [015]   341.494495: function:    account_system_time <-- __vtime_account_system
>    hog-11980 [015]   341.494495: bprint:      account_system_time: cputime=995488
> <idle>-0     [000]   341.494497: function:    irq_exit <-- smp_apic_timer_interrupt
> 
> In this trace, we see the following:
> 
>  1. On CPU15, we transition from user-space to kernel-space because
>     of a timer interrupt (it's the tick)
> 
>  2. vtimer_delta() returns 0, because jiffies didn't change since the
>     last accounting
> 
>  3. While CPU15 is executing in kernel-space, jiffies is updated
>     by CPU0
> 
>  4. When going back to user-space, vtime_delta() returns non-zero
>     and the whole time is accounted for system time (observe how
>     the cputime parameter in account_system_time() is less than 1ms)
> 
> 
> After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN 
> to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to 
> jiffy based sampling. So the issue can happen if all CPUs fire their ticks 
> at the same time:
> 
> 
>                CPU 0                  CPU 1
>                -----                  -----
>                                       exit_user() // no cputime update 
> tick X         update_jiffies
>                                       enter_user() // cputime update
> 
> 
>                                       exit_user() // no cputime update
> tick X+1       update_jiffies
>                                       enter_user() // cputime update
> 
> The nohz_full CPU receives an interrupt at the same time the timer 
> interrupt fires on the housekeeping CPU.
> 
> This patch offsets the tick to avert all ticks alignment in order 
> that the vtime sampling does not end up "in phase" with the jiffies 
> incrementing.
> 
> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> Suggested-by: Rik van Riel <riel@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>

Reviewed-and-Tested-by: Luiz Capitulino <lcapitulino@redhat.com>

Thanks for everyone who participated on tracking this down
and fixing it!

> ---
>  kernel/time/tick-sched.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..35d79f7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
>  	/* Get the next period (per-CPU) */
>  	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>  
> -	/* Offset the tick to avert jiffies_lock contention. */
> -	if (sched_skew_tick) {
> +	/*
> +	 * Offset the tick to avert jiffies_lock contention, and all ticks
> +	 * alignment in order that the vtime sampling does not end up "in
> +	 * phase" with the jiffies incrementing.
> +	 */
> +	if (sched_skew_tick || tick_nohz_full_enabled()) {
>  		u64 offset = ktime_to_ns(tick_period) >> 1;
>  		do_div(offset, num_possible_cpus());
>  		offset *= smp_processor_id();

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling
  2017-04-06  6:36 [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling Wanpeng Li
  2017-04-06 14:36 ` Rik van Riel
  2017-04-07 16:58 ` Luiz Capitulino
@ 2017-04-10 15:45 ` Thomas Gleixner
  2017-04-10 18:01   ` Frederic Weisbecker
  2 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2017-04-10 15:45 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Wanpeng Li,
	Frederic Weisbecker, Rik van Riel, Mike Galbraith,
	Luiz Capitulino

On Wed, 5 Apr 2017, Wanpeng Li wrote:
> +	/*
> +	 * Offset the tick to avert jiffies_lock contention, and all ticks
> +	 * alignment in order that the vtime sampling does not end up "in
> +	 * phase" with the jiffies incrementing.
> +	 */
> +	if (sched_skew_tick || tick_nohz_full_enabled()) {
>  		u64 offset = ktime_to_ns(tick_period) >> 1;
>  		do_div(offset, num_possible_cpus());
>  		offset *= smp_processor_id();

That's not a fix, that's just papering over the problem.

       offset = 1ms / 2 = 500us = 500000ns;
       offset /= 144 = 3472ns

So CPU0 and CPU1 ticks are ~3 microseconds apart. That merily reduces the
probability of the issue, but does not prevent it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling
  2017-04-10 15:45 ` Thomas Gleixner
@ 2017-04-10 18:01   ` Frederic Weisbecker
  2017-04-10 21:51     ` Wanpeng Li
  0 siblings, 1 reply; 6+ messages in thread
From: Frederic Weisbecker @ 2017-04-10 18:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Wanpeng Li, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Wanpeng Li, Rik van Riel, Mike Galbraith, Luiz Capitulino

On Mon, Apr 10, 2017 at 05:45:56PM +0200, Thomas Gleixner wrote:
> On Wed, 5 Apr 2017, Wanpeng Li wrote:
> > +	/*
> > +	 * Offset the tick to avert jiffies_lock contention, and all ticks
> > +	 * alignment in order that the vtime sampling does not end up "in
> > +	 * phase" with the jiffies incrementing.
> > +	 */
> > +	if (sched_skew_tick || tick_nohz_full_enabled()) {
> >  		u64 offset = ktime_to_ns(tick_period) >> 1;
> >  		do_div(offset, num_possible_cpus());
> >  		offset *= smp_processor_id();
> 
> That's not a fix, that's just papering over the problem.
> 
>        offset = 1ms / 2 = 500us = 500000ns;
>        offset /= 144 = 3472ns
> 
> So CPU0 and CPU1 ticks are ~3 microseconds apart. That merily reduces the
> probability of the issue, but does not prevent it.

I worried about it but didn't realize it could be that tight.

So the alternative is the solution involving sched_clock() as the source for
cputime. Wanpeng Li could you please resubmit your patch that does that?

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling
  2017-04-10 18:01   ` Frederic Weisbecker
@ 2017-04-10 21:51     ` Wanpeng Li
  0 siblings, 0 replies; 6+ messages in thread
From: Wanpeng Li @ 2017-04-10 21:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Wanpeng Li, Rik van Riel, Mike Galbraith, Luiz Capitulino

2017-04-11 2:01 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.com>:
> On Mon, Apr 10, 2017 at 05:45:56PM +0200, Thomas Gleixner wrote:
>> On Wed, 5 Apr 2017, Wanpeng Li wrote:
>> > +   /*
>> > +    * Offset the tick to avert jiffies_lock contention, and all ticks
>> > +    * alignment in order that the vtime sampling does not end up "in
>> > +    * phase" with the jiffies incrementing.
>> > +    */
>> > +   if (sched_skew_tick || tick_nohz_full_enabled()) {
>> >             u64 offset = ktime_to_ns(tick_period) >> 1;
>> >             do_div(offset, num_possible_cpus());
>> >             offset *= smp_processor_id();
>>
>> That's not a fix, that's just papering over the problem.
>>
>>        offset = 1ms / 2 = 500us = 500000ns;
>>        offset /= 144 = 3472ns
>>
>> So CPU0 and CPU1 ticks are ~3 microseconds apart. That merily reduces the
>> probability of the issue, but does not prevent it.
>
> I worried about it but didn't realize it could be that tight.
>
> So the alternative is the solution involving sched_clock() as the source for
> cputime. Wanpeng Li could you please resubmit your patch that does that?

Yeah, I will try it.

Regards,
Wanpeng Li

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-04-10 21:59 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-06  6:36 [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling Wanpeng Li
2017-04-06 14:36 ` Rik van Riel
2017-04-07 16:58 ` Luiz Capitulino
2017-04-10 15:45 ` Thomas Gleixner
2017-04-10 18:01   ` Frederic Weisbecker
2017-04-10 21:51     ` Wanpeng Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).