All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wanpeng Li <kernellwp@gmail.com>
To: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [BUG nohz]: wrong user and system time accounting
Date: Thu, 30 Mar 2017 19:52:54 +0800	[thread overview]
Message-ID: <CANRm+Cz_QXzN9EM1V7YHZhhOsqc2OgujG2veh4HU_Sr+9-9M0Q@mail.gmail.com> (raw)
In-Reply-To: <CANRm+CzfqkU6iV1BFk3PmWvy7MOsYH1mSwejJXMkvWW9C5ngwg@mail.gmail.com>

2017-03-30 14:47 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> Cc Peterz, Thomas,
> 2017-03-30 12:27 GMT+08:00 Mike Galbraith <efault@gmx.de>:
>> On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
>>
>>> In other words, the tick on cpu0 is aligned
>>> with the tick on the nohz_full cpus, and
>>> jiffies is advanced while the nohz_full cpus
>>> with an active tick happen to be in kernel
>>> mode?
>>
>> You really want skew_tick=1, especially on big boxen.
>>
>>> Frederic, can you think of any reason why
>>> the tick on nohz_full CPUs would end up aligned
>>> with the tick on cpu0, instead of running at some
>>> random offset?
>>
>> (I or low rq->clock bits as crude NOHZ collision avoidance)
>>
>>> A random offset, or better yet a somewhat randomized
>>> tick length to make sure that simultaneous ticks are
>>> fairly rare and the vtime sampling does not end up
>>> "in phase" with the jiffies incrementing, could make
>>> the accounting work right again.
>>
>> That improves jitter, especially on big boxen.  I have an 8 socket box
>> that thinks it's an extra large PC, there, collision avoidance matters
>> hugely.  I couldn't reproduce bean counting woes, no idea if collision
>> avoidance will help that.
>
> So I implement two methods, one is from Rik's random offset proposal

If we should just add random offset to the cpu in the nohz_full mode?

> through skew tick, the other one is from Frederic's proposal and it is
> the same as my original idea through use nanosecond granularity to
> check deltas but only perform an actual cputime update when that delta
>>= TICK_NSEC. Both methods can solve the bug which Luiz reported.

This can just solves two cpu hogs running on the cpu in nohz_full
mode. However, Luiz's testcase w/ ./acct-bug 1 995 shows idle 100%.

Regards,
Wanpeng Li

> Peterz, Thomas, any ideas?
>
> --------------------------->8-------------------------------------------------------------
>
> skew tick:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..9981437 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void)
>      hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>
>      /* Offset the tick to avert jiffies_lock contention. */
> +#ifdef CONFIG_NO_HZ_FULL
> +    if (sched_skew_tick || tick_nohz_full_running) {
> +#else
>      if (sched_skew_tick) {
> +#endif
>          u64 offset = ktime_to_ns(tick_period) >> 1;
>          do_div(offset, num_possible_cpus());
>          offset *= smp_processor_id();
>
> -------------------------------------->8-----------------------------------------------------
>
> use nanosecond granularity to check deltas but only perform an actual
> cputime update when that delta >= TICK_NSEC.
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f1ee393 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static u64 vtime_delta(struct task_struct *tsk)
>  {
> -    unsigned long now = READ_ONCE(jiffies);
> +    u64 now = local_clock();
> +    u64 delta;
> +
> +    delta = now - tsk->vtime_snap;
>
> -    if (time_before(now, (unsigned long)tsk->vtime_snap))
> +    if (delta < TICK_NSEC)
>          return 0;
>
> -    return jiffies_to_nsecs(now - tsk->vtime_snap);
> +    return jiffies_to_nsecs(delta / TICK_NSEC);
>  }
>
>  static u64 get_vtime_delta(struct task_struct *tsk)
>  {
> -    unsigned long now = READ_ONCE(jiffies);
> -    u64 delta, other;
> +    u64 delta = vtime_delta(tsk);
> +    u64 other;
>
>      /*
>       * Unlike tick based timing, vtime based timing never has lost
> @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk)
>       * elapsed time. Limit account_other_time to prevent rounding
>       * errors from causing elapsed vtime to go negative.
>       */
> -    delta = jiffies_to_nsecs(now - tsk->vtime_snap);
>      other = account_other_time(delta);
>      WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
> -    tsk->vtime_snap = now;
> +    tsk->vtime_snap += delta;
>
>      return delta - other;
>  }
> @@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
>
>      write_seqcount_begin(&current->vtime_seqcount);
>      current->vtime_snap_whence = VTIME_SYS;
> -    current->vtime_snap = jiffies;
> +    current->vtime_snap = sched_clock_cpu(smp_processor_id());
>      write_seqcount_end(&current->vtime_seqcount);
>  }
>
> @@ -787,7 +789,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
>      local_irq_save(flags);
>      write_seqcount_begin(&t->vtime_seqcount);
>      t->vtime_snap_whence = VTIME_SYS;
> -    t->vtime_snap = jiffies;
> +    t->vtime_snap = sched_clock_cpu(cpu);
>      write_seqcount_end(&t->vtime_seqcount);
>      local_irq_restore(flags);
>  }
>
> Regards,
> Wanpeng Li

  reply	other threads:[~2017-03-30 11:52 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-23 20:55 [BUG nohz]: wrong user and system time accounting Luiz Capitulino
2017-03-24  0:56 ` Rik van Riel
2017-03-24  1:05   ` Luiz Capitulino
2017-03-24  1:08     ` Rik van Riel
2017-03-24  1:39       ` Luiz Capitulino
2017-03-27  5:33   ` lkml
2017-03-24  1:52 ` Wanpeng Li
2017-03-24  3:56   ` Luiz Capitulino
2017-03-27  1:56 ` Wanpeng Li
2017-03-27 17:35   ` Rik van Riel
2017-03-28  7:19     ` Wanpeng Li
     [not found]     ` <20170328132406.7d23579c@redhat.com>
     [not found]       ` <20170328161454.4a5d9e8b@redhat.com>
2017-03-28 21:01         ` Rik van Riel
2017-03-28 21:26           ` Luiz Capitulino
2017-03-29  9:56             ` Wanpeng Li
2017-03-29 12:56               ` Frederic Weisbecker
2017-03-28 21:24         ` Rik van Riel
2017-03-28 21:30           ` Luiz Capitulino
     [not found]       ` <20170329131656.1d6cb743@redhat.com>
2017-03-29 20:08         ` Rik van Riel
2017-03-29 22:54           ` Frederic Weisbecker
2017-03-30 12:57             ` Rik van Riel
2017-03-30  1:58           ` Wanpeng Li
2017-03-30 12:40             ` Frederic Weisbecker
2017-03-30 13:19               ` Mike Galbraith
2017-03-30  4:27           ` Mike Galbraith
2017-03-30  6:47             ` Wanpeng Li
2017-03-30 11:52               ` Wanpeng Li [this message]
2017-03-30 12:33                 ` Mike Galbraith
2017-03-30 13:38               ` Frederic Weisbecker
2017-03-30 13:59                 ` Wanpeng Li
2017-03-30 14:18                   ` Frederic Weisbecker
2017-03-30 21:25                     ` Luiz Capitulino
2017-03-31 20:09                       ` Luiz Capitulino
2017-03-31 23:24                         ` Frederic Weisbecker
2017-04-01  3:11                           ` Luiz Capitulino
2017-04-03 15:23                             ` Frederic Weisbecker
2017-04-03 19:06                               ` Luiz Capitulino
2017-04-04 17:36                                 ` Luiz Capitulino
2017-04-05 14:26                                   ` Rik van Riel
2017-04-11 11:03                 ` Wanpeng Li
2017-04-11 11:36                   ` Peter Zijlstra
2017-04-11 11:43                     ` Wanpeng Li
2017-04-11 14:22               ` Thomas Gleixner
2017-04-12 13:18                 ` Frederic Weisbecker
2017-04-12 14:57                   ` Thomas Gleixner
2017-04-12 15:14                     ` Frederic Weisbecker
2017-04-13  4:31                     ` Wanpeng Li
2017-04-13 13:32                       ` Frederic Weisbecker
2017-05-02 10:01                         ` Wanpeng Li
2017-05-15  8:17                           ` Wanpeng Li
2017-06-29 17:22                             ` Frederic Weisbecker
2017-03-30 12:51             ` Frederic Weisbecker
2017-03-30 13:02               ` Rik van Riel
2017-03-30 13:35                 ` Mike Galbraith
2017-04-03 14:40                   ` Frederic Weisbecker
2017-04-04  7:32                     ` Mike Galbraith
2017-03-30 13:44                 ` Frederic Weisbecker
     [not found]         ` <20170329221700.GB23895@lerouge>
2017-03-29 22:46           ` Wanpeng Li
2017-03-30  2:14             ` Luiz Capitulino
2017-03-30 12:27               ` Wanpeng Li
2017-03-27 18:38   ` Luiz Capitulino
2017-03-28  5:28     ` Wanpeng Li
2017-03-28 13:44       ` Luiz Capitulino
2017-03-29 13:04 ` Frederic Weisbecker
2017-03-29 13:14   ` Rik van Riel
2017-03-29 13:23     ` Luiz Capitulino
2017-03-29 21:12       ` Frederic Weisbecker
2017-03-30  1:48         ` Luiz Capitulino

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANRm+Cz_QXzN9EM1V7YHZhhOsqc2OgujG2veh4HU_Sr+9-9M0Q@mail.gmail.com \
    --to=kernellwp@gmail.com \
    --cc=efault@gmx.de \
    --cc=fweisbec@gmail.com \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.