linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Wanpeng Li <kernellwp@gmail.com>
To: Mike Galbraith <efault@gmx.de>
Cc: Rik van Riel <riel@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [BUG nohz]: wrong user and system time accounting
Date: Thu, 30 Mar 2017 14:47:11 +0800	[thread overview]
Message-ID: <CANRm+CzfqkU6iV1BFk3PmWvy7MOsYH1mSwejJXMkvWW9C5ngwg@mail.gmail.com> (raw)
In-Reply-To: <1490848051.4167.57.camel@gmx.de>

Cc Peterz, Thomas,
2017-03-30 12:27 GMT+08:00 Mike Galbraith <efault@gmx.de>:
> On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
>
>> In other words, the tick on cpu0 is aligned
>> with the tick on the nohz_full cpus, and
>> jiffies is advanced while the nohz_full cpus
>> with an active tick happen to be in kernel
>> mode?
>
> You really want skew_tick=1, especially on big boxen.
>
>> Frederic, can you think of any reason why
>> the tick on nohz_full CPUs would end up aligned
>> with the tick on cpu0, instead of running at some
>> random offset?
>
> (I or low rq->clock bits as crude NOHZ collision avoidance)
>
>> A random offset, or better yet a somewhat randomized
>> tick length to make sure that simultaneous ticks are
>> fairly rare and the vtime sampling does not end up
>> "in phase" with the jiffies incrementing, could make
>> the accounting work right again.
>
> That improves jitter, especially on big boxen.  I have an 8 socket box
> that thinks it's an extra large PC, there, collision avoidance matters
> hugely.  I couldn't reproduce bean counting woes, no idea if collision
> avoidance will help that.

So I implement two methods, one is from Rik's random offset proposal
through skew tick, the other one is from Frederic's proposal and it is
the same as my original idea through use nanosecond granularity to
check deltas but only perform an actual cputime update when that delta
>= TICK_NSEC. Both methods can solve the bug which Luiz reported.
Peterz, Thomas, any ideas?

--------------------------->8-------------------------------------------------------------

skew tick:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7fe53be..9981437 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void)
     hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());

     /* Offset the tick to avert jiffies_lock contention. */
+#ifdef CONFIG_NO_HZ_FULL
+    if (sched_skew_tick || tick_nohz_full_running) {
+#else
     if (sched_skew_tick) {
+#endif
         u64 offset = ktime_to_ns(tick_period) >> 1;
         do_div(offset, num_possible_cpus());
         offset *= smp_processor_id();

-------------------------------------->8-----------------------------------------------------

use nanosecond granularity to check deltas but only perform an actual
cputime update when that delta >= TICK_NSEC.

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f3778e2b..f1ee393 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct task_struct *tsk)
 {
-    unsigned long now = READ_ONCE(jiffies);
+    u64 now = local_clock();
+    u64 delta;
+
+    delta = now - tsk->vtime_snap;

-    if (time_before(now, (unsigned long)tsk->vtime_snap))
+    if (delta < TICK_NSEC)
         return 0;

-    return jiffies_to_nsecs(now - tsk->vtime_snap);
+    return jiffies_to_nsecs(delta / TICK_NSEC);
 }

 static u64 get_vtime_delta(struct task_struct *tsk)
 {
-    unsigned long now = READ_ONCE(jiffies);
-    u64 delta, other;
+    u64 delta = vtime_delta(tsk);
+    u64 other;

     /*
      * Unlike tick based timing, vtime based timing never has lost
@@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk)
      * elapsed time. Limit account_other_time to prevent rounding
      * errors from causing elapsed vtime to go negative.
      */
-    delta = jiffies_to_nsecs(now - tsk->vtime_snap);
     other = account_other_time(delta);
     WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-    tsk->vtime_snap = now;
+    tsk->vtime_snap += delta;

     return delta - other;
 }
@@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev)

     write_seqcount_begin(&current->vtime_seqcount);
     current->vtime_snap_whence = VTIME_SYS;
-    current->vtime_snap = jiffies;
+    current->vtime_snap = sched_clock_cpu(smp_processor_id());
     write_seqcount_end(&current->vtime_seqcount);
 }

@@ -787,7 +789,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
     local_irq_save(flags);
     write_seqcount_begin(&t->vtime_seqcount);
     t->vtime_snap_whence = VTIME_SYS;
-    t->vtime_snap = jiffies;
+    t->vtime_snap = sched_clock_cpu(cpu);
     write_seqcount_end(&t->vtime_seqcount);
     local_irq_restore(flags);
 }

Regards,
Wanpeng Li

  reply	other threads:[~2017-03-30  6:47 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-23 20:55 [BUG nohz]: wrong user and system time accounting Luiz Capitulino
2017-03-24  0:56 ` Rik van Riel
2017-03-24  1:05   ` Luiz Capitulino
2017-03-24  1:08     ` Rik van Riel
2017-03-24  1:39       ` Luiz Capitulino
2017-03-27  5:33   ` lkml
2017-03-24  1:52 ` Wanpeng Li
2017-03-24  3:56   ` Luiz Capitulino
2017-03-27  1:56 ` Wanpeng Li
2017-03-27 17:35   ` Rik van Riel
2017-03-28  7:19     ` Wanpeng Li
     [not found]     ` <20170328132406.7d23579c@redhat.com>
     [not found]       ` <20170328161454.4a5d9e8b@redhat.com>
2017-03-28 21:01         ` Rik van Riel
2017-03-28 21:26           ` Luiz Capitulino
2017-03-29  9:56             ` Wanpeng Li
2017-03-29 12:56               ` Frederic Weisbecker
2017-03-28 21:24         ` Rik van Riel
2017-03-28 21:30           ` Luiz Capitulino
     [not found]       ` <20170329131656.1d6cb743@redhat.com>
2017-03-29 20:08         ` Rik van Riel
2017-03-29 22:54           ` Frederic Weisbecker
2017-03-30 12:57             ` Rik van Riel
2017-03-30  1:58           ` Wanpeng Li
2017-03-30 12:40             ` Frederic Weisbecker
2017-03-30 13:19               ` Mike Galbraith
2017-03-30  4:27           ` Mike Galbraith
2017-03-30  6:47             ` Wanpeng Li [this message]
2017-03-30 11:52               ` Wanpeng Li
2017-03-30 12:33                 ` Mike Galbraith
2017-03-30 13:38               ` Frederic Weisbecker
2017-03-30 13:59                 ` Wanpeng Li
2017-03-30 14:18                   ` Frederic Weisbecker
2017-03-30 21:25                     ` Luiz Capitulino
2017-03-31 20:09                       ` Luiz Capitulino
2017-03-31 23:24                         ` Frederic Weisbecker
2017-04-01  3:11                           ` Luiz Capitulino
2017-04-03 15:23                             ` Frederic Weisbecker
2017-04-03 19:06                               ` Luiz Capitulino
2017-04-04 17:36                                 ` Luiz Capitulino
2017-04-05 14:26                                   ` Rik van Riel
2017-04-11 11:03                 ` Wanpeng Li
2017-04-11 11:36                   ` Peter Zijlstra
2017-04-11 11:43                     ` Wanpeng Li
2017-04-11 14:22               ` Thomas Gleixner
2017-04-12 13:18                 ` Frederic Weisbecker
2017-04-12 14:57                   ` Thomas Gleixner
2017-04-12 15:14                     ` Frederic Weisbecker
2017-04-13  4:31                     ` Wanpeng Li
2017-04-13 13:32                       ` Frederic Weisbecker
2017-05-02 10:01                         ` Wanpeng Li
2017-05-15  8:17                           ` Wanpeng Li
2017-06-29 17:22                             ` Frederic Weisbecker
2017-03-30 12:51             ` Frederic Weisbecker
2017-03-30 13:02               ` Rik van Riel
2017-03-30 13:35                 ` Mike Galbraith
2017-04-03 14:40                   ` Frederic Weisbecker
2017-04-04  7:32                     ` Mike Galbraith
2017-03-30 13:44                 ` Frederic Weisbecker
     [not found]         ` <20170329221700.GB23895@lerouge>
2017-03-29 22:46           ` Wanpeng Li
2017-03-30  2:14             ` Luiz Capitulino
2017-03-30 12:27               ` Wanpeng Li
2017-03-27 18:38   ` Luiz Capitulino
2017-03-28  5:28     ` Wanpeng Li
2017-03-28 13:44       ` Luiz Capitulino
2017-03-29 13:04 ` Frederic Weisbecker
2017-03-29 13:14   ` Rik van Riel
2017-03-29 13:23     ` Luiz Capitulino
2017-03-29 21:12       ` Frederic Weisbecker
2017-03-30  1:48         ` Luiz Capitulino

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANRm+CzfqkU6iV1BFk3PmWvy7MOsYH1mSwejJXMkvWW9C5ngwg@mail.gmail.com \
    --to=kernellwp@gmail.com \
    --cc=efault@gmx.de \
    --cc=fweisbec@gmail.com \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).