linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Frederic Weisbecker <fweisbec@gmail.com>
To: Wanpeng Li <kernellwp@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Mike Galbraith <efault@gmx.de>, Rik van Riel <riel@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [BUG nohz]: wrong user and system time accounting
Date: Thu, 13 Apr 2017 15:32:57 +0200	[thread overview]
Message-ID: <20170413133255.GA2608@lerouge> (raw)
In-Reply-To: <CANRm+Cw8g_4J_Mo-P8ZOMST3+Uh9eOn6o2T020BYt5XYBTjxSA@mail.gmail.com>

On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote:
> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner <tglx@linutronix.de>:
> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> >> > It's not different from the current jiffies based stuff at all. Same
> >> > failure mode.
> >>
> >> Yes you're right, I got confused again. So to fix this we could do our snapshots
> >> at a frequency lower than HZ but still high enough to avoid overhead.
> >>
> >> Something like TICK_NSEC / 2 ?
> >
> > If you are using TSC anyway then you can do proper accumulation for both
> > system and user and only account the data when the accumulation is more
> > than a jiffie.
> 
> So I implement it as below:
> 
> - HZ=1000.
>   1) two cpu hogs on cpu in nohz_full mode, 100% user time
>   2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
> - HZ=250
>    1) two cpu hogs on cpu in nohz_full mode, 100% user time
>    2) Luzi's testcase, 100% idle
> 
> So the codes below still not work correctly for HZ=250, any suggestions?

Right, so first lets reorder that code a bit so we can see clear inside :-)

> 
> -------------------------------------->8-----------------------------------------------------
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..6a11771 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -668,6 +668,8 @@ struct task_struct {
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>      seqcount_t            vtime_seqcount;
>      unsigned long long        vtime_snap;
> +    u64                vtime_acct_utime;
> +    u64                vtime_acct_stime;

You need to accumulate guest and steal time as well.

>      enum {
>          /* Task is sleeping or running in a CPU with VTIME inactive: */
>          VTIME_INACTIVE = 0,
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f8e54ba 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
> 
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> -static u64 vtime_delta(struct task_struct *tsk)
> +static u64 vtime_delta(struct task_struct *tsk, bool user)
>  {
> -    unsigned long now = READ_ONCE(jiffies);
> +    u64 delta, ret = 0;
> 
> -    if (time_before(now, (unsigned long)tsk->vtime_snap))
> -        return 0;
> +    delta = sched_clock() - tsk->vtime_snap;
> 
> -    return jiffies_to_nsecs(now - tsk->vtime_snap);
> +    if (is_idle_task(tsk)) {
> +        if (delta >= TICK_NSEC)
> +            ret = delta;
> +    } else {
> +        if (user) {
> +            tsk->vtime_acct_utime += delta;
> +            if (tsk->vtime_acct_utime >= TICK_NSEC)
> +                ret = tsk->vtime_acct_utime;
> +        } else {
> +            tsk->vtime_acct_stime += delta;
> +            if (tsk->vtime_acct_utime >= TICK_NSEC)
> +                ret = tsk->vtime_acct_stime;
> +        }

We already have vtime_account_idle, vtime_account_user, etc...
The accumulation should be done by these functions that know what and where
to account. vtime_delta() should really just return the difference against vtime_snap,
it's too low level to care about these details.

> +    }
> +
> +    return ret;
>  }
> 
> -static u64 get_vtime_delta(struct task_struct *tsk)
> +static u64 get_vtime_delta(struct task_struct *tsk, bool user)
>  {
> -    unsigned long now = READ_ONCE(jiffies);
> -    u64 delta, other;
> +    u64 delta = vtime_delta(tsk, user);
> +    u64 other;
> +
> +    if (!is_idle_task(tsk)) {
> +        if (user)
> +            tsk->vtime_acct_utime = 0;
> +        else
> +            tsk->vtime_acct_stime = 0;
> +    }

Like vtime_delta(), get_vtime_delta() shouldn't touch these accumulators.
Reset and accounting really should be done by the upper level functions
vtime_account_*()

Thanks.

  reply	other threads:[~2017-04-13 13:33 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-23 20:55 [BUG nohz]: wrong user and system time accounting Luiz Capitulino
2017-03-24  0:56 ` Rik van Riel
2017-03-24  1:05   ` Luiz Capitulino
2017-03-24  1:08     ` Rik van Riel
2017-03-24  1:39       ` Luiz Capitulino
2017-03-27  5:33   ` lkml
2017-03-24  1:52 ` Wanpeng Li
2017-03-24  3:56   ` Luiz Capitulino
2017-03-27  1:56 ` Wanpeng Li
2017-03-27 17:35   ` Rik van Riel
2017-03-28  7:19     ` Wanpeng Li
     [not found]     ` <20170328132406.7d23579c@redhat.com>
     [not found]       ` <20170328161454.4a5d9e8b@redhat.com>
2017-03-28 21:01         ` Rik van Riel
2017-03-28 21:26           ` Luiz Capitulino
2017-03-29  9:56             ` Wanpeng Li
2017-03-29 12:56               ` Frederic Weisbecker
2017-03-28 21:24         ` Rik van Riel
2017-03-28 21:30           ` Luiz Capitulino
     [not found]       ` <20170329131656.1d6cb743@redhat.com>
2017-03-29 20:08         ` Rik van Riel
2017-03-29 22:54           ` Frederic Weisbecker
2017-03-30 12:57             ` Rik van Riel
2017-03-30  1:58           ` Wanpeng Li
2017-03-30 12:40             ` Frederic Weisbecker
2017-03-30 13:19               ` Mike Galbraith
2017-03-30  4:27           ` Mike Galbraith
2017-03-30  6:47             ` Wanpeng Li
2017-03-30 11:52               ` Wanpeng Li
2017-03-30 12:33                 ` Mike Galbraith
2017-03-30 13:38               ` Frederic Weisbecker
2017-03-30 13:59                 ` Wanpeng Li
2017-03-30 14:18                   ` Frederic Weisbecker
2017-03-30 21:25                     ` Luiz Capitulino
2017-03-31 20:09                       ` Luiz Capitulino
2017-03-31 23:24                         ` Frederic Weisbecker
2017-04-01  3:11                           ` Luiz Capitulino
2017-04-03 15:23                             ` Frederic Weisbecker
2017-04-03 19:06                               ` Luiz Capitulino
2017-04-04 17:36                                 ` Luiz Capitulino
2017-04-05 14:26                                   ` Rik van Riel
2017-04-11 11:03                 ` Wanpeng Li
2017-04-11 11:36                   ` Peter Zijlstra
2017-04-11 11:43                     ` Wanpeng Li
2017-04-11 14:22               ` Thomas Gleixner
2017-04-12 13:18                 ` Frederic Weisbecker
2017-04-12 14:57                   ` Thomas Gleixner
2017-04-12 15:14                     ` Frederic Weisbecker
2017-04-13  4:31                     ` Wanpeng Li
2017-04-13 13:32                       ` Frederic Weisbecker [this message]
2017-05-02 10:01                         ` Wanpeng Li
2017-05-15  8:17                           ` Wanpeng Li
2017-06-29 17:22                             ` Frederic Weisbecker
2017-03-30 12:51             ` Frederic Weisbecker
2017-03-30 13:02               ` Rik van Riel
2017-03-30 13:35                 ` Mike Galbraith
2017-04-03 14:40                   ` Frederic Weisbecker
2017-04-04  7:32                     ` Mike Galbraith
2017-03-30 13:44                 ` Frederic Weisbecker
     [not found]         ` <20170329221700.GB23895@lerouge>
2017-03-29 22:46           ` Wanpeng Li
2017-03-30  2:14             ` Luiz Capitulino
2017-03-30 12:27               ` Wanpeng Li
2017-03-27 18:38   ` Luiz Capitulino
2017-03-28  5:28     ` Wanpeng Li
2017-03-28 13:44       ` Luiz Capitulino
2017-03-29 13:04 ` Frederic Weisbecker
2017-03-29 13:14   ` Rik van Riel
2017-03-29 13:23     ` Luiz Capitulino
2017-03-29 21:12       ` Frederic Weisbecker
2017-03-30  1:48         ` Luiz Capitulino

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170413133255.GA2608@lerouge \
    --to=fweisbec@gmail.com \
    --cc=efault@gmx.de \
    --cc=kernellwp@gmail.com \
    --cc=lcapitulino@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).