From: Frederic Weisbecker <fweisbec@gmail.com>
To: Wanpeng Li <kernellwp@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Mike Galbraith <efault@gmx.de>, Rik van Riel <riel@redhat.com>,
Luiz Capitulino <lcapitulino@redhat.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [BUG nohz]: wrong user and system time accounting
Date: Thu, 13 Apr 2017 15:32:57 +0200 [thread overview]
Message-ID: <20170413133255.GA2608@lerouge> (raw)
In-Reply-To: <CANRm+Cw8g_4J_Mo-P8ZOMST3+Uh9eOn6o2T020BYt5XYBTjxSA@mail.gmail.com>
On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote:
> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner <tglx@linutronix.de>:
> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> >> > It's not different from the current jiffies based stuff at all. Same
> >> > failure mode.
> >>
> >> Yes you're right, I got confused again. So to fix this we could do our snapshots
> >> at a frequency lower than HZ but still high enough to avoid overhead.
> >>
> >> Something like TICK_NSEC / 2 ?
> >
> > If you are using TSC anyway then you can do proper accumulation for both
> > system and user and only account the data when the accumulation is more
> > than a jiffie.
>
> So I implement it as below:
>
> - HZ=1000.
> 1) two cpu hogs on cpu in nohz_full mode, 100% user time
> 2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
> - HZ=250
> 1) two cpu hogs on cpu in nohz_full mode, 100% user time
> 2) Luzi's testcase, 100% idle
>
> So the codes below still not work correctly for HZ=250, any suggestions?
Right, so first lets reorder that code a bit so we can see clear inside :-)
>
> -------------------------------------->8-----------------------------------------------------
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..6a11771 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -668,6 +668,8 @@ struct task_struct {
> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> seqcount_t vtime_seqcount;
> unsigned long long vtime_snap;
> + u64 vtime_acct_utime;
> + u64 vtime_acct_stime;
You need to accumulate guest and steal time as well.
> enum {
> /* Task is sleeping or running in a CPU with VTIME inactive: */
> VTIME_INACTIVE = 0,
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f8e54ba 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
> #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>
> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> -static u64 vtime_delta(struct task_struct *tsk)
> +static u64 vtime_delta(struct task_struct *tsk, bool user)
> {
> - unsigned long now = READ_ONCE(jiffies);
> + u64 delta, ret = 0;
>
> - if (time_before(now, (unsigned long)tsk->vtime_snap))
> - return 0;
> + delta = sched_clock() - tsk->vtime_snap;
>
> - return jiffies_to_nsecs(now - tsk->vtime_snap);
> + if (is_idle_task(tsk)) {
> + if (delta >= TICK_NSEC)
> + ret = delta;
> + } else {
> + if (user) {
> + tsk->vtime_acct_utime += delta;
> + if (tsk->vtime_acct_utime >= TICK_NSEC)
> + ret = tsk->vtime_acct_utime;
> + } else {
> + tsk->vtime_acct_stime += delta;
> + if (tsk->vtime_acct_utime >= TICK_NSEC)
> + ret = tsk->vtime_acct_stime;
> + }
We already have vtime_account_idle, vtime_account_user, etc...
The accumulation should be done by these functions that know what and where
to account. vtime_delta() should really just return the difference against vtime_snap,
it's too low level to care about these details.
> + }
> +
> + return ret;
> }
>
> -static u64 get_vtime_delta(struct task_struct *tsk)
> +static u64 get_vtime_delta(struct task_struct *tsk, bool user)
> {
> - unsigned long now = READ_ONCE(jiffies);
> - u64 delta, other;
> + u64 delta = vtime_delta(tsk, user);
> + u64 other;
> +
> + if (!is_idle_task(tsk)) {
> + if (user)
> + tsk->vtime_acct_utime = 0;
> + else
> + tsk->vtime_acct_stime = 0;
> + }
Like vtime_delta(), get_vtime_delta() shouldn't touch these accumulators.
Reset and accounting really should be done by the upper level functions
vtime_account_*()
Thanks.
next prev parent reply other threads:[~2017-04-13 13:33 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-23 20:55 [BUG nohz]: wrong user and system time accounting Luiz Capitulino
2017-03-24 0:56 ` Rik van Riel
2017-03-24 1:05 ` Luiz Capitulino
2017-03-24 1:08 ` Rik van Riel
2017-03-24 1:39 ` Luiz Capitulino
2017-03-27 5:33 ` lkml
2017-03-24 1:52 ` Wanpeng Li
2017-03-24 3:56 ` Luiz Capitulino
2017-03-27 1:56 ` Wanpeng Li
2017-03-27 17:35 ` Rik van Riel
2017-03-28 7:19 ` Wanpeng Li
[not found] ` <20170328132406.7d23579c@redhat.com>
[not found] ` <20170328161454.4a5d9e8b@redhat.com>
2017-03-28 21:01 ` Rik van Riel
2017-03-28 21:26 ` Luiz Capitulino
2017-03-29 9:56 ` Wanpeng Li
2017-03-29 12:56 ` Frederic Weisbecker
2017-03-28 21:24 ` Rik van Riel
2017-03-28 21:30 ` Luiz Capitulino
[not found] ` <20170329131656.1d6cb743@redhat.com>
2017-03-29 20:08 ` Rik van Riel
2017-03-29 22:54 ` Frederic Weisbecker
2017-03-30 12:57 ` Rik van Riel
2017-03-30 1:58 ` Wanpeng Li
2017-03-30 12:40 ` Frederic Weisbecker
2017-03-30 13:19 ` Mike Galbraith
2017-03-30 4:27 ` Mike Galbraith
2017-03-30 6:47 ` Wanpeng Li
2017-03-30 11:52 ` Wanpeng Li
2017-03-30 12:33 ` Mike Galbraith
2017-03-30 13:38 ` Frederic Weisbecker
2017-03-30 13:59 ` Wanpeng Li
2017-03-30 14:18 ` Frederic Weisbecker
2017-03-30 21:25 ` Luiz Capitulino
2017-03-31 20:09 ` Luiz Capitulino
2017-03-31 23:24 ` Frederic Weisbecker
2017-04-01 3:11 ` Luiz Capitulino
2017-04-03 15:23 ` Frederic Weisbecker
2017-04-03 19:06 ` Luiz Capitulino
2017-04-04 17:36 ` Luiz Capitulino
2017-04-05 14:26 ` Rik van Riel
2017-04-11 11:03 ` Wanpeng Li
2017-04-11 11:36 ` Peter Zijlstra
2017-04-11 11:43 ` Wanpeng Li
2017-04-11 14:22 ` Thomas Gleixner
2017-04-12 13:18 ` Frederic Weisbecker
2017-04-12 14:57 ` Thomas Gleixner
2017-04-12 15:14 ` Frederic Weisbecker
2017-04-13 4:31 ` Wanpeng Li
2017-04-13 13:32 ` Frederic Weisbecker [this message]
2017-05-02 10:01 ` Wanpeng Li
2017-05-15 8:17 ` Wanpeng Li
2017-06-29 17:22 ` Frederic Weisbecker
2017-03-30 12:51 ` Frederic Weisbecker
2017-03-30 13:02 ` Rik van Riel
2017-03-30 13:35 ` Mike Galbraith
2017-04-03 14:40 ` Frederic Weisbecker
2017-04-04 7:32 ` Mike Galbraith
2017-03-30 13:44 ` Frederic Weisbecker
[not found] ` <20170329221700.GB23895@lerouge>
2017-03-29 22:46 ` Wanpeng Li
2017-03-30 2:14 ` Luiz Capitulino
2017-03-30 12:27 ` Wanpeng Li
2017-03-27 18:38 ` Luiz Capitulino
2017-03-28 5:28 ` Wanpeng Li
2017-03-28 13:44 ` Luiz Capitulino
2017-03-29 13:04 ` Frederic Weisbecker
2017-03-29 13:14 ` Rik van Riel
2017-03-29 13:23 ` Luiz Capitulino
2017-03-29 21:12 ` Frederic Weisbecker
2017-03-30 1:48 ` Luiz Capitulino
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170413133255.GA2608@lerouge \
--to=fweisbec@gmail.com \
--cc=efault@gmx.de \
--cc=kernellwp@gmail.com \
--cc=lcapitulino@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).