From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932853AbdC2Wyf (ORCPT <rfc822;w@1wt.eu>);
        Wed, 29 Mar 2017 18:54:35 -0400
Received: from mail-wr0-f193.google.com ([209.85.128.193]:35473 "EHLO
        mail-wr0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752656AbdC2Wyd (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 29 Mar 2017 18:54:33 -0400
Date: Thu, 30 Mar 2017 00:54:30 +0200
From: Frederic Weisbecker <fweisbec@gmail.com>
To: Rik van Riel <riel@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>,
        Wanpeng Li <kernellwp@gmail.com>, linux-kernel@vger.kernel.org,
        Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [BUG nohz]: wrong user and system time accounting
Message-ID: <20170329225428.GC23895@lerouge>
References: <20170323165512.60945ac6@redhat.com>
 <CANRm+CxcgSP2-x+A822DmHLvFLzFmTptS6oYwYtwVdErTpiB=Q@mail.gmail.com>
 <1490636129.8850.76.camel@redhat.com>
 <20170328132406.7d23579c@redhat.com>
 <20170329131656.1d6cb743@redhat.com>
 <1490818125.28917.11.camel@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1490818125.28917.11.camel@redhat.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

(Adding Thomas in Cc)

On Wed, Mar 29, 2017 at 04:08:45PM -0400, Rik van Riel wrote:
> On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote:
> > On Tue, 28 Mar 2017 13:24:06 -0400
> > Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > 
> > >  1. In my tracing I'm seeing that sometimes (always?) the
> > >     time interval between two timer interrupts is less than 1ms
> > 
> > I think that's the root cause.
> > 
> > In this trace, we see the following:
> > 
> >  1. On CPU15, we transition from user-space to kernel-space because
> >     of a timer interrupt (it's the tick)
> > 
> >  2. vtime_delta() returns 0, because jiffies didn't change since the
> >     last accounting
> > 
> >  3. While CPU15 is executing in kernel-space, jiffies is updated
> >     by CPU0
> > 
> >  4. When going back to user-space, vtime_delta() returns non-zero
> >     and the whole time is accounted for system time (observe how
> >     the cputime parameter in account_system_time() is less than 1ms)
> 
> In other words, the tick on cpu0 is aligned
> with the tick on the nohz_full cpus, and
> jiffies is advanced while the nohz_full cpus
> with an active tick happen to be in kernel
> mode?

Ah you found out faster than me :-)

> Frederic, can you think of any reason why
> the tick on nohz_full CPUs would end up aligned
> with the tick on cpu0, instead of running at some
> random offset?

tick_init_jiffy_update() makes the decision to align all ticks.

I'm not sure why. I don't see anything that depends on that
system-wide tick synchronization. The jiffies update itself relies on
ktime to decide when to advance. So even if the tick fires a bit later
on CPU 1 than on CPU 0, the jiffies updates should stay coherent and
the delay should never exceed 999us in the worst case (for HZ=1000).
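
To make that concrete, here is a rough user-space model of a
ktime-driven jiffies update. It is only a sketch, loosely inspired by
what tick_do_update_jiffies64() does, not the kernel code itself: the
struct, the function names and TICK_NSEC are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define HZ           1000
#define NSEC_PER_SEC 1000000000LL
#define TICK_NSEC    (NSEC_PER_SEC / HZ)   /* 1ms period for HZ=1000 */

/* Hypothetical model state: not a kernel structure. */
struct jiffies_model {
	int64_t  last_update_ns;  /* period boundary of the last update */
	uint64_t jiffies;         /* modelled jiffies counter */
};

/*
 * Called from whichever tick fires, at time now_ns. jiffies only
 * advances by the number of whole periods elapsed since the last
 * update, so a tick that fires late does not make the counter drift.
 */
static void model_update_jiffies(struct jiffies_model *m, int64_t now_ns)
{
	int64_t delta, ticks;

	assert(now_ns >= m->last_update_ns);

	delta = now_ns - m->last_update_ns;
	if (delta < TICK_NSEC)
		return;  /* less than one period elapsed: nothing to do */

	ticks = delta / TICK_NSEC;
	m->last_update_ns += ticks * TICK_NSEC;
	m->jiffies += (uint64_t)ticks;
}

/* How far now_ns is past the period boundary jiffies last caught up to. */
static int64_t model_jiffies_lag_ns(const struct jiffies_model *m,
				    int64_t now_ns)
{
	return now_ns - m->last_update_ns;
}
```

In this model, calling model_update_jiffies() 300us after every period
boundary still advances jiffies by exactly one per period, and right
after an update the lag stays below TICK_NSEC.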

But I might be overlooking something.

> 
> A random offset, or better yet a somewhat randomized
> tick length to make sure that simultaneous ticks are
> fairly rare and the vtime sampling does not end up
> "in phase" with the jiffies incrementing, could make
> the accounting work right again.
> 
> Of course, that assumes the above hypothesis is correct :)

I'm not sure that randomizing the tick start per CPU would be the
right solution. Somewhere in the world you can be sure the randomized
tick of some nohz_full CPU will coincide with the tick of CPU 0 :o)

Or we could force that tick on nohz_full CPUs to be far from
CPU 0's tick... I'm not sure such a solution would be accepted though.
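
To illustrate why the phase matters, here is a toy user-space
simulation of jiffies-granularity vtime sampling. It is a sketch under
simplifying assumptions, not kernel code: every name in it is
hypothetical, CPU 0 is assumed to publish the new jiffies value a fixed
latency after each period boundary, and the nohz_full CPU is assumed to
enter the kernel only for its own tick.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NS 1000000LL  /* one jiffy at HZ=1000 */

/* Hypothetical per-task accounting state for the model. */
struct vtime_model {
	int64_t snap;      /* jiffies value at the last accounting */
	int64_t utime_ns;  /* time accounted as user */
	int64_t stime_ns;  /* time accounted as system */
};

/* jiffies as seen at now_ns, given CPU 0's (assumed) update latency. */
static int64_t model_jiffies(int64_t now_ns, int64_t update_latency_ns)
{
	if (now_ns < update_latency_ns)
		return 0;
	return (now_ns - update_latency_ns) / TICK_NS;
}

/* Jiffies-based delta: zero unless jiffies moved since the snapshot. */
static int64_t model_vtime_delta(struct vtime_model *v, int64_t jiffies_now)
{
	int64_t delta = jiffies_now - v->snap;

	v->snap = jiffies_now;
	return delta * TICK_NS;
}

/*
 * The nohz_full CPU's tick fires offset_ns after each period boundary
 * and the CPU spends irq_ns in the kernel before returning to user.
 */
static struct vtime_model simulate(int64_t offset_ns, int64_t irq_ns,
				   int64_t update_latency_ns, int nr_ticks)
{
	struct vtime_model v = {0, 0, 0};

	assert(offset_ns >= 0 && irq_ns > 0);
	assert(offset_ns + irq_ns < TICK_NS);

	for (int k = 1; k <= nr_ticks; k++) {
		int64_t entry_ns = k * TICK_NS + offset_ns;
		int64_t exit_ns = entry_ns + irq_ns;

		/* user -> kernel: pending delta is charged to user time */
		v.utime_ns += model_vtime_delta(&v,
				model_jiffies(entry_ns, update_latency_ns));
		/* kernel -> user: pending delta is charged to system time */
		v.stime_ns += model_vtime_delta(&v,
				model_jiffies(exit_ns, update_latency_ns));
	}
	return v;
}
```

With simulate(0, 10000, 1000, 1000), i.e. an aligned tick, 10us spent
in the kernel and a 1us update latency, the model charges every jiffy
to system time even though the CPU spends about 99% of its time in
user-space. With simulate(TICK_NS / 2, 10000, 1000, 1000) every jiffy
goes to user time instead, which is much closer to reality.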