From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752461AbcHJV0Q (ORCPT <rfc822;w@1wt.eu>);
	Wed, 10 Aug 2016 17:26:16 -0400
Received: from mx1.redhat.com ([209.132.183.28]:28845 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932486AbcHJSDc (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 10 Aug 2016 14:03:32 -0400
Message-ID: <1470805661.13905.95.camel@redhat.com>
Subject: Re: [PATCH 1/5] sched,time: Count actually elapsed irq & softirq
 time
From: Rik van Riel <riel@redhat.com>
To: Wanpeng Li <kernellwp@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>, Ingo Molnar <mingo@kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Wanpeng Li <wanpeng.li@hotmail.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Radim Krcmar <rkrcmar@redhat.com>, Mike Galbraith <efault@gmx.de>
Date: Wed, 10 Aug 2016 01:07:41 -0400
In-Reply-To: <CANRm+CzXSCNGoVbuOB0Ruj2nmfHNRfcO3eB-91Z-fnBOnn-gbQ@mail.gmail.com>
References: <1468421405-20056-1-git-send-email-fweisbec@gmail.com>
 <1468421405-20056-2-git-send-email-fweisbec@gmail.com>
	 <CANRm+CyYHLaihuk+ckQg42-Lo_3vFHaS8gU=GmXh8Rfq5mMpaA@mail.gmail.com>
 <1470751579.13905.77.camel@redhat.com>
	 <CANRm+CzxWo9i_=Ygn4CAF0L5=1avtnX=ahwMK96fj3hnQW-_zA@mail.gmail.com>
	 <CANRm+CzXSCNGoVbuOB0Ruj2nmfHNRfcO3eB-91Z-fnBOnn-gbQ@mail.gmail.com>
Content-Type: multipart/signed; micalg="pgp-sha256";
	protocol="application/pgp-signature"; boundary="=-YEkeJS99YQwEDixj33Wv"
Mime-Version: 1.0
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.26]); Wed, 10 Aug 2016 05:07:44 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


--=-YEkeJS99YQwEDixj33Wv
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, 2016-08-10 at 07:39 +0800, Wanpeng Li wrote:
> 2016-08-10 7:25 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> > 2016-08-09 22:06 GMT+08:00 Rik van Riel <riel@redhat.com>:
> > > On Tue, 2016-08-09 at 11:59 +0800, Wanpeng Li wrote:
> > > > Hi Rik,
> > > > 2016-07-13 22:50 GMT+08:00 Frederic Weisbecker <fweisbec@gmail.
> > > > com>:
> > > > > From: Rik van Riel <riel@redhat.com>
> > > > >=20
> > > > > Currently, if there was any irq or softirq time during
> > > > > 'ticks'
> > > > > jiffies, the entire period will be accounted as irq or
> > > > > softirq
> > > > > time.
> > > > >=20
> > > > > This is inaccurate if only a subset of the time was actually
> > > > > spent
> > > > > handling irqs, and could conceivably mis-count all of the
> > > > > ticks
> > > > > during
> > > > > a period as irq time, when there was some irq and some
> > > > > softirq
> > > > > time.
> > > > >=20
> > > > > This can actually happen when irqtime_account_process_tick is
> > > > > called
> > > > > from account_idle_ticks, which can pass a larger number of
> > > > > ticks
> > > > > down
> > > > > all at once.
> > > > >=20
> > > > > Fix this by changing irqtime_account_hi_update,
> > > > > irqtime_account_si_update,
> > > > > and steal_account_process_ticks to work with cputime_t time
> > > > > units,
> > > > > and
> > > > > return the amount of time spent in each mode.
> > > >=20
> > > > Do we need to minus st cputime from idle cputime in
> > > > account_idle_ticks() when noirqtime is true? I try to add this
> > > > logic
> > > > w/ noirqtime and idle=3Dpoll boot parameter for a full dynticks
> > > > guest,
> > > > however, there is no difference, where I miss?
> > >=20
> > > Yes, you are right. The code in account_idle_ticks()
> > > could use the same treatment.
> > >=20
> > > I am not sure why it would not work, though...
> >=20
> > Actually I observed a regression caused by this patch. I use a i5
>=20
> The regression is caused by your commit "sched,time: Count actually
> elapsed irq & softirq time".

Wanpeng and I discussed this issue, and discovered
that this bug is triggered by my patch, specifically
this bit:

-=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0if (steal_account_process_tick(U=
LONG_MAX))
+=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0other =3D account_other_time(cpu=
time);
+=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0if (other >=3D cputime)
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0return;

Replacing "cputime" with "ULONG_MAX" as the argument
to account_other_time makes the bug disappear.

However, this is not the cause of the bug.

The cause of the bug appears to be that the values
used to figure out how much steal time has passed
are never initialized.

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 steal =3D paravirt_=
steal_clock(smp_processor_id());
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0=C2=A0steal -=3D this_rq()->prev_steal_time;

The first of the two may be initialized by the host
(I did not verify that), but the second one does not
have any explicit initializers anywhere in the kernel
tree.

This can lead to an arbitrarily large difference between
paravirt_steal_clock(smp_processor_id()) and
this_rq()->prev_steal_time, which results in nothing but
steal time getting accounted for a potentially a very
long amount of time.

Previously we carried this patch to initialize the
various rq->prev_* values at CPU hotplug time:

https://patchwork.codeaurora.org/patch/27699/

Which got reverted by Paolo here:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=3Dsche
d/core&id=3D3d89e5478bf550a50c99e93adf659369798263b0

Which leads me to this question:

Paulo, how would you like us to fix this bug?

It seems like the host and guest steal time CAN get out
of sync, sometimes by a ridiculous amount, and we need
some way to get the excessive difference out of the way,
without it getting accounted as steal time (not immediately,
and not over the next 17 hours, or months).

--=20

All Rights Reversed.
--=-YEkeJS99YQwEDixj33Wv
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAABCAAGBQJXqradAAoJEM553pKExN6D7/IH/0CNCsZZj5gpiBibUQxPTj2I
wCQykGvSDXkr3kKnoHLyxMsGO0DPtklQxWVaV9KfPNAvCSCxQAPdQhoqU4pGHm3q
lz2nhXJRtKn+VMt/879zIsSBsfelVk+nBnEj8qQDBXZ1oT3/QMF2+67L4ojqYBNs
+Laf5mOaNdVGxS97th8zOD49ftPUyVLh8d9xBk0OCHdK4ZzVW6VllTlDNRBY7LZu
b2+T0/lRcKFNgq+QnHARlzlZL1G4cLaLPN/O1lROJpYc9K6sc3ktGQxCsiYPv+/k
POsBTaPE5/3rtfFhcEBq39lOdxodRzauRFVn85WoKkyrkBnCqNH5AeUv6TpcxSk=
=nsE9
-----END PGP SIGNATURE-----

--=-YEkeJS99YQwEDixj33Wv--