From: Jeremy Fitzhardinge
Subject: Re: Runaway real/sys time in newer paravirt domUs?
Date: Tue, 06 Jul 2010 12:05:01 -0700
Message-ID: <4C337E5D.7040407@goop.org>
To: Jed Smith
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On 07/06/2010 09:32 AM, Jed Smith wrote:
> Good morning,
>
> We've had a few reports from domU customers[1] - confirmed by myself - that CPU
> time accounting is very inaccurate in certain circumstances. This issue seems
> to be limited to x86_64 domUs, starting around the 2.6.32 family (but I can't
> be sure of that).
>
> The symptoms of the flaw include top reporting hours and days of CPU consumed
> by a task which has been running for mere seconds of wall time, as well as the
> time(1) utility reporting hundreds of years in some cases. Counterintuitively,
> the /proc/stat timers on all four VCPUs increment at roughly the expected rate.
> Needless to say, this is puzzling.
>
> A test case which highlights the failure was brought to our attention by
> Ævar Arnfjörð Bjarmason: a simple Perl script[2] that forks and
> executes numerous dig(1) processes. At the end of his script, time(1) reports
> 268659840m0.951s of user and 38524003m13.072s of system time consumed. I am
> able to confirm this demonstration using:
>
> - Xen 3.4.1 on dom0 2.6.18.8-931-2
> - Debian Lenny on domU 2.6.32.12-x86_64-linode12 [3]
>
> Running Ævar's test case looks like this, in that domU:
>
>> real    0m30.741s
>> user    307399002m50.773s
>> sys     46724m44.192s
>
> However, a quick busyloop in Python seems to report the correct time:
>
>> li21-66:~# cat doit.py
>> for i in xrange(10000000):
>>     a = i ** 5
>>
>> li21-66:~# time python doit.py
>>
>> real    0m16.600s
>> user    0m16.593s
>> sys     0m0.006s
>
> I rebooted the domU, and the problem no longer exists. It seems to be transient
> in nature, and difficult to isolate. /proc/stat seems to increment normally:
>
>> li21-66:/proc# cat stat | grep "cpu " && sleep 1 && cat stat | grep "cpu "
>> cpu  3742 0 1560 700180 1326 0 27 1282 0
>> cpu  3742 0 1562 700983 1326 0 27 1282 0
>
> I'm not sure where to begin with this one - any thoughts?

It would be helpful to identify which kernel version the change of
behaviour started in (ideally a git bisect down to a particular change,
but a pair of versions would be close enough).

I think this is the same problem as
https://bugzilla.kernel.org/show_bug.cgi?id=16314

Thanks,
    J
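[Editor's note: Ævar's Perl script [2] is not reproduced in this thread. As a
rough stand-in, the sketch below exercises the same fork/exec-heavy pattern
from Python and reports CPU time via os.times(). It is hypothetical: it
substitutes /bin/true for dig(1), on the assumption that the fork/exec churn,
not dig itself, is what triggers the misaccounting.]

```python
import os
import subprocess
import time

def fork_storm(n=200):
    """Fork and exec n short-lived children, then report wall/user/sys time.

    Stand-in for the fork-heavy Perl test case; /bin/true is a hypothetical
    substitute for dig(1).
    """
    wall_start = time.time()
    t0 = os.times()
    for _ in range(n):
        subprocess.call(["/bin/true"])
    t1 = os.times()
    wall = time.time() - wall_start
    # children_user/children_system accumulate CPU time of reaped children,
    # which is where the runaway figures showed up in the reports above.
    user = (t1.user - t0.user) + (t1.children_user - t0.children_user)
    sys_ = (t1.system - t0.system) + (t1.children_system - t0.children_system)
    return wall, user, sys_

if __name__ == "__main__":
    wall, user, sys_ = fork_storm()
    print("real %.3fs  user %.3fs  sys %.3fs" % (wall, user, sys_))
```

On a healthy kernel, user+sys should stay below roughly ncpus x wall; on an
affected domU the reported user/sys figures can exceed wall time by many
orders of magnitude, as in the time(1) output quoted above.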