From: Jeremy Fitzhardinge
Subject: Re: Runaway real/sys time in newer paravirt domUs?
Date: Tue, 06 Jul 2010 12:05:01 -0700
Message-ID: <4C337E5D.7040407@goop.org>
To: Jed Smith
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On 07/06/2010 09:32 AM, Jed Smith wrote:
> Good morning,
>
> We've had a few reports from domU customers[1] - confirmed by myself - that CPU
> time accounting is very inaccurate in certain circumstances. This issue seems
> to be limited to x86_64 domUs, starting around the 2.6.32 family (but I can't
> be sure of that).
>
> The symptoms of the flaw include top reporting hours and days of CPU consumed
> by a task which has been running for mere seconds of wall time, as well as the
> time(1) utility reporting hundreds of years in some cases. Counterintuitively,
> the /proc/stat timers on all four VCPUs increment at roughly the expected rate.
> Needless to say, this is puzzling.
>
> A test case which highlights the failure was brought to our attention by
> Ævar Arnfjörð Bjarmason: a simple Perl script[2] that forks and
> executes numerous dig(1) processes. At the end of his script, time(1) reports
> 268659840m0.951s of user and 38524003m13.072s of system time consumed. I am
> able to confirm this demonstration using:
>
> - Xen 3.4.1 on dom0 2.6.18.8-931-2
> - Debian Lenny on domU 2.6.32.12-x86_64-linode12 [3]
>
> Running Ævar's test case looks like this, in that domU:
>
>> real    0m30.741s
>> user    307399002m50.773s
>> sys     46724m44.192s
>
> However, a quick busyloop in Python seems to report the correct time:
>
>> li21-66:~# cat doit.py
>> for i in xrange(10000000):
>>     a = i ** 5
>>
>> li21-66:~# time python doit.py
>>
>> real    0m16.600s
>> user    0m16.593s
>> sys     0m0.006s
>
> I rebooted the domU, and the problem no longer exists. It seems to be transient
> in nature, and difficult to isolate. /proc/stat seems to increment normally:
>
>> li21-66:/proc# cat stat | grep "cpu " && sleep 1 && cat stat | grep "cpu "
>> cpu  3742 0 1560 700180 1326 0 27 1282 0
>> cpu  3742 0 1562 700983 1326 0 27 1282 0
>
> I'm not sure where to begin with this one - any thoughts?

It would be helpful to identify which kernel version the change of
behaviour started in (ideally a git bisect down to a particular change,
but a pair of versions would be close enough).

I think this is the same problem as
https://bugzilla.kernel.org/show_bug.cgi?id=16314

Thanks,
    J
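[Editor's note: Ævar's Perl script [2] is not reproduced in this thread. As a
rough stand-in, the sketch below exercises the same fork/exec-heavy pattern
from Python and reports CPU time via os.times(). It is hypothetical: it
substitutes /bin/true for dig(1), on the assumption that the fork/exec churn,
not dig itself, is what triggers the misaccounting.]

```python
import os
import subprocess
import time

def fork_storm(n=200):
    """Fork and exec n short-lived children, then report wall/user/sys time.

    Stand-in for the fork-heavy Perl test case; /bin/true is a hypothetical
    substitute for dig(1).
    """
    wall_start = time.time()
    t0 = os.times()
    for _ in range(n):
        subprocess.call(["/bin/true"])
    t1 = os.times()
    wall = time.time() - wall_start
    # children_user/children_system accumulate CPU time of reaped children,
    # which is where the runaway figures showed up in the reports above.
    user = (t1.user - t0.user) + (t1.children_user - t0.children_user)
    sys_ = (t1.system - t0.system) + (t1.children_system - t0.children_system)
    return wall, user, sys_

if __name__ == "__main__":
    wall, user, sys_ = fork_storm()
    print("real %.3fs  user %.3fs  sys %.3fs" % (wall, user, sys_))
```

On a healthy kernel, user+sys should stay below roughly ncpus x wall; on an
affected domU the reported user/sys figures can exceed wall time by many
orders of magnitude, as in the time(1) output quoted above.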