From mboxrd@z Thu Jan  1 00:00:00 1970
From: Dario Faggioli <dario.faggioli@citrix.com>
Subject: Re: BUG: NOW() seems to (sometimes) go backwards!
Date: Wed, 8 Jun 2016 15:43:37 +0200
Message-ID: <1465393417.15816.206.camel@citrix.com>
References: <1465314895.15816.165.camel@citrix.com>
 <575812BD02000078000F2F82@prv-mh.provo.novell.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============7223054000126524434=="
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta14.messagelabs.com ([193.109.254.103])
 by lists.xenproject.org with esmtp (Exim 4.84_2)
 (envelope-from <prvs=960b276de=dario.faggioli@citrix.com>)
 id 1bAdlu-0005eY-JD
 for xen-devel@lists.xenproject.org; Wed, 08 Jun 2016 13:43:50 +0000
In-Reply-To: <575812BD02000078000F2F82@prv-mh.provo.novell.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
 <mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
 <mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Errors-To: xen-devel-bounces@lists.xen.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>
To: Jan Beulich <JBeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Joao Martins <joao.m.martins@oracle.com>, Tim Deegan <tim@xen.org>, George Dunlap <george.dunlap@citrix.com>, "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
List-Id: xen-devel@lists.xenproject.org

--===============7223054000126524434==
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="=-efDKiVy67YCKrF912hqQ"

--=-efDKiVy67YCKrF912hqQ
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, 2016-06-08 at 04:42 -0600, Jan Beulich wrote:
> > > > On 07.06.16 at 17:54, <dario.faggioli@citrix.com> wrote:
> > So, it looks to me that the TSC is actually ok, and it could be the
> > local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
> > (which, AFAICUI, is there to convert cycles to time) that does not
> > guarantee the result to be monotonic, even if the input is...
> > Thoughts?
> Indeed this smells like an issue in the scaling. However, the scaling
> values vary only when !CONSTANT_TSC. Can you check that this
> flag is actually set on that system?=20
>
Checked. I do have it. I instrumented the code to print stuff if it
finds it, and it prints.

Also:
root@Zhaman:~# xl debug-keys s
(XEN) [=C2=A0=C2=A0406.719464] TSC marked as reliable, warp =3D 0 (count=3D=
3)
(XEN) [=C2=A0=C2=A0406.719467] dom1(hvm): mode=3D0,ofs=3D0xffffd9279716c276=
,khz=3D2394069,inc=3D4,vtsc count: 195367 kernel, 0 user

> (I hope you're not running a
> strange Dom0 setup with FREQCTL_dom0_kernel in effect.)=20
>
I've no idea what this is. I've been running 4.1.0, built myself, and
stock Debian unstable 4.5.0, and I'm seeing the issue in both cases.

Looking FREQCTL_dom0_kernel up, I guess you mean what happens when one
passes cpufreq=3D"dom0-kernel" to xen on boot command line. In which
case, no, I'm not doing that.

> And
> at the same time that it's time_calibration_tsc_rendezvous() that
> is in use?
>=20
The code you're referring to should be this:

=C2=A0 =C2=A0 /* If we have constant-rate TSCs then scale factor can be sha=
red. */
=C2=A0=C2=A0=C2=A0=C2=A0if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
=C2=A0=C2=A0=C2=A0=C2=A0{
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0/* If TSCs are not marked a=
s 'reliable', re-sync during rendezvous. */
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0if ( !boot_cpu_has(X86_FEAT=
URE_TSC_RELIABLE) )
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0tim=
e_calibration_rendezvous_fn =3D time_calibration_tsc_rendezvous;
=C2=A0=C2=A0=C2=A0=C2=A0}

And I have both X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE.

I've again instrumented the code in order to check whether it is
time_calibration_tsc_rendezvous() or time_calibration_std_rendezvous()
that is being used, and it's the latter:

(XEN) [=C2=A0=C2=A0=C2=A0=C2=A01.795916] TSC reliable. Yay!! Using ffff82d0=
80198362 for rendevousez

[dario@Solace xen.git] $ objdump -D xen/xen-syms |grep ffff82d080198362
ffff82d080198362 <time_calibration_std_rendezvous>:

which looks correct to me.

> Yet when the scaling values get set only once at boot, monotonic
> (cross-CPU) TSC means monotonic (cross-CPU) returns from NOW().
>=20
Yep. And at this point, this is what needs to be verified, I guess...

> In any event - could you try to exclude C- and P-state effects? Of
> course that makes sense only if you can reasonably repro the
> problem situation (and hence can tell from its absence over a certain
> period of time that whatever change was done did have some
> positive effect).
>=20
It's actually quite hard *NOT* to reproduce the problem... it happens
all the time, and if the load is high enough, I see the "Time went
backwards?" printk several times per second!

So, trying to do what you suggest in an online fashion, i.e., issueing:

=C2=A0# xenpm=C2=A0set-max-cstate 0
=C2=A0# xenpm=C2=A0set-scaling-governor all performance

did not change the situation (I still see the printks).

I've then tried passing both cpufreq=3D"none" and=C2=A0max_cstate=3D0 to xe=
n at
boot, but they made no difference at all either.

Most of the time, we're speaking of very small skews, e.g.:

(XEN) [=C2=A0=C2=A0=C2=A059.999959] WARNING: __update_runq_load: Time went =
backwards? now 59999946079 llu 59999946085
(XEN) [=C2=A0=C2=A0117.595508] WARNING: __update_runq_load: Time went backw=
ards? now 117595495295 llu 117595495310

i.e., 6 nanoseconds or 15 nanoseconds!

Then there are instances where the difference is bigger (microseconds
time scale, like in the first email of the thread).

> How big of system are we talking about? I'm asking to assess the
> overhead of adding some cross-CPU checks to get_s_time_fixed()
> (in a debugging patch), logging relevant values if non-monotonic
> output gets detected. (On too big a system, the overhead here
> might end up masking the problem.)
>=20
Yeah, I sort of tried doing something like that already, but was
logging the wrong thing (I was not yet suspecting a problem with
scaling).

I can try putting something together again.

In any case, the system I use most for testing is a 16 cpus (in 2 NUMA
nodes) Xeon. But I see the issue even within the same socket, so I can
easily reduce to use a subset of the available processors.

Thanks for your time. :-)
Dario
--=20
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


--=-efDKiVy67YCKrF912hqQ
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAABCAAGBQJXWCEKAAoJEBZCeImluHPuc7IP/RxMhxmEsSfEjSWXQaefpQYV
CNftMMpvljoe1k4uYhZKl+00K/jliRc6GN3eExlYo2cp+X+/FdKljiGIT8OlmHS8
Uge8Xw+0TEJXC3hsNhTuYhQ4CxxIZpLA4baZ7gav+0Fq6068vBh23I+xUQo0rmOz
IM/agyhWD9Dw54LvxjhbU52M62VGu3lswhGwsmETjrOQdxRCsuYaN+c1ZOGROJ6O
ACy2F4V2ybHDVrGX8D3k0M02snHyH/U1w5VaVLRk3wr1bJIDDXS6bfG9APrizZRQ
mLYtjQdZl1xFIdwDvEMpfDp0c2rmmK+4TvWHZNrJXbB6df8qpiHTbmzQ7Nwc+pbE
NgPj7Qz0aAB9hoZDS0x+6mbCpCrsx4vDm0DW1leNUKIkRqhLotLwn88EcP+nBy9v
/7nNPhcGtEcG7IU7Ujdj9zH7/q2EDKOqSyvL64t3AYimX3L5vrI/b7qYAp5CBSiU
1Wz6LskyUks8g67fmQbjc/D0w2yBOlLJB9s+1nBCJmkRKv2B4icMFF7W3bBx3NHr
0Tw2SPzUL7TJKMTyksEMtkxiTCbyZxAm3z32zUNsYSXo2gG7IFj1lQdxJbbW4CwW
tEE5DWrNT7X+ybwPoYfY3B+zB1ltYBUoQtLjjuGjkVlqGAVw81woH3Vi1JVVaFTS
7dzzbLf82Ak2TMmKmXTG
=m1UG
-----END PGP SIGNATURE-----

--=-efDKiVy67YCKrF912hqQ--


--===============7223054000126524434==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: inline

X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs
IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwOi8vbGlzdHMueGVuLm9y
Zy94ZW4tZGV2ZWwK

--===============7223054000126524434==--