* BUG: NOW() seems to (sometimes) go backwards!
@ 2016-06-07 15:54 Dario Faggioli
  2016-06-08 10:42 ` Jan Beulich
  2016-06-09 10:19 ` George Dunlap
  0 siblings, 2 replies; 11+ messages in thread
From: Dario Faggioli @ 2016-06-07 15:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Joao Martins, Tim Deegan, George Dunlap, Jan Beulich


Hi,

I've been fighting with this during the past few days, while testing
and benchmarking some of the modification I'm doing to Credit2.

In summary, what I see is NOW() going backwards. The issue shows up
when I call NOW() on different pCPUs and compare the results (which is
something the Credit2 scheduler does a lot, and I would like it to be
able to continue doing so. :-D).

I can elaborate on why this is really really bad for scheduling in
general, and even more so for Credit2. For now, let me just say it's
pretty bad. Feel free to ask for details if interested.

In any case, IMO, we need to fix this.

It does not only happen cross-socket; it happens even when the two
pCPUs are part of the same socket (in the example shown below, they're
two SMT threads of the same core!!).

I'm able to trigger this on more than just one box, all with invariant
TSC support. (All Intel, though; I don't have any AMD test hardware
available right now.)

In some more detail:

The scheduler (both in schedule.c and in sched_credit{,2}.c) uses
NOW() for all the timing and time sampling. In the trace below, at time
T1 (equal to 1.264582607), we do this:

 d0v9->start_time = NOW()

and we are on pCPU 3.

Then, at time T2, on pCPU2, there is this:

 if ( NOW() < d0v9->start_time )
 {
   /* OMG!! */
   __trace("csched2:time_backwards ...")
 }

and we indeed hit the __trace().

What the trace point says is that NOW() was 1972180296557 at time T1,
on pCPU 3, and it is 1972180270044 at time T2, on pCPU 2 (i.e., ~26.5
microsecs in the past).

I speak about NOW() going backwards because, by looking at both the
code and the trace, I am positive that T2 happens later than T1 (go
to [1], if interested)... So why is NOW() smaller at T2? :-(

   1.264577304 ||x||||||||||||| d0v10 runstate_change d0v10 running->runnable
   1.264577611 ||x||||||||||||| d?v? runstate_change d0v5 runnable->running
   1.264580025 |||x||||||-||||| d0v0 vcpu_block d0v0
   1.264580795 |||x||||||-||||| d0v0 csched2:schedule current = d0v0, now = 1972180296557
   1.264581014 |||x||||||-||||| d0v0 csched2:burn_credits d0v0, credit = 9091956, delta = 39433, start_time = 1972180296557
   1.264581660 |||x||||||-||||| d0v0 sched_switch prev d0, run for 112.503us
   1.264581809 |||x||||||-||||| d0v0 sched_switch next d0, was runnable for 54.532us, next slice 2000.0us
   1.264582033 |||x||||||-||||| d0v0 sched_switch prev d0v0 next d0v9
   1.264582269 |||x||||||-||||| d0v0 runstate_change d0v0 running->blocked
T1 1.264582607 |||x||||||-||||| d?v? runstate_change d0v9 runnable->running
   1.264583109 ||x|||||||-||||| d0v5 csched2:runq_insert d0v10, position 0
   1.264583341 ||x|||||||-||||| d0v5 csched2:runq_tickle_new d0v10, credit = 5691963
   1.264583907 ||x|||||||-||||| d0v5 csched2:burn_credits d0v5, credit = 8575685, delta = 8241, start_time = 1972180270044
   1.264584236 ||x|||||||-||||| d0v5 csched2:burn_credits d1v1, credit = 8696613, delta = 12920, start_time = 1972180270044
   1.264584431 ||x|||||||-||||| d0v5 csched2:tickle_check d1v1, credit = 8696613
   1.264584745 ||x|||||||-||||| d0v5 csched2:burn_credits d1v3, credit = 10182550, delta = 12920, start_time = 1972180270044
   1.264584966 ||x|||||||-||||| d0v5 csched2:tickle_check d1v3, credit = 10182550
T2 1.264585207 ||x|||||||-||||| d0v5 csched2:time_backwards d0v9: cur_time = 1972180270044, cur_cpu = 2, old_time = 1972180296557, old_cpu = 3
   1.264585364 ||x|||||||-||||| d0v5 csched2:burn_credits d0v9, credit = 5759499, delta = -26513, start_time = 1972180270044
   1.264585537 ||x|||||||-||||| d0v5 csched2:tickle_check d0v9, credit = 5759499
   1.264585876 ||x|||||||-||||| d0v5 csched2:burn_credits d0v7, credit = 8835314, delta = 12920, start_time = 1972180270044
   1.264586041 ||x|||||||-||||| d0v5 csched2:tickle_check d0v7, credit = 8835314
   1.264586491 ||x|||||||-||||| d0v5 csched2:burn_credits d0v15, credit = 5822635, delta = 12920, start_time = 1972180270044
   1.264586655 ||x|||||||-||||| d0v5 csched2:tickle_check d0v15, credit = 5822635
   1.264586952 ||x|||||||-||||| d0v5 csched2:burn_credits d0v13, credit = 7956174, delta = 12920, start_time = 1972180270044
   1.264587115 ||x|||||||-||||| d0v5 csched2:tickle_check d0v13, credit = 7956174
   1.264587385 ||x|||||||-||||| d0v5 csched2:burn_credits d0v3, credit = 7870338, delta = 12920, start_time = 1972180270044
   1.264587545 ||x|||||||-||||| d0v5 csched2:tickle_check d0v3, credit = 7870338
   1.264595859 |||x||||||-||||| d0v9 csched2:schedule current = d0v9, now = 1972180311345
   1.264596313 |||x||||||-||||| d0v9 csched2:burn_credits d0v9, credit = 5718198, delta = 4877, start_time = 1972180311345

Interestingly enough, if we look at the same (well, sort of) set of
events with just xentrace_format, after sorting the records by their
timestamps, we get this:

   CPU2  5613124777311 (+     507)  running_to_runnable [ dom:vcpu = 0x0000000a ]
   CPU2  5613124778046 (+     735)  runnable_to_running [ dom:vcpu = 0x00000005 ]

   CPU3  5613124788657 (+     537)  __enter_scheduler [ prev<dom:vcpu> = 0x00000000, next<dom:vcpu> = 0x00000009 ]
   CPU3  5613124789224 (+     567)  running_to_blocked  [ dom:vcpu = 0x00000000 ]
T1 CPU3  5613124790034 (+     810)  runnable_to_running [ dom:vcpu = 0x00000009 ]

   CPU2  5613124791240 (+   13194)  csched2:runq_pos       [ dom:vcpu = 0x0000000a, pos = 0]
   CPU2  5613124791795 (+     555)  csched2:tickle_new     [ dom:vcpu = 0x0000000a, credit = 0x5691963 ]
   CPU2  5613124793154 (+    1359)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 5 ]
   CPU2  5613124793943 (+     789)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65537 ]
   CPU2  5613124794411 (+     468)  csched2:tickle_check   [ dom:vcpu = 0x00010001, credit = 8696613 ]
   CPU2  5613124795164 (+     753)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65539 ]
   CPU2  5613124795695 (+     531)  csched2:tickle_check   [ dom:vcpu = 0x00010003, credit = 10182550 ]
T2 CPU2  5613124796274 (+     579)  unknown (0x000000000002221e)  [ 0x2f1abbdc 0x000001cb 0x2f1b236d 0x000001cb 0x00000009 0x00030002 0x00000000 ]
   CPU2  5613124796649 (+     375)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 9 ]
   CPU2  5613124797066 (+     417)  csched2:tickle_check   [ dom:vcpu = 0x00000009, credit = 5759499 ]
   CPU2  5613124797879 (+     813)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 7 ]
   CPU2  5613124798275 (+     396)  csched2:tickle_check   [ dom:vcpu = 0x00000007, credit = 8835314 ]
   CPU2  5613124799355 (+    1080)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 15 ]
   CPU2  5613124799748 (+     393)  csched2:tickle_check   [ dom:vcpu = 0x0000000f, credit = 5822635 ]
   CPU2  5613124800459 (+     711)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 13 ]
   CPU2  5613124800852 (+     393)  csched2:tickle_check   [ dom:vcpu = 0x0000000d, credit = 7956174 ]
   CPU2  5613124801500 (+     648)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 3 ]
   CPU2  5613124801884 (+     384)  csched2:tickle_check   [ dom:vcpu = 0x00000003, credit = 7870338 ]

   CPU3  5613124803177 (+   13143)  csched2:update_load
   CPU3  5613124803786 (+     609)  unknown (0x0000000000022220)  [ 0x0000d61e 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
   CPU3  5613124804560 (+     774)  csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x12000009, rq_avgload = 0x0000000000216566, b_avgload = 0x0000000000216566,
   CPU3  5613124804929 (+     369)  unknown (0x0000000000022220)  [ 0x0001e8de 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
   CPU3  5613124805256 (+     327)  csched2:updt_vcpu_load [ dom:vcpu = 0x00000000, vcpuload = 0x0000000000026040, wshift = 18 ]

   CPU2  5613124975596 (+  173712)  domain_wake       [ dom:vcpu = 0x00010005 ]

Interestingly, the timestamp at T1 is 5613124790034, which is
_correctly_ smaller than the timestamp at T2 (5613124796274).

NOW(), as it is called from the scheduling code, ends up in
get_s_time_fixed() (with at_tsc == 0), which does this:
 static s_time_t get_s_time_fixed(u64 at_tsc)  /* NOW() passes at_tsc == 0 */
 {
     struct cpu_time *t = &this_cpu(cpu_time);
     u64 tsc, delta;
     s_time_t now;

     if ( at_tsc )
         tsc = at_tsc;
     else
         tsc = rdtsc();
     delta = tsc - t->local_tsc_stamp;
     now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);

     return now;
 }

While the timestamps in the trace are the result of get_cycles():
 static inline cycles_t get_cycles(void)
 {
     return rdtsc();
 }

So, it looks to me that the TSC is actually ok, and it could be the
local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
(which, AFAIUI, is there to convert cycles to time) that does not
guarantee the result to be monotonic, even if the input is... Thoughts?
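
For reference, scale_delta() is conceptually just a fixed-point
multiply. Here is a portable sketch (struct time_scale is the real
thing; the actual implementation uses inline assembly for the
64x32->96 bit multiply, so take this as an illustration only):

 struct time_scale {
     int shift;
     u32 mul_frac;
 };

 /* Scale a TSC delta (cycles) to nanoseconds: ns = delta * scale. */
 static u64 scale_delta(u64 delta, const struct time_scale *scale)
 {
     if ( scale->shift < 0 )
         delta >>= -scale->shift;
     else
         delta <<= scale->shift;

     /* Multiply by the 0.32 fixed-point fraction; keep the top 64 bits. */
     return (u64)(((unsigned __int128)delta * scale->mul_frac) >> 32);
 }

So, as long as every CPU uses the same tsc_scale and a consistent
(local_tsc_stamp, stime_local_stamp) pair, a monotonic TSC should give
a monotonic NOW().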

Archeology says that the current shape of get_s_time_fixed() comes
from:
commit bbb0eb32255d1a3e97994d9c57ca5e3d50de4298
commit 8cf6cfe286aa6949f9bd546d03e275e21e9723db

But the commit messages are not particularly helpful... and I'm not
familiar at all with any of this. I'll keep investigating, but in the
meanwhile, any thought or idea is more than welcome.

Thanks and Regards,
Dario

[1] So, all the various functions/chunks of code of interest here are
serialized by the same spinlock (there is one runqueue for the whole
socket). Considering both the code and the trace, the flow of execution
is this:

                   CPU2                  CPU3
                     |                     |
             csched2_schedule()            |
              spin_lock(rq->lock)          |
1.264577304   d0v10 --> runnable           |
1.264577611   current <-- d0v5             |
              spin_unlock(rq->lock)        |
                     |                     |
1.264580025          |              vcpu_block(d0v0)
                     |               spin_lock(rq->lock)
                     |               spin_unlock(rq->lock)
1.264580795          |             csched2_schedule()
                     |              spin_lock(rq->lock)
                     |              current <-- d0v9
1.264582607          |              d0v9->start_time = NOW()  (***T1)
                     |              spin_unlock(rq->lock)
                     |                     |
             context_saved(d0v10)          |
              spin_lock(rq->lock);         |
1.264583109   runq_insert(d0v10)           |
1.264583341   runq_tickle_new(d0v10)       |
1.264583907   burn_credits(d0v5)           |
1.264584236   burn_credits(d1v1)           |
1.264584431   tickle_check(d1v1)           |
1.264584745   burn_credits(d1v3)           |
1.264584966   tickle_check(d1v3)           |
1.264585207   !!time_backwards(d0v9)       |                  (***T2)
1.264585364   burn_credits(d0v9)           |
1.264585537   tickle_check d0v9            |
              ...                         ...
              spin_lock(rq->lock)          |

NOW() is always sampled and checked *inside* a critical section
protected by rq->lock, so it's not possible for the value to go stale
while waiting for the lock (I've fixed a couple of instances of this
already).

So, if this is what happens, it's correct to expect NOW() at T2 to be
ahead of NOW() at T1 (and hence what really happens, i.e., NOW() being
smaller at T2 than at T1, means there is a bug somewhere). And I am
sure the flow is actually this, not only because it is the order shown
by both xenalyze and xentrace_format, but also from looking at the
implementation of runq_tickle() in sched_credit2.c.

In fact, in there, we first of all call burn_credits() and check
whether we can preempt the vcpu running on our vc->processor; in this
case, that is d0v10 checking d0v5, at time 1.264583907. Afterwards, we
check the vcpu running on each pCPU (of our runqueue): that is d1v1 on
pCPU 0, d1v3 on pCPU 1, then we skip d0v5 on pCPU 2 (we already checked
it outside of the loop), and then we check d0v9 on pCPU 3.

And this is eventually why I say I'm sure about the execution flow:
d0v9 is current on pCPU 3 _only_ if the call to csched2_schedule() that
put it in execution has already happened (still on pCPU 3), and that,
thanks also to the locking, means d0v9->start_time = NOW() at T1 had
already happened when we reached T2.
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-07 15:54 BUG: NOW() seems to (sometimes) go backwards! Dario Faggioli
@ 2016-06-08 10:42 ` Jan Beulich
  2016-06-08 13:43   ` Dario Faggioli
  2016-06-09 10:19 ` George Dunlap
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2016-06-08 10:42 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andrew Cooper, Joao Martins, Tim Deegan, George Dunlap, xen-devel

>>> On 07.06.16 at 17:54, <dario.faggioli@citrix.com> wrote:
> So, it looks to me that the TSC is actually ok, and it could be the
> local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
> (which, AFAIUI, is there to convert cycles to time) that does not
> guarantee the result to be monotonic, even if the input is... Thoughts?

Indeed this smells like an issue in the scaling. However, the scaling
values vary only when !CONSTANT_TSC. Can you check that this
flag is actually set on that system? (I hope you're not running a
strange Dom0 setup with FREQCTL_dom0_kernel in effect.) And
at the same time that it's time_calibration_tsc_rendezvous() that
is in use?

Yet when the scaling values get set only once at boot, monotonic
(cross-CPU) TSC means monotonic (cross-CPU) returns from NOW().

In any event - could you try to exclude C- and P-state effects? Of
course that makes sense only if you can reasonably repro the
problem situation (and hence can tell from its absence over a certain
period of time that whatever change was done did have some
positive effect).

How big a system are we talking about? I'm asking to assess the
overhead of adding some cross-CPU checks to get_s_time_fixed()
(in a debugging patch), logging relevant values if non-monotonic
output gets detected. (On too big a system, the overhead here
might end up masking the problem.)
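
FWIW, a minimal sketch of the kind of check meant here (everything
below is hypothetical debug code, not an existing Xen interface):
record the last value each CPU returned from NOW(), and complain
whenever a CPU hands out something older than what another CPU has
already seen.

 static s_time_t last_stime[NR_CPUS];

 static s_time_t checked_now(void)
 {
     unsigned int cpu = smp_processor_id(), i;
     s_time_t now = NOW();

     /* Racy by design; good enough to catch multi-microsecond warps. */
     for_each_online_cpu ( i )
         if ( last_stime[i] > now )
             printk("cpu%u: NOW()=%ld is %ldns behind cpu%u\n",
                    cpu, (long)now, (long)(last_stime[i] - now), i);

     last_stime[cpu] = now;
     return now;
 }

On a big box, the O(nr_cpus) scan on every call is exactly the
overhead concern above.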

Jan



* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-08 10:42 ` Jan Beulich
@ 2016-06-08 13:43   ` Dario Faggioli
  2016-06-08 14:01     ` Jan Beulich
  2016-06-08 20:36     ` Joao Martins
  0 siblings, 2 replies; 11+ messages in thread
From: Dario Faggioli @ 2016-06-08 13:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Joao Martins, Tim Deegan, George Dunlap, xen-devel


On Wed, 2016-06-08 at 04:42 -0600, Jan Beulich wrote:
> > > > On 07.06.16 at 17:54, <dario.faggioli@citrix.com> wrote:
> > So, it looks to me that the TSC is actually ok, and it could be the
> > local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
>> (which, AFAIUI, is there to convert cycles to time) that does not
> > guarantee the result to be monotonic, even if the input is...
> > Thoughts?
> Indeed this smells like an issue in the scaling. However, the scaling
> values vary only when !CONSTANT_TSC. Can you check that this
> flag is actually set on that system? 
>
Checked. I do have it. I instrumented the code to print something if
it finds the flag, and it does print.

Also:
root@Zhaman:~# xl debug-keys s
(XEN) [  406.719464] TSC marked as reliable, warp = 0 (count=3)
(XEN) [  406.719467] dom1(hvm): mode=0,ofs=0xffffd9279716c276,khz=2394069,inc=4,vtsc count: 195367 kernel, 0 user

> (I hope you're not running a
> strange Dom0 setup with FREQCTL_dom0_kernel in effect.) 
>
I've no idea what this is. I've been running a 4.1.0 dom0 kernel,
built myself, and a stock Debian unstable 4.5.0 one, and I'm seeing
the issue in both cases.

Looking FREQCTL_dom0_kernel up, I guess you mean what happens when one
passes cpufreq="dom0-kernel" to Xen on the boot command line. In which
case, no, I'm not doing that.

> And
> at the same time that it's time_calibration_tsc_rendezvous() that
> is in use?
> 
The code you're referring to should be this:

    /* If we have constant-rate TSCs then scale factor can be shared. */
    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
    {
        /* If TSCs are not marked as 'reliable', re-sync during rendezvous. */
        if ( !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
            time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
    }

And I have both X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE.

I've again instrumented the code in order to check whether it is
time_calibration_tsc_rendezvous() or time_calibration_std_rendezvous()
that is being used, and it's the latter:

(XEN) [    1.795916] TSC reliable. Yay!! Using ffff82d080198362 for rendevousez

[dario@Solace xen.git] $ objdump -D xen/xen-syms |grep ffff82d080198362
ffff82d080198362 <time_calibration_std_rendezvous>:

which looks correct to me.

> Yet when the scaling values get set only once at boot, monotonic
> (cross-CPU) TSC means monotonic (cross-CPU) returns from NOW().
> 
Yep. And at this point, this is what needs to be verified, I guess...

> In any event - could you try to exclude C- and P-state effects? Of
> course that makes sense only if you can reasonably repro the
> problem situation (and hence can tell from its absence over a certain
> period of time that whatever change was done did have some
> positive effect).
> 
It's actually quite hard *NOT* to reproduce the problem... it happens
all the time, and if the load is high enough, I see the "Time went
backwards?" printk several times per second!

So, trying to do what you suggest in an online fashion, i.e., issuing:

 # xenpm set-max-cstate 0
 # xenpm set-scaling-governor all performance

did not change the situation (I still see the printks).

I've then tried passing both cpufreq="none" and max_cstate=0 to Xen at
boot, but neither made any difference.

Most of the time, we're speaking of very small skews, e.g.:

(XEN) [   59.999959] WARNING: __update_runq_load: Time went backwards? now 59999946079 llu 59999946085
(XEN) [  117.595508] WARNING: __update_runq_load: Time went backwards? now 117595495295 llu 117595495310

i.e., 6 nanoseconds or 15 nanoseconds!

Then there are instances where the difference is bigger (on the
microsecond time scale, like in the first email of the thread).

> How big of system are we talking about? I'm asking to assess the
> overhead of adding some cross-CPU checks to get_s_time_fixed()
> (in a debugging patch), logging relevant values if non-monotonic
> output gets detected. (On too big a system, the overhead here
> might end up masking the problem.)
> 
Yeah, I sort of tried doing something like that already, but I was
logging the wrong thing (I was not yet suspecting a problem with the
scaling).

I can try putting something together again.

In any case, the system I use most for testing is a 16-CPU (in 2 NUMA
nodes) Xeon. But I see the issue even within the same socket, so I can
easily reduce to using a subset of the available processors.

Thanks for your time. :-)
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-08 13:43   ` Dario Faggioli
@ 2016-06-08 14:01     ` Jan Beulich
  2016-06-08 20:36     ` Joao Martins
  1 sibling, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2016-06-08 14:01 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andrew Cooper, Joao Martins, Tim Deegan, George Dunlap, xen-devel

>>> On 08.06.16 at 15:43, <dario.faggioli@citrix.com> wrote:
> On Wed, 2016-06-08 at 04:42 -0600, Jan Beulich wrote:
>> How big a system are we talking about? I'm asking to assess the
>> overhead of adding some cross-CPU checks to get_s_time_fixed()
>> (in a debugging patch), logging relevant values if non-monotonic
>> output gets detected. (On too big a system, the overhead here
>> might end up masking the problem.)
>> 
> Yeah, I sort of tried doing something like that already, but was
> logging the wrong thing (I was not yet suspecting a problem with
> scaling).

Yeah, I did also realize that, with you observing the problem even
within a core, the cross-CPU checking could be limited to just the
other SMT sibling, which should bring the overhead down quite a
bit. OTOH, with it - as you say - often being just a couple of nano-
seconds, even that may already be too much.

Speaking of overhead: The tracing itself that you use does not
influence the picture in any way, does it?

Considering how easily you say you see the problem, it ought to
be more widespread. I'm considering trying some such instrumentation
myself, to see whether I can see the issue anywhere.

Jan



* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-08 13:43   ` Dario Faggioli
  2016-06-08 14:01     ` Jan Beulich
@ 2016-06-08 20:36     ` Joao Martins
  2016-06-09  8:05       ` Jan Beulich
  1 sibling, 1 reply; 11+ messages in thread
From: Joao Martins @ 2016-06-08 20:36 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Andrew Cooper, Tim Deegan, George Dunlap, Jan Beulich, xen-devel

On 06/08/2016 02:43 PM, Dario Faggioli wrote:
> On Wed, 2016-06-08 at 04:42 -0600, Jan Beulich wrote:
>>>>> On 07.06.16 at 17:54, <dario.faggioli@citrix.com> wrote:
>>> So, it looks to me that the TSC is actually ok, and it could be the
>>> local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
>>> (which, AFAIUI, is there to convert cycles to time) that does not
>>> guarantee the result to be monotonic, even if the input is...
>>> Thoughts?
>> Indeed this smells like an issue in the scaling. However, the scaling
>> values vary only when !CONSTANT_TSC. Can you check that this
>> flag is actually set on that system? 
>>
> [...]
> 
>> And
>> at the same time that it's time_calibration_tsc_rendezvous() that
>> is in use?
>>
> The code you're referring to should be this:
> 
>     /* If we have constant-rate TSCs then scale factor can be shared. */
>     if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>     {
>         /* If TSCs are not marked as 'reliable', re-sync during rendezvous. */
>         if ( !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
>             time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
>     }
> 
> And I have both X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE.
> 
> I've again instrumented the code in order to check whether it is
> time_calibration_tsc_rendezvous() or time_calibration_std_rendezvous()
> that is being used, and it's the latter:
> 
> (XEN) [    1.795916] TSC reliable. Yay!! Using ffff82d080198362 for rendevousez
> 
> [dario@Solace xen.git] $ objdump -D xen/xen-syms |grep ffff82d080198362
> ffff82d080198362 <time_calibration_std_rendezvous>:
> 
> which looks correct to me.
Probably one other codepath you would be interested in is
local_time_calibration(): try commenting out the chunk under this
conditional:

if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
{
...
}

And with that you rarely see time going backwards, since the scaling would be
adjusted to each CPU's calibration error - but that would be a (very, very
close) software approximation, not exactly guaranteed the way the hardware TSC
is. Trying this out would be just for experimentation, though - it doesn't
look to be a correct way of addressing this.

> 
>> Yet when the scaling values get set only once at boot, monotonic
>> (cross-CPU) TSC means monotonic (cross-CPU) returns from NOW().
>>
> Yep. And at this point, this is what needs to be verified, I guess...
I think get_s_time_fixed() doesn't guarantee monotonicity across CPUs, be they
on different sockets or (SMT) siblings. local_tsc_stamp is seeded differently
on each CPU, i.e., with an rdtsc() right after the platform time read in the
calibration rendezvous. Plus, stime_local_stamp is seeded with values taken
from the platform timer (HPET, ACPI, PIT) in local_time_calibration(), which
means that get_s_time() isn't solely based on the TSC, and that there will
always be a gap between stime_local_stamp and local_tsc_stamp. The TSC is
indeed monotonic on a TSC-invariant box, but the delta that is computed
differs from CPU to CPU, according to when time calibration happens on each
CPU - thus not guaranteeing the desired monotonicity property. Having
stime_local_stamp be based on the same timestamp as local_tsc_stamp, plus
having a single local_tsc_stamp as reference, would address this behaviour -
see also below.
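
To make the failure mode concrete (numbers entirely invented for
illustration), each CPU effectively computes:

 NOW() = stime_local_stamp + (rdtsc() - local_tsc_stamp) * scale

Suppose scale is exactly 1ns/cycle, and both CPUs read tsc = 2000 at
the very same instant:

 CPU2: pair (tsc_stamp=1000, stime_stamp=100000) -> NOW() = 101000
 CPU3: pair (tsc_stamp=1010, stime_stamp=100040) -> NOW() = 101030

The TSCs are perfectly synchronized, yet the two CPUs disagree by
30ns, simply because each (tsc, stime) pair was sampled at a different
moment against a different reference. Code running on CPU2 right after
CPU3 can then observe time going backwards.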

> 
>> In any event - could you try to exclude C- and P-state effects? Of
>> course that makes sense only if you can reasonably repro the
>> problem situation (and hence can tell from its absence over a certain
>> period of time that whatever change was done did have some
>> positive effect).
>>
> It's actually quite hard *NOT* to reproduce the problem... it happens
> all the time, and if the load is high enough, I see the "Time went
> backwards?" printk several times per second!
> 
> So, trying to do what you suggest in an online fashion, i.e., issueing:
> 
>  # xenpm set-max-cstate 0
>  # xenpm set-scaling-governor all performance
> 
> did not change the situation (I still see the printks).
> 
> I've then tried passing both cpufreq="none" and max_cstate=0 to xen at
> boot, but they made no difference at all either.
> 
> Most of the time, we're speaking of very small skews, e.g.:
> 
> (XEN) [   59.999959] WARNING: __update_runq_load: Time went backwards? now 59999946079 llu 59999946085
> (XEN) [  117.595508] WARNING: __update_runq_load: Time went backwards? now 117595495295 llu 117595495310
> 
> i.e., 6 nanoseconds or 15 nanoseconds!
> 
> Then there are instances where the difference is bigger (microseconds
> time scale, like in the first email of the thread).
It does sound similar to the cross-CPU calibration error I was observing, but
in a place other than the guests' pvclock time info updates. My pvclock TSC
stable bit series could address your issue - but only for the clocksource=tsc
mode added there, so it would be quite limiting for the other cases, i.e.,
clocksource=hpet|acpi|etc.

One patch in the RFC of that series could fix your issue, but it was more of a
hacky patch for discussion [0]. It takes a value returned from the platform
timer (which monotonically increases) and scales it to the equivalent in TSC,
assuming the TSC is always ahead of the other clocksources. But probably the
correct fix would be to do as Linux does with pvclock? Linux keeps an
accumulator representing the latest system time seen across CPUs, and
atomically updates it if now > last, thus always returning the most up-to-date
value. But then NOW() would take a performance hit.

[0] http://lists.xenproject.org/archives/html/xen-devel/2015-12/msg02832.html
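
A sketch of that Linux-style accumulator (it mirrors how
pvclock_clocksource_read() maintains last_value in Linux; raw_now()
stands in for the current, possibly non-monotonic NOW(), so this is
illustrative rather than proposed code):

 static u64 last_value;

 static u64 monotonic_now(void)
 {
     u64 now = raw_now();
     u64 last;

     do {
         last = last_value;  /* racy read; cmpxchg() below validates it */
         if ( now <= last )
             return last;    /* someone already returned a later time */
     } while ( cmpxchg(&last_value, last, now) != last );

     return now;
 }

Every call then pays for an atomic on a globally shared cacheline,
which is where the performance hit comes from.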

Cheers,
Joao


* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-08 20:36     ` Joao Martins
@ 2016-06-09  8:05       ` Jan Beulich
  2016-06-09 10:24         ` Joao Martins
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2016-06-09  8:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: Andrew Cooper, Dario Faggioli, Tim Deegan, George Dunlap, xen-devel

>>> On 08.06.16 at 22:36, <joao.m.martins@oracle.com> wrote:
> On 06/08/2016 02:43 PM, Dario Faggioli wrote:
>> The code you're referring to should be this:
>> 
>>     /* If we have constant-rate TSCs then scale factor can be shared. */
>>     if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>     {
>>         /* If TSCs are not marked as 'reliable', re-sync during rendezvous. */
>>         if ( !boot_cpu_has(X86_FEATURE_TSC_RELIABLE) )
>>             time_calibration_rendezvous_fn = time_calibration_tsc_rendezvous;
>>     }
>> 
>> And I have both X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE.
>> 
>> I've again instrumented the code in order to check whether it is
>> time_calibration_tsc_rendezvous() or time_calibration_std_rendezvous()
>> that is being used, and it's the latter:
>> 
>> (XEN) [    1.795916] TSC reliable. Yay!! Using ffff82d080198362 for rendevousez
>> 
>> [dario@Solace xen.git] $ objdump -D xen/xen-syms |grep ffff82d080198362
>> ffff82d080198362 <time_calibration_std_rendezvous>:
>> 
>> which looks correct to me.
> Probably one other codepath you would be interested in is
> local_time_calibration(): try commenting out the chunk under this
> conditional:
> 
> if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
> {
> ...
> }
> 
> And with that you rarely see time going backwards, since the scaling
> would be adjusted to each CPU's calibration error - but that would be
> a (very, very close) software approximation, not exactly guaranteed
> the way the hardware TSC is. Trying this out would be just for
> experimentation, though - it doesn't look to be a correct way of
> addressing this.

Wait - I would think this should make the situation worse, not better.
If we let execution flow past this conditional, ->tsc_scale will get
updated, whereas when the if() path above is taken, all CPUs will
always use the same scale values. If TSC values are monotonic and
the scaling works identically on all CPUs, the scaled values _should_
be monotonic too. Yet they aren't, so some piece is still missing in this
picture.

>>> Yet when the scaling values get set only once at boot, monotonic
>>> (cross-CPU) TSC means monotonic (cross-CPU) returns from NOW().
>>>
>> Yep. And at this point, this is what needs to be verified, I guess...
> I think get_s_time_fixed() doesn't guarantee monotonicity across CPUs,
> be they on different sockets or (SMT) siblings. local_tsc_stamp is
> seeded differently on each CPU, i.e., with an rdtsc() right after the
> platform time read in the calibration rendezvous. Plus,
> stime_local_stamp is seeded with values taken from the platform timer
> (HPET, ACPI, PIT) in local_time_calibration(), which means that
> get_s_time() isn't solely based on the TSC, and that there will always
> be a gap between stime_local_stamp and local_tsc_stamp. The TSC is
> indeed monotonic on a TSC-invariant box, but the delta that is computed
> differs from CPU to CPU, according to when time calibration happens on
> each CPU - thus not guaranteeing the desired monotonicity property.
> Having stime_local_stamp be based on the same timestamp as
> local_tsc_stamp, plus having a single local_tsc_stamp as reference,
> would address this behaviour - see also below.

The quality of get_s_time_fixed() output indeed heavily depends on
t->local_tsc_stamp and t->stime_local_stamp being a properly
matched pair. Yet in local_time_calibration()'s constant-TSC case,
they're a direct copy from the respective cpu_calibration fields. The
main issue I could see here is that on SMT siblings the hardware
switching between the two may introduce arbitrary delays between
them. And with CPU frequency changes, the latency between the
rdtsc() and the execution of get_s_time() finishing could also be
pretty variable. I wonder whether c->local_tsc_stamp wouldn't
better be written with the TSC value used by get_s_time() (or,
which should amount to the same effect, whether we shouldn't
simply call get_s_time_fixed() here with the just sampled TSC value).
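
In code, that suggestion amounts to something like this in the
calibration path (just a sketch of the idea, not a tested patch):

 /* Sample the TSC once, and derive the matching system time from that
  * very value, so the (tsc, stime) pair is consistent by construction. */
 c->local_tsc_stamp = rdtsc();
 c->stime_local_stamp = get_s_time_fixed(c->local_tsc_stamp);

instead of reading the TSC and computing the system time at two
different points.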

Jan



* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-07 15:54 BUG: NOW() seems to (sometimes) go backwards! Dario Faggioli
  2016-06-08 10:42 ` Jan Beulich
@ 2016-06-09 10:19 ` George Dunlap
  1 sibling, 0 replies; 11+ messages in thread
From: George Dunlap @ 2016-06-09 10:19 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel
  Cc: Andrew Cooper, Joao Martins, Tim Deegan, Jan Beulich

On 07/06/16 16:54, Dario Faggioli wrote:
> [...]
> 
> So, it looks to me that the TSC is actually ok, and it could be the
> local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
> (which, AFAIUI, is there to convert cycles to time) that does not
> guarantee the result to be monotonic, even if the input is... Thoughts?

What about adding a trace point inside NOW() to record the various
inputs into the calculation?

Adding some sort of global ASSERT() that time doesn't go backwards by
too much would be nice, but I think it would require spinlocks -- OTOH,
just for debugging purposes it might not be that bad.
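
A sketch of such an ASSERT() (debug-only; MAX_WARP and the helper
itself are made up for illustration, and the spinlock serializing
every call is precisely the overhead being worried about):

 static DEFINE_SPINLOCK(now_lock);
 static s_time_t max_now;

 static s_time_t asserted_now(void)
 {
     s_time_t now;

     spin_lock(&now_lock);
     now = NOW();
     /* Tolerate tiny skews; trip on anything worse. */
     ASSERT(now + MAX_WARP >= max_now);
     if ( now > max_now )
         max_now = now;
     spin_unlock(&now_lock);

     return now;
 }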

 -George



* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-09  8:05       ` Jan Beulich
@ 2016-06-09 10:24         ` Joao Martins
  2016-06-09 10:43           ` Joao Martins
  2016-06-09 11:25           ` Jan Beulich
  0 siblings, 2 replies; 11+ messages in thread
From: Joao Martins @ 2016-06-09 10:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Dario Faggioli, Tim Deegan, George Dunlap, xen-devel

>> [...]
> 
> The quality of get_s_time_fixed() output indeed heavily depends on
> t->local_tsc_stamp and t->stime_local_stamp being a properly
> matched pair. Yet in local_time_calibration()'s constant-TSC case,
> they're a direct copy from the respective cpu_calibration fields. The
> main issue I could see here is that on SMT siblings the hardware
> switching between the two may introduce arbitrary delays between
> them. And with CPU frequency changes, the latency between the
> rdtsc() and the execution of get_s_time() finishing could also be
> pretty variable. I wonder whether c->local_tsc_stamp wouldn't
> better be written with the TSC value used by get_s_time() (or,
> which should amount to the same effect, whether we shouldn't
> simply call get_s_time_fixed() here with the just sampled TSC value).
Indeed, but notice that, in this copy for the constant-TSC case,
t->stime_local_stamp is written with c->stime_master_stamp - ending
up with the former being discarded. So even changing that pair to
correctly match wouldn't change the result. At which point I wonder:
is copying the rendezvous' c->stime_master_stamp to
t->stime_local_stamp in local_time_calibration() correct at all?

Joao


* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-09 10:24         ` Joao Martins
@ 2016-06-09 10:43           ` Joao Martins
  2016-06-09 11:25           ` Jan Beulich
  1 sibling, 0 replies; 11+ messages in thread
From: Joao Martins @ 2016-06-09 10:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Dario Faggioli, Tim Deegan, George Dunlap, xen-devel



On 06/09/2016 11:24 AM, Joao Martins wrote:
> [...]
> 
> Indeed, but notice that, in this copy for the constant-TSC case,
> t->stime_local_stamp is written with c->stime_master_stamp - ending
> up with the former being discarded. So even changing that pair to
> correctly match wouldn't change the result. At which point I wonder:
> is copying the rendezvous' c->stime_master_stamp to
> t->stime_local_stamp in local_time_calibration() correct at all?
> 
This first sentence was a bit cryptic: by "the former", I actually
meant c->stime_local_stamp (the cpu_calibration field).

Joao


* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-09 10:24         ` Joao Martins
  2016-06-09 10:43           ` Joao Martins
@ 2016-06-09 11:25           ` Jan Beulich
  2016-06-09 11:31             ` Jan Beulich
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2016-06-09 11:25 UTC (permalink / raw)
  To: Joao Martins
  Cc: Andrew Cooper, Dario Faggioli, Tim Deegan, George Dunlap, xen-devel

>>> On 09.06.16 at 12:24, <joao.m.martins@oracle.com> wrote:
> [...]
> Indeed, but notice that, in this copy for the constant-TSC case,
> t->stime_local_stamp is written with c->stime_master_stamp - ending
> up with the former being discarded. So even changing that pair to
> correctly match wouldn't change the result. At which point I wonder:
> is copying the rendezvous' c->stime_master_stamp to
> t->stime_local_stamp in local_time_calibration() correct at all?

Yeah, I stumbled across that too, and I am in the process of checking
whether this was in fact just a copy-and-paste mistake made many years
ago.

Jan



* Re: BUG: NOW() seems to (sometimes) go backwards!
  2016-06-09 11:25           ` Jan Beulich
@ 2016-06-09 11:31             ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2016-06-09 11:31 UTC (permalink / raw)
  To: Joao Martins
  Cc: Andrew Cooper, Dario Faggioli, Tim Deegan, George Dunlap, xen-devel

>>> On 09.06.16 at 13:25, <JBeulich@suse.com> wrote:
>>>> On 09.06.16 at 12:24, <joao.m.martins@oracle.com> wrote:
>> [...]
>> Indeed, but notice that, in this copy for the constant-TSC case,
>> t->stime_local_stamp is written with c->stime_master_stamp - ending
>> up with the former being discarded. So even changing that pair to
>> correctly match wouldn't change the result. At which point I wonder:
>> is copying the rendezvous' c->stime_master_stamp to
>> t->stime_local_stamp in local_time_calibration() correct at all?
> 
> Yeah, I stumbled across that too, and I am in the process of checking
> whether this was in fact just a copy-and-paste mistake made many years
> ago.

And indeed changing that makes all of the dozens-of-microseconds
backwards moves (compared cross-CPU) go away.

The largest ones left are now around 3.7us, and apparently always
between CPU0 and CPU1 (which I suspect may be due to the special
role CPU0 plays here).
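
For reference, the change being tested boils down to a one-line swap
in the constant-TSC branch of local_time_calibration() (sketched here;
whether this is the exact final patch is still open at this point in
the thread):

 -    t->stime_local_stamp  = c->stime_master_stamp;
 +    t->stime_local_stamp  = c->stime_local_stamp;

i.e., pair the CPU's local TSC stamp with the system time stamp that
was taken against it, rather than with the master time.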

Jan

