xen-devel.lists.xenproject.org archive mirror
* BUG: NOW() seems to (sometimes) go backwards!
@ 2016-06-07 15:54 Dario Faggioli
  2016-06-08 10:42 ` Jan Beulich
  2016-06-09 10:19 ` George Dunlap
  0 siblings, 2 replies; 11+ messages in thread
From: Dario Faggioli @ 2016-06-07 15:54 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Joao Martins, Tim Deegan, George Dunlap, Jan Beulich



Hi,

I've been fighting with this during the past few days, while testing
and benchmarking some of the modification I'm doing to Credit2.

In summary, what I see is NOW() going backwards. The issue shows up
when I call NOW() on different pCPUs and compare the results (which is
something the Credit2 scheduler does a lot, and I would like it to be
able to continue doing so. :-D).

I can elaborate on why this is really really bad for scheduling in
general, and even more so for Credit2. For now, let me just say it's
pretty bad. Feel free to ask for details if interested.

In any case, IMO, we need to fix this.

It does not only happen cross-socket; it happens even when the two
pCPUs are part of the same socket (in the example shown below, they're
two SMT threads of the same core!!).

I'm able to trigger this on more than one box, all with invariant
TSC support. (All Intel, though; I don't have any AMD test hardware
available right now.)

In some more detail:

The scheduler (both in schedule.c and in sched_credit{,2}.c) uses
NOW() for all its timing and time sampling. In the trace below, at time
T1 (equal to 1.264582607), we do this:

 d0v9->start_time = NOW()

and we are on pCPU 3.

Then, at time T2, on pCPU2, there is this:

 if ( d0v9->start_time > NOW() )
 {
   /* OMG!! */
   __trace("csched2:time_backwards ...")
 }

and we indeed hit the __trace().

What the trace point says is that NOW() was 1972180296557 at time T1,
on pCPU 3, and it is 1972180270044 at time T2 on pCPU2 (i.e., ~26.5
microsecs in the past).

I speak about NOW() going backwards because, looking at both the
code and the trace, I am positive that T2 happens later than T1 (see
[1], if interested)... So why is NOW() smaller at T2? :-(

   1.264577304 ||x||||||||||||| d0v10 runstate_change d0v10 running->runnable
   1.264577611 ||x||||||||||||| d?v? runstate_change d0v5 runnable->running
   1.264580025 |||x||||||-||||| d0v0 vcpu_block d0v0
   1.264580795 |||x||||||-||||| d0v0 csched2:schedule current = d0v0, now = 1972180296557
   1.264581014 |||x||||||-||||| d0v0 csched2:burn_credits d0v0, credit = 9091956, delta = 39433, start_time = 1972180296557
   1.264581660 |||x||||||-||||| d0v0 sched_switch prev d0, run for 112.503us
   1.264581809 |||x||||||-||||| d0v0 sched_switch next d0, was runnable for 54.532us, next slice 2000.0us
   1.264582033 |||x||||||-||||| d0v0 sched_switch prev d0v0 next d0v9
   1.264582269 |||x||||||-||||| d0v0 runstate_change d0v0 running->blocked
T1 1.264582607 |||x||||||-||||| d?v? runstate_change d0v9 runnable->running
   1.264583109 ||x|||||||-||||| d0v5 csched2:runq_insert d0v10, position 0
   1.264583341 ||x|||||||-||||| d0v5 csched2:runq_tickle_new d0v10, credit = 5691963
   1.264583907 ||x|||||||-||||| d0v5 csched2:burn_credits d0v5, credit = 8575685, delta = 8241, start_time = 1972180270044
   1.264584236 ||x|||||||-||||| d0v5 csched2:burn_credits d1v1, credit = 8696613, delta = 12920, start_time = 1972180270044
   1.264584431 ||x|||||||-||||| d0v5 csched2:tickle_check d1v1, credit = 8696613
   1.264584745 ||x|||||||-||||| d0v5 csched2:burn_credits d1v3, credit = 10182550, delta = 12920, start_time = 1972180270044
   1.264584966 ||x|||||||-||||| d0v5 csched2:tickle_check d1v3, credit = 10182550
T2 1.264585207 ||x|||||||-||||| d0v5 csched2:time_backwards d0v9: cur_time = 1972180270044, cur_cpu = 2, old_time = 1972180296557, old_cpu = 3
   1.264585364 ||x|||||||-||||| d0v5 csched2:burn_credits d0v9, credit = 5759499, delta = -26513, start_time = 1972180270044
   1.264585537 ||x|||||||-||||| d0v5 csched2:tickle_check d0v9, credit = 5759499
   1.264585876 ||x|||||||-||||| d0v5 csched2:burn_credits d0v7, credit = 8835314, delta = 12920, start_time = 1972180270044
   1.264586041 ||x|||||||-||||| d0v5 csched2:tickle_check d0v7, credit = 8835314
   1.264586491 ||x|||||||-||||| d0v5 csched2:burn_credits d0v15, credit = 5822635, delta = 12920, start_time = 1972180270044
   1.264586655 ||x|||||||-||||| d0v5 csched2:tickle_check d0v15, credit = 5822635
   1.264586952 ||x|||||||-||||| d0v5 csched2:burn_credits d0v13, credit = 7956174, delta = 12920, start_time = 1972180270044
   1.264587115 ||x|||||||-||||| d0v5 csched2:tickle_check d0v13, credit = 7956174
   1.264587385 ||x|||||||-||||| d0v5 csched2:burn_credits d0v3, credit = 7870338, delta = 12920, start_time = 1972180270044
   1.264587545 ||x|||||||-||||| d0v5 csched2:tickle_check d0v3, credit = 7870338
   1.264595859 |||x||||||-||||| d0v9 csched2:schedule current = d0v9, now = 1972180311345
   1.264596313 |||x||||||-||||| d0v9 csched2:burn_credits d0v9, credit = 5718198, delta = 4877, start_time = 1972180311345

Interestingly enough, if we look at (more or less) the same set of
events with just xentrace_format, after sorting the records by their
timestamps, we get this:

   CPU2  5613124777311 (+     507)  running_to_runnable [ dom:vcpu = 0x0000000a ]
   CPU2  5613124778046 (+     735)  runnable_to_running [ dom:vcpu = 0x00000005 ]

   CPU3  5613124788657 (+     537)  __enter_scheduler [ prev<dom:vcpu> = 0x00000000, next<dom:vcpu> = 0x00000009 ]
   CPU3  5613124789224 (+     567)  running_to_blocked  [ dom:vcpu = 0x00000000 ]
T1 CPU3  5613124790034 (+     810)  runnable_to_running [ dom:vcpu = 0x00000009 ]

   CPU2  5613124791240 (+   13194)  csched2:runq_pos       [ dom:vcpu = 0x0000000a, pos = 0]
   CPU2  5613124791795 (+     555)  csched2:tickle_new     [ dom:vcpu = 0x0000000a, credit = 0x5691963 ]
   CPU2  5613124793154 (+    1359)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 5 ]
   CPU2  5613124793943 (+     789)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65537 ]
   CPU2  5613124794411 (+     468)  csched2:tickle_check   [ dom:vcpu = 0x00010001, credit = 8696613 ]
   CPU2  5613124795164 (+     753)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65539 ]
   CPU2  5613124795695 (+     531)  csched2:tickle_check   [ dom:vcpu = 0x00010003, credit = 10182550 ]
T2 CPU2  5613124796274 (+     579)  unknown (0x000000000002221e)  [ 0x2f1abbdc 0x000001cb 0x2f1b236d 0x000001cb 0x00000009 0x00030002 0x00000000 ]
   CPU2  5613124796649 (+     375)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 9 ]
   CPU2  5613124797066 (+     417)  csched2:tickle_check   [ dom:vcpu = 0x00000009, credit = 5759499 ]
   CPU2  5613124797879 (+     813)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 7 ]
   CPU2  5613124798275 (+     396)  csched2:tickle_check   [ dom:vcpu = 0x00000007, credit = 8835314 ]
   CPU2  5613124799355 (+    1080)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 15 ]
   CPU2  5613124799748 (+     393)  csched2:tickle_check   [ dom:vcpu = 0x0000000f, credit = 5822635 ]
   CPU2  5613124800459 (+     711)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 13 ]
   CPU2  5613124800852 (+     393)  csched2:tickle_check   [ dom:vcpu = 0x0000000d, credit = 7956174 ]
   CPU2  5613124801500 (+     648)  csched2:credit burn    [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 3 ]
   CPU2  5613124801884 (+     384)  csched2:tickle_check   [ dom:vcpu = 0x00000003, credit = 7870338 ]

   CPU3  5613124803177 (+   13143)  csched2:update_load
   CPU3  5613124803786 (+     609)  unknown (0x0000000000022220)  [ 0x0000d61e 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
   CPU3  5613124804560 (+     774)  csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x12000009, rq_avgload = 0x0000000000216566, b_avgload = 0x0000000000216566,
   CPU3  5613124804929 (+     369)  unknown (0x0000000000022220)  [ 0x0001e8de 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
   CPU3  5613124805256 (+     327)  csched2:updt_vcpu_load [ dom:vcpu = 0x00000000, vcpuload = 0x0000000000026040, wshift = 18 ]

   CPU2  5613124975596 (+  173712)  domain_wake       [ dom:vcpu = 0x00010005 ]

Interestingly, the timestamp at T1 is 5613124790034, which looks
_correctly_ smaller than the timestamp at T2 (5613124796274).

NOW(), as it is called from the scheduling code, does this:
 s_time_t get_s_time_fixed(u64 at_tsc)  /* what NOW() boils down to; at_tsc == 0 */
 {
     struct cpu_time *t = &this_cpu(cpu_time);
     u64 tsc, delta;
     s_time_t now;

     tsc = rdtsc();
     delta = tsc - t->local_tsc_stamp;
     now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);

     return now;
 }

While the timestamps in the trace are the result of get_cycles():
 static inline cycles_t get_cycles(void)
 {
     return rdtsc();
 }
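For reference, scale_delta() amounts to a shift followed by a
fixed-point multiply by mul_frac (a fraction scaled by 2^32). Here is a
sketch of my understanding of it, not the in-tree code: the real version
does the 64x32->96 bit multiply in assembly, while this one uses a
128-bit temporary as a shortcut:

```c
#include <stdint.h>

/* Fixed-point cycles->nanoseconds conversion factor, along the lines of
 * Xen's struct time_scale: ns = (cycles << shift) * mul_frac / 2^32. */
struct time_scale {
    int shift;
    uint32_t mul_frac;
};

/* Portable sketch of scale_delta(); the in-tree version performs the
 * 64x32->96 bit multiply in inline assembly rather than via __int128. */
uint64_t scale_delta(uint64_t delta, const struct time_scale *scale)
{
    if ( scale->shift < 0 )
        delta >>= -scale->shift;
    else
        delta <<= scale->shift;

    /* Keep the top 64 bits of the 96-bit product. */
    return (uint64_t)(((unsigned __int128)delta * scale->mul_frac) >> 32);
}
```

E.g., for a 2GHz TSC one would have shift = 0 and mul_frac = 2^31
(half a nanosecond per cycle).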

So, it looks to me that the TSC itself is actually fine, and that it
could be the local_tsc_stamp and scale_delta() magic done in
get_s_time_fixed() (which, AFAIUI, is there to convert cycles to time)
that does not guarantee the result to be monotonic, even if the input
is... Thoughts?
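To make the suspicion concrete: even with a perfectly monotonic,
synchronized TSC, two pCPUs whose (local_tsc_stamp, stime_local_stamp)
pairs are mutually inconsistent will disagree on what time it is. A toy
model (hypothetical numbers of my own; 1 cycle == 1 ns, so no
scale_delta() involved):

```c
#include <stdint.h>

/* Toy model of the per-CPU calibration state in struct cpu_time
 * (1 cycle == 1 ns, so tsc_scale is the identity and is omitted). */
struct toy_cpu_time {
    uint64_t local_tsc_stamp;    /* TSC value at last calibration */
    uint64_t stime_local_stamp;  /* system time at last calibration (ns) */
};

/* What get_s_time_fixed() computes, in the toy model. */
uint64_t toy_now(const struct toy_cpu_time *t, uint64_t tsc)
{
    return t->stime_local_stamp + (tsc - t->local_tsc_stamp);
}
```

If pCPU3 was calibrated as (1000, 1000) but pCPU2 as (2000, 1970),
i.e., with its stime stamp 30ns behind where it should be, then a read
on pCPU3 at TSC 5000 yields 5000, while a *later* read on pCPU2 at TSC
5010 yields 4980: time went backwards by 20ns despite the TSC moving
forward, which is exactly the shape of what the trace shows.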

Archeology says that the current shape of get_s_time_fixed() comes
from:
commit bbb0eb32255d1a3e97994d9c57ca5e3d50de4298
commit 8cf6cfe286aa6949f9bd546d03e275e21e9723db

But the commit messages are not particularly helpful, and I'm not
familiar at all with this code. I'll keep investigating but, in the
meanwhile, any thoughts and ideas are more than welcome.

Thanks and Regards,
Dario

[1] So, all the various functions/chunks of code of interest here are
serialized by the same spinlock (there is a runqueue for the whole
socket). Considering both the code and the trace, the flow of execution
is this:

                   CPU2                  CPU3
                     |                     |
             csched2_schedule()            |
              spin_lock(rq->lock)          |
1.264577304   d0v10 --> runnable           |
1.264577611   current <-- d0v5             |
              spin_unlock(rq->lock)        |
                     |                     |
1.264580025          |              vcpu_block(d0v0)
                     |               spin_lock(rq->lock)
                     |               spin_unlock(rq->lock)
1.264580795          |             csched2_schedule()
                     |              spin_lock(rq->lock)
                     |              current <-- d0v9
1.264582607          |              d0v9->start_time = NOW()  (***T1)
                     |              spin_unlock(rq->lock)
                     |                     |
             context_saved(d0v10)          |
              spin_lock(rq->lock);         |
1.264583109   runq_insert(d0v10)           |
1.264583341   runq_tickle_new(d0v10)       |
1.264583907   burn_credits(d0v5)           |
1.264584236   burn_credits(d1v1)           |
1.264584431   tickle_check(d1v1)           |
1.264584745   burn_credits(d1v3)           |
1.264584966   tickle_check(d1v3)           |
1.264585207   !!time_backwards(d0v9)       |                  (***T2)
1.264585364   burn_credits(d0v9)           |
1.264585537   tickle_check d0v9            |
              ...                         ...
              spin_lock(rq->lock)          |

NOW() is always sampled and checked *inside* a critical section
protected by rq->lock, so it's not possible that the value gets stale
waiting for a lock (I've fixed a couple of instances of this already).

So, if this is what happens, it's correct to expect NOW() at T2 to be
ahead of NOW() at T1 (and hence what really happens, i.e., NOW() at T2
being smaller than NOW() at T1, means there is a bug somewhere). And I
am sure the flow is actually this, not only because it is the order
shown by both xenalyze and xentrace_format, but also from looking at
the implementation of runq_tickle() in sched_credit2.c.

In fact, in there we first call burn_credits() and check whether we can
preempt the vcpu running on our vc->processor; in this case, that is
d0v10 checking d0v5, at time 1.264583907. Afterwards, we check the vcpu
running on each pCPU (of our runqueue): that is d1v1 on pCPU 0 and d1v3
on pCPU 1; then we skip d0v5 on pCPU 2 (we already checked it outside
of the loop), and then we check d0v9 on pCPU 3.

And this is ultimately why I'm sure about the execution flow: d0v9 is
current on pCPU 3 _only_ if the call to csched2_schedule() that put it
in execution has already happened (still on pCPU 3), and that, thanks
also to the locking, means d0v9->start_time = NOW() at T1 had already
happened when we reached T2.
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

