From: George Dunlap <george.dunlap@citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
Joao Martins <joao.m.martins@oracle.com>,
Tim Deegan <tim@xen.org>, Jan Beulich <JBeulich@suse.com>
Subject: Re: BUG: NOW() seems to (sometimes) go backwards!
Date: Thu, 9 Jun 2016 11:19:36 +0100 [thread overview]
Message-ID: <575942B8.8060904@citrix.com> (raw)
In-Reply-To: <1465314895.15816.165.camel@citrix.com>
On 07/06/16 16:54, Dario Faggioli wrote:
> Hi,
>
> I've been fighting with this during the past few days, while testing
> and benchmarking some of the modification I'm doing to Credit2.
>
> In summary, what I see is NOW() going backwards. The issues shows up
> when I call NOW() on different pCPUs, and compare the results (which is
> something the Credit2 scheduler does a lot, and I would like it to be
> able to continue to do so. :-D).
>
> I can elaborate on why this is really really bad for scheduling in
> general, and even more so for Credit2. For now, let me just say it's
> pretty bad. Feel free to ask for details if interested.
>
> In any case, IMO, we need to fix this.
>
> It does not only happen cross-socket, i.e., it happens even if the two
> pCPUs are part of the same socket (in the example shown below, they're
> two SMT threads of the same core!!).
>
> I'm able to trigger this on more than just one box, all with invariant
> TSC support. (All Intel, though, I don't have any AMD test hardware
> available right now).
>
> In some more details.
>
> The scheduler (both in scheduler.c and in sched_credit{1,2}.c) uses
> NOW() for all the timing and time sampling. In the trace below, at time
> T1 (equal to 1.264582607), we do this:
>
> d0v9->start_time = NOW()
>
> and we are on pCPU 3.
>
> Then, at time T2, on pCPU2, there is this:
>
> if ( d0v9->start_time < NOW() )
> {
> /* OMG!! */
> __trace("csched2:time_backwards ...")
> }
>
> and we indeed hit the __trace().
>
> What the trace point says is that NOW() was 1972180296557 at time T1,
> on pCPU 3, and it is 1972180270044 at time T2 on pCPU2 (i.e., ~26.5
> microsecs in the past).
>
> I speak about NOW() going backwards because, by looking at both the
> code and the trace, I am positive about the fact that T2 happens later
> than T1 (go to [1], if interested)... So, why NOW() is smaller at T2?
> :-(
>
> 1.264577304 ||x||||||||||||| d0v10 runstate_change d0v10 running->runnable
> 1.264577611 ||x||||||||||||| d?v? runstate_change d0v5 runnable->running
> 1.264580025 |||x||||||-||||| d0v0 vcpu_block d0v0
> 1.264580795 |||x||||||-||||| d0v0 csched2:schedule current = d0v0, now = 1972180296557
> 1.264581014 |||x||||||-||||| d0v0 csched2:burn_credits d0v0, credit = 9091956, delta = 39433, start_time = 1972180296557
> 1.264581660 |||x||||||-||||| d0v0 sched_switch prev d0, run for 112.503us
> 1.264581809 |||x||||||-||||| d0v0 sched_switch next d0, was runnable for 54.532us, next slice 2000.0us
> 1.264582033 |||x||||||-||||| d0v0 sched_switch prev d0v0 next d0v9
> 1.264582269 |||x||||||-||||| d0v0 runstate_change d0v0 running->blocked
> T1 1.264582607 |||x||||||-||||| d?v? runstate_change d0v9 runnable->running
> 1.264583109 ||x|||||||-||||| d0v5 csched2:runq_insert d0v10, position 0
> 1.264583341 ||x|||||||-||||| d0v5 csched2:runq_tickle_new d0v10, credit = 5691963
> 1.264583907 ||x|||||||-||||| d0v5 csched2:burn_credits d0v5, credit = 8575685, delta = 8241, start_time = 1972180270044
> 1.264584236 ||x|||||||-||||| d0v5 csched2:burn_credits d1v1, credit = 8696613, delta = 12920, start_time = 1972180270044
> 1.264584431 ||x|||||||-||||| d0v5 csched2:tickle_check d1v1, credit = 8696613
> 1.264584745 ||x|||||||-||||| d0v5 csched2:burn_credits d1v3, credit = 10182550, delta = 12920, start_time = 1972180270044
> 1.264584966 ||x|||||||-||||| d0v5 csched2:tickle_check d1v3, credit = 10182550
> T2 1.264585207 ||x|||||||-||||| d0v5 csched2:time_backwards d0v9: cur_time = 1972180270044, cur_cpu = 2, old_time = 1972180296557, old_cpu = 3
> 1.264585364 ||x|||||||-||||| d0v5 csched2:burn_credits d0v9, credit = 5759499, delta = -26513, start_time = 1972180270044
> 1.264585537 ||x|||||||-||||| d0v5 csched2:tickle_check d0v9, credit = 5759499
> 1.264585876 ||x|||||||-||||| d0v5 csched2:burn_credits d0v7, credit = 8835314, delta = 12920, start_time = 1972180270044
> 1.264586041 ||x|||||||-||||| d0v5 csched2:tickle_check d0v7, credit = 8835314
> 1.264586491 ||x|||||||-||||| d0v5 csched2:burn_credits d0v15, credit = 5822635, delta = 12920, start_time = 1972180270044
> 1.264586655 ||x|||||||-||||| d0v5 csched2:tickle_check d0v15, credit = 5822635
> 1.264586952 ||x|||||||-||||| d0v5 csched2:burn_credits d0v13, credit = 7956174, delta = 12920, start_time = 1972180270044
> 1.264587115 ||x|||||||-||||| d0v5 csched2:tickle_check d0v13, credit = 7956174
> 1.264587385 ||x|||||||-||||| d0v5 csched2:burn_credits d0v3, credit = 7870338, delta = 12920, start_time = 1972180270044
> 1.264587545 ||x|||||||-||||| d0v5 csched2:tickle_check d0v3, credit = 7870338
> 1.264595859 |||x||||||-||||| d0v9 csched2:schedule current = d0v9, now = 1972180311345
> 1.264596313 |||x||||||-||||| d0v9 csched2:burn_credits d0v9, credit = 5718198, delta = 4877, start_time = 1972180311345
>
> Interestingly enough, if we look at the same (well, sort of) set of
> events with just xentrace_format, we, after having sorted the records
> by their timestamps, get this:
>
> CPU2 5613124777311 (+ 507) running_to_runnable [ dom:vcpu = 0x0000000a ]
> CPU2 5613124778046 (+ 735) runnable_to_running [ dom:vcpu = 0x00000005 ]
>
> CPU3 5613124788657 (+ 537) __enter_scheduler [ prev<dom:vcpu> = 0x00000000, next<dom:vcpu> = 0x00000009 ]
> CPU3 5613124789224 (+ 567) running_to_blocked [ dom:vcpu = 0x00000000 ]
> T1 CPU3 5613124790034 (+ 810) runnable_to_running [ dom:vcpu = 0x00000009 ]
>
> CPU2 5613124791240 (+ 13194) csched2:runq_pos [ dom:vcpu = 0x0000000a, pos = 0]
> CPU2 5613124791795 (+ 555) csched2:tickle_new [ dom:vcpu = 0x0000000a, credit = 0x5691963 ]
> CPU2 5613124793154 (+ 1359) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 5 ]
> CPU2 5613124793943 (+ 789) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65537 ]
> CPU2 5613124794411 (+ 468) csched2:tickle_check [ dom:vcpu = 0x00010001, credit = 8696613 ]
> CPU2 5613124795164 (+ 753) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65539 ]
> CPU2 5613124795695 (+ 531) csched2:tickle_check [ dom:vcpu = 0x00010003, credit = 10182550 ]
> T2 CPU2 5613124796274 (+ 579) unknown (0x000000000002221e) [ 0x2f1abbdc 0x000001cb 0x2f1b236d 0x000001cb 0x00000009 0x00030002 0x00000000 ]
> CPU2 5613124796649 (+ 375) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 9 ]
> CPU2 5613124797066 (+ 417) csched2:tickle_check [ dom:vcpu = 0x00000009, credit = 5759499 ]
> CPU2 5613124797879 (+ 813) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 7 ]
> CPU2 5613124798275 (+ 396) csched2:tickle_check [ dom:vcpu = 0x00000007, credit = 8835314 ]
> CPU2 5613124799355 (+ 1080) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 15 ]
> CPU2 5613124799748 (+ 393) csched2:tickle_check [ dom:vcpu = 0x0000000f, credit = 5822635 ]
> CPU2 5613124800459 (+ 711) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 13 ]
> CPU2 5613124800852 (+ 393) csched2:tickle_check [ dom:vcpu = 0x0000000d, credit = 7956174 ]
> CPU2 5613124801500 (+ 648) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 3 ]
> CPU2 5613124801884 (+ 384) csched2:tickle_check [ dom:vcpu = 0x00000003, credit = 7870338 ]
>
> CPU3 5613124803177 (+ 13143) csched2:update_load
> CPU3 5613124803786 (+ 609) unknown (0x0000000000022220) [ 0x0000d61e 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
> CPU3 5613124804560 (+ 774) csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x12000009, rq_avgload = 0x0000000000216566, b_avgload = 0x0000000000216566,
> CPU3 5613124804929 (+ 369) unknown (0x0000000000022220) [ 0x0001e8de 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
> CPU3 5613124805256 (+ 327) csched2:updt_vcpu_load [ dom:vcpu = 0x00000000, vcpuload = 0x0000000000026040, wshift = 18 ]
>
> CPU2 5613124975596 (+ 173712) domain_wake [ dom:vcpu = 0x00010005 ]
>
> Interestingly, the timestamp at T1 is 5613124790034, which looks
> _correctly_ smaller than the timestamp at T2 (5613124796274).
>
> NOW(), as it is called from the scheduling code, does this:
> s_time_t NOW(u64 at_tsc) //it's actually get_s_time_fixed()
> {
> struct cpu_time *t = &this_cpu(cpu_time);
> u64 tsc, delta;
> s_time_t now;
>
> tsc = rdtsc();
> delta = tsc - t->local_tsc_stamp;
> now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
>
> return now;
> }
>
> While the timestamps in the trace are the result of get_cycles():
> static inline cycles_t get_cycles(void)
> {
> return rdtsc();
> }
>
> So, it looks to me that the TSC is actually ok, and it could be the
> local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
> (which, AFAICUI, is there to convert cycles to time) that does not
> guarantee the result to be monotonic, even if the input is... Thoughts?
What about adding a trace point inside NOW() to record the various
inputs into the calculation?
Adding some sort of global ASSERT() that time doesn't go backwards by
too much would be nice, but I think would require spinlocks -- OTOH,
just for debugging purposes it might not be that bad.
-George
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
prev parent reply other threads:[~2016-06-09 10:20 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-07 15:54 BUG: NOW() seems to (sometimes) go backwards! Dario Faggioli
2016-06-08 10:42 ` Jan Beulich
2016-06-08 13:43 ` Dario Faggioli
2016-06-08 14:01 ` Jan Beulich
2016-06-08 20:36 ` Joao Martins
2016-06-09 8:05 ` Jan Beulich
2016-06-09 10:24 ` Joao Martins
2016-06-09 10:43 ` Joao Martins
2016-06-09 11:25 ` Jan Beulich
2016-06-09 11:31 ` Jan Beulich
2016-06-09 10:19 ` George Dunlap [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=575942B8.8060904@citrix.com \
--to=george.dunlap@citrix.com \
--cc=JBeulich@suse.com \
--cc=andrew.cooper3@citrix.com \
--cc=dario.faggioli@citrix.com \
--cc=joao.m.martins@oracle.com \
--cc=tim@xen.org \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).