rdtscP and xen (and maybe the app-tsc answer I've been looking for)


* rdtscP and xen (and maybe the app-tsc answer I've been looking for)
@ 2009-09-18 16:30 Dan Magenheimer
  2009-09-18 20:27 ` Dan Magenheimer
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-18 16:30 UTC (permalink / raw)
  To: Xen-Devel (E-mail)

Xen doesn't appear to support the rdtscp instruction.
Should it?  (And specifically I'm wondering whether
it should be emulated whenever rdtsc is emulated
but see below for another intriguing possibility.)

Rdtscp is unprivileged and we have apps that are using it
on bare metal, after validating that the CPU supports it.
The instruction is available on most (all?) recent AMD
CPUs and Intel's Nehalem supports it.

For an OS to support rdtscp properly, the OS must (once at boot)
wrmsr a different value for each cpu to a "TSC_AUX" register
and this register is read along with the TSC when the rdtscp
instruction is executed.  This allows an app to determine
if two consecutive rdtscp's are (or are not) executed on the
same CPU.

It appears that all recent RHEL kernels write to TSC_AUX if
the CPU supports rdtscp.  I'm told Windows 2008 notably does
not.  Don't know about SLES or other Windoze.

It's not clear to me if/how rdtscp can/should be virtualized.
To do it properly, the value written to the TSC_AUX msr
would become part of the vcpu's state, and would need to
be changed whenever a vcpu->pcpu mapping changes.  To meet
only the current use model of the instruction, Xen could write
TSC_AUX for each pcpu on Xen boot and always ignore guest
OS writes to TSC_AUX.  (This assumes that no OS ever reads
TSC_AUX and attempts to match it with the value that it
thought it wrote to TSC_AUX; and assumes that 

One solution is for Xen to deny the existence of rdtscp even
when Xen is running on hardware that supports it.  Is that
exactly what is happening?

Now thinking creatively, could TSC_AUX be used similarly
to the pvclock version number... Xen bumps it whenever a
migration occurs which would prompt an app to go out
and reread new values for scaling and offset (possibly
via specially-handled-by-Xen usermode rdmsr)?  Hmmm...
I think it might be the answer I've been looking for!
(Go ahead, shoot me down :-)

Dan


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-18 16:30 rdtscP and xen (and maybe the app-tsc answer I've been looking for) Dan Magenheimer
@ 2009-09-18 20:27 ` Dan Magenheimer
  2009-09-18 22:55   ` Jeremy Fitzhardinge
  2009-09-21  8:17   ` Jan Beulich
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-18 20:27 UTC (permalink / raw)
  To: Xen-Devel (E-mail), Jan Beulich, Keir Fraser, Jeremy Fitzhardinge
  Cc: kurt.hackel

OK, here's the long version (/me crosses
fingers and hopes to get away from this
for at least some of the weekend)...

Proposal ("pv rdtscp"):

The rdtscP instruction was added to the x86
architecture by AMD a couple of years ago and
Intel added it starting at Nehalem.  It is
essentially the same as rdtsc except that, in
addition, it copies the value of a privileged
MSR, "TSC_AUX", into the ECX register.  There
is a CPUID bit that can be checked to determine
if the processor supports the rdtscp instruction.
Xen currently does not expose hardware support
for rdtscp to guests.

I propose to paravirtualize support for
rdtscp as follows:

If guest vm.cfg has vrdtscp=0 (default):
  rdtscp is emulated and returns nsec since guest
  boot (same as emulated rdtsc), value returned
  for TSC_AUX is -1

If guest vm.cfg has vrdtscp=1:
  If underlying hardware has rdtscp support:
    rdtscp is directly executed by hardware,
    value returned for TSC_AUX is non-zero
    (see below)
  Else: (no hardware rdtscp support)
    rdtscp is emulated and returns nsec since
    guest boot, value returned for TSC_AUX is 0

How it works from the app point-of-view:

Guest app must have some capability of getting 64-bit
pvclock parameters directly from Xen without OS changes,
e.g. emulated userland wrmsr, userland hypercall,
or userland mapped shared page.  (This will be done
rarely so need not be fast! But it does create
a new userland<->Xen ABI that must be kept compatible.)

On first rdtscp, app records the returned TSC_AUX value.
If TSC_AUX is zero or -1, rdtscp is being emulated and
the tsc value IS nsec since guest boot.  Otherwise the
app fetches pvclock parameters from Xen and executes
another rdtscp.  If TSC_AUX matches the recorded value,
app applies the pvclock algorithm to the tsc value to
obtain nsec since guest boot.  If TSC_AUX differs from
the recorded value, app records the new value and
fetches pvclock parameters from Xen again.

On subsequent rdtscp's, app compares
returned TSC_AUX against the previous one,
and fetches pvclock parameters from Xen only
if it differs (which should be rare).
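
The app-side steps above can be sketched roughly as follows.
This is only an illustration under stated assumptions: rdtscp()
and fetch_params() are hypothetical callables standing in for
the instruction and for whatever userland<->Xen mechanism is
chosen, and the scaling follows the usual pvclock form
(shift the TSC delta, multiply by a 32.32 fixed-point factor).

```python
from dataclasses import dataclass

# TSC_AUX values the proposal reserves to mean "rdtscp is emulated":
# 0 and -1 (seen here as an unsigned 32-bit value).
EMULATED = (0, 0xFFFFFFFF)


@dataclass
class PvclockParams:
    tsc_timestamp: int      # TSC value when the parameters were sampled
    system_time_ns: int     # nsec since guest boot at tsc_timestamp
    tsc_to_system_mul: int  # 32.32 fixed-point multiplier
    tsc_shift: int          # pre-shift applied to the TSC delta


def pvclock_ns(tsc, p):
    """Scale a raw TSC value to nsec since guest boot."""
    delta = tsc - p.tsc_timestamp
    delta = delta << p.tsc_shift if p.tsc_shift >= 0 else delta >> -p.tsc_shift
    return p.system_time_ns + ((delta * p.tsc_to_system_mul) >> 32)


class AppClock:
    def __init__(self, rdtscp, fetch_params):
        self.rdtscp = rdtscp              # hypothetical: returns (tsc, tsc_aux)
        self.fetch_params = fetch_params  # hypothetical: the userland<->Xen call
        self.last_aux = None
        self.params = None
        self.fetches = 0                  # counts the (hopefully rare) slow path

    def now_ns(self):
        tsc, aux = self.rdtscp()
        while aux not in EMULATED and aux != self.last_aux:
            # First call, or version changed: record it, refetch, re-read.
            self.last_aux = aux
            self.params = self.fetch_params()
            self.fetches += 1
            tsc, aux = self.rdtscp()
        if aux in EMULATED:
            return tsc                    # tsc already IS nsec since guest boot
        return pvclock_ns(tsc, self.params)
```

The fast path is one rdtscp plus a compare; the fetch only
happens when TSC_AUX differs from the recorded value.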

What Xen needs to do:

Xen must record the setting for each guest's vrdtscp
config variable and ensure that it persists across
save/restore and migration.  If the guest has
vrdtscp=1, a vrdtscp "version" number is also
part of the guest's state and must persist
across save/restore/migration.

Xen must know whether or not it is running on a
machine where TSC is reliable.  If TSC is NOT
reliable AND rdtscp is supported by hardware,
Xen must ensure that TSC_AUX is -1 on all pcpu's
that are running a guest with vrdtscp=0, and 0
on all pcpu's that are running a guest where
vrdtscp=1 (and must enable CR4.TSD on those
pcpus if it wasn't already).  If TSC is NOT
reliable AND rdtscp is NOT supported by hardware,
Xen must emulate rdtscp (e.g.
return Xen system time) and emulate the
same behavior for TSC_AUX.  If TSC IS reliable,
Xen sets TSC_AUX to the guest's vrdtscp version
number on all pcpu's that are running the guest.
Finally, when a guest transitions from one
"TSC domain" to another (restore/migrate/NUMA)
it increments the vrdtscp version number.
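
One possible reading of the rules above, written out as a
decision function.  The function name and the (tsc_aux, emulate)
return shape are made up for illustration, and the treatment of
vrdtscp=0 on a reliable-TSC machine is an assumption taken from
the vm.cfg table earlier in this mail (always emulated, TSC_AUX
is -1).

```python
AUX_EMULATED_DEFAULT = 0xFFFFFFFF  # -1 as a 32-bit value, for vrdtscp=0
AUX_EMULATED_OPT_IN = 0            # vrdtscp=1 but rdtscp must be emulated


def tsc_aux_policy(tsc_reliable, hw_rdtscp, vrdtscp, version):
    """Return (tsc_aux, emulate) for a pcpu about to run a guest vcpu.

    emulate=True means rdtscp traps to Xen (CR4.TSD set, or no
    hardware support) and returns nsec since guest boot.
    """
    if not vrdtscp:
        return AUX_EMULATED_DEFAULT, True
    if tsc_reliable and hw_rdtscp:
        # Native execution; the app sees the guest's version number.
        return version, False
    return AUX_EMULATED_OPT_IN, True
```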

I think this will work even for a NUMA machine
provided Xen always schedules all the vcpus
for one guest on pcpus in the same NUMA node,
and increments the version number when
the guest is rescheduled from one NUMA node to
another (assuming TSC on each node is reliable).

I think this pv-rdtscp mechanism will work
for both PV and HVM (with minor additional work
in Xen for HVM); it will be very fast on any
hardware that supports rdtscp in hardware
(which for Intel only includes Nehalem+ but
that provides even more incentive for
customers to upgrade).  Apps that currently
use rdtscp will continue to work (as long as
they don't have some wild use model that I
don't know about).
Pvclock algorithm in the OS would need to be
changed to use rdtscp (instead of rdtsc)
and check for TSC_AUX=0 to do the right thing.
If not changed, it will continue to work,
though more slowly; this holds whether or not
rdtsc is emulated, because an emulated rdtsc
attempted in kernel mode returns the hardware
TSC.

The only problem I can see is that when
vrdtscp==1, other apps that are running on that guest
that use rdtsc (no p) directly (i.e. haven't been
modified to use pv-rdtscp) will continue to
have the same kinds of failure on save/restore/
migration.  But this is true of all the solutions
proposed so far: Xen can only turn on emulation
guest-wide, not per-app.

Also even on machines where TSC is reliable,
there is a small chance that consecutive
TSC values read will be from different
processors and so TSC might appear to go
backwards by some small amount.  So apps
must still put raw TSC values through
a "monotonicity filter".  (Xen already
does this for emulated reads of TSC.)
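
A "monotonicity filter" on the app side could be as small as
the sketch below: never hand out a value lower than the last
one handed out.  The class name is invented for illustration,
and a lock stands in for whatever atomic compare-and-swap a
real multi-threaded app would use.

```python
import threading


class MonotonicFilter:
    """Clamp raw TSC-derived times so they never appear to go backwards."""

    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def filter(self, raw):
        with self._lock:
            if raw > self._last:
                self._last = raw
            # A reading from a slightly "behind" CPU repeats the last value.
            return self._last
```

The cost is that two reads taken on different processors may
return the same time, which is usually preferable to time
going backwards.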

Comments?



* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-18 20:27 ` Dan Magenheimer
@ 2009-09-18 22:55   ` Jeremy Fitzhardinge
  2009-09-19 15:34     ` Dan Magenheimer
  2009-09-21  8:17   ` Jan Beulich
  1 sibling, 1 reply; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-18 22:55 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/18/09 13:27, Dan Magenheimer wrote:
> If guest vm.cfg has vrdtscp=0 (default):
>   rdtscp is emulated and returns nsec since guest
>   boot (same as emulated rdtsc), value returned
>   for TSC_AUX is -1
>
> If guest vm.cfg has vrdtscp=1:
>   If underlying hardware has rdtscp support:
>     rdtscp is directly executed by hardware,
>     value returned for TSC_AUX is non-zero
>     (see below)
>   Else: (no hardware rdtscp support)
>     rdtscp is emulated and returns nsec since
>     guest boot, value returned for TSC_AUX is 0
>   

Why do you need to distinguish between the two emulated rdtscp cases? 
Special-casing a version of '0' is awkward because it would arise
naturally from version wraparound (after 2^31 time parameter updates,
but still).

If the hardware doesn't support rdtscp, how should an app know whether
or not to use it?  Should it just try running rdtscp being prepared to
handle a SIGILL?

> How it works from the app point-of-view:
>
> Guest app must have some capability of getting 64-bit
> pvclock parameters directly from Xen without OS changes,
> e.g. emulated userland wrmsr, userland hypercall,
> or userland mapped shared page.  (This will be done
> rarely so need not be fast! But it does create
> a new userland<->Xen ABI that must be kept compatible.)
>
> On first rdtscp, app records returned TSC_AUX value,
> verifies that it is neither 0 nor -1,
> fetches pvclock parameters from Xen, executes
> another rdtscp.  If TSC_AUX matches previous value,
> app applies pvclock algorithm to tsc value to
> obtain nsec since guest boot.  If TSC_AUX is
> zero or -1, tsc value IS nsec since guest boot.
> If TSC_AUX differs from last recorded value,
> fetch pvclock parameters from Xen again.
>
> On subsequent rdtscp's, app compares
> returned TSC_AUX against the previous one,
> and fetches pvclock parameters from Xen only
> if it differs (which should be rare).
>   

Presumably the pvclock would contain the same version number which must
match; if not it keeps iterating (rdtscp, get-timing-parameters) until
they do.

> What Xen needs to do:
>
> Xen must record the setting for each guest's vrdtscp
> config variable and ensure that it persists across
> save/restore and migration.  If the guest has
> vrdtscp=1, a vrdtscp "version" number is also
> part of the guest's state and must persist
> across save/restore/migration.
>
> Xen must know whether or not it is running on a
> machine where TSC is reliable.  If TSC is NOT
> reliable AND rdtscp is supported by hardware,
> Xen must ensure that TSC_AUX is -1 on all pcpu's
> that are running a guest with vrdtscp=0, and 0
> on all pcpu's that are running a guest where
> vrdtscp=1 (and must enable CR4.TSD on those
> pcpus if it wasn't already).

If rdtscp is not reliable but Xen has accurate tsc parameter info, then
the algorithm above will still work efficiently.

>   If TSC is NOT
> reliable AND rdtscp is NOT supported by hardware,
> Xen must emulate rdtscp (e.g.
> return Xen system time) and emulate the
> same behavior for TSC_AUX.  If TSC IS reliable,
> Xen sets TSC_AUX to the guest's vrdtscp version
> number on all pcpu's that are running the guest.
> Finally, when a guest transitions from one
> "TSC domain" to another (restore/migrate/NUMA)
> it increments the vrdtscp version number.
>   

Well, it just needs to increment it whenever Xen knows the tsc has
changed, as the current pvclock code does.  It could be more frequently
than restore/migrate if tsc changes on power events.

> The only problem I can see is that when
> vrdtscp==1, other apps that are running on that guest
> that use rdtsc (no p) directly (i.e. haven't been
> modified to use pv-rdtscp) will continue to
> have the same kinds of failure on save/restore/
> migration.  But this is true of all the solutions
> proposed so far: Xen can only turn on emulation
> guest-wide, not per-app.
>   

Linux already reserves rdtscp for use as part of vsyscall, where TSC_AUX
contains the NUMA node and the CPU number, so there should be no "naked"
users of rdtscp.
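
For reference, the Linux encoding mentioned here packs the CPU
number into the low 12 bits of TSC_AUX and the NUMA node into
the bits above.  A small sketch of that layout (the helper
names are made up):

```python
CPU_BITS = 12
CPU_MASK = (1 << CPU_BITS) - 1  # 0xfff


def encode_tsc_aux(node, cpu):
    """Pack node and cpu the way Linux does for its getcpu fast path."""
    return (node << CPU_BITS) | (cpu & CPU_MASK)


def decode_tsc_aux(aux):
    """Return (cpu, node) from a TSC_AUX value written by Linux."""
    return aux & CPU_MASK, aux >> CPU_BITS
```

Any scheme in which Xen owns TSC_AUX would hand such a guest
values that no longer decode to a meaningful cpu/node pair.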

> Also even on machines where TSC is reliable,
> there is a small chance that consecutive
> TSC values read will be from different
> processors and so TSC might appear to go
> backwards by some small amount.  So apps
> must still put raw TSC values through
> a "monotonicity filter".  (Xen already
> does this for emulated reads of TSC.)
>   

Why?  I thought "reliable" tscs were supposed to be synced between cores?

    J


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-18 22:55   ` Jeremy Fitzhardinge
@ 2009-09-19 15:34     ` Dan Magenheimer
  2009-09-21 14:47       ` Dan Magenheimer
  2009-09-21 18:36       ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-19 15:34 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

> Why do you need to distinguish between the two emulated rdtscp cases? 
> Special-casing a version of '0' is awkward because it would arise
> naturally from version wraparound (after 2^31 time parameter updates,
> but still).

You're right, I don't need to differentiate between
the two emulated cases.  I was trying to overload
an extra piece of information that I really don't
need to overload.

However, I do need one special case to indicate
emulation vs non-emulation, so wraparound is
still a problem.

Fortunately, wraparound should only occur impossibly
rarely (see below), probably less frequently than
TSC wraparound.

> If the hardware doesn't support rdtscp, how should an app know whether
> or not to use it?  Should it just try running rdtscp being prepared to
> handle a SIGILL?

Yes, that's the plan.  I think this scheme always
works, but only works fast if the hardware supports
rdtscp and constant_tsc.

> If rdtscp is not reliable but Xen has accurate tsc parameter 
> info, then
> the algorithm above will still work efficiently.
> :
> Well, it just needs to increment it whenever Xen knows the tsc has
> changed, as the current pvclock code does.  It could be more 
> frequently
> than restore/migrate if tsc changes on power events.

I've restricted the scheme to constant_tsc as I think
it breaks down due to nasty races if running on a
machine where the pvclock parameters differ across
different pcpus.  I think the races can only be
avoided if Xen sets the TSC_AUX for all of the
pcpus running a pvrdtscp domain while all are idle.

Is there a scheme that avoids the races?

Fortunately, this also has the effect of greatly
reducing the version increase frequency.

> > Also even on machines where TSC is reliable,
> > there is a small chance that consecutive
> > TSC values read will be from different
> > processors and so TSC might appear to go
> > backwards by some small amount.  So apps
> > must still put raw TSC values through
> > a "monotonicity filter".  (Xen already
> > does this for emulated reads of TSC.)
> 
> Why?  I thought "reliable" tscs were supposed to be synced 
> between cores?

The rate is synced but the values may not be.  Since
software (BIOS or Xen) sets tsc on each processor
it is essentially impossible to ensure they are
identical.  The rendezvous algorithm should be able
to set them so that they are "unobservably" different,
but I keep hearing "within 2usec".  (It would be
interesting to measure this across a broad set
of machines.)  So it's probably prudent to recommend
that apps be prepared for the possibility even if
it never happens.

Dan


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-18 20:27 ` Dan Magenheimer
  2009-09-18 22:55   ` Jeremy Fitzhardinge
@ 2009-09-21  8:17   ` Jan Beulich
  2009-09-21 14:04     ` Dan Magenheimer
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2009-09-21  8:17 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail),
	kurt.hackel, Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 18.09.09 22:27 >>>
>Guest app must have some capability of getting 64-bit
>pvclock parameters directly from Xen without OS changes,
>e.g. emulated userland wrmsr, userland hypercall,
>or userland mapped shared page.  (This will be done
>rarely so need not be fast! But it does create
>a new userland<->Xen ABI that must be kept compatible.)

Are you sure this will indeed be infrequent enough? On my supposedly
constant-TSC AMD box, I see Xen quite frequently apply small error
correction factors to keep TSC from running ahead of HPET/PMTIMER.

>I think this will work even for a NUMA machine
>provided Xen always schedules all the vcpus
>for one guest on pcpus in the same NUMA node,
>and increments the version number when
>the guest is rescheduled from one NUMA node to
>another (assuming TSC on each node is reliable).

I think this is an improper assumption: Any guest with more vCPU-s than
there are pCPU-s on a single node will likely benefit from being run on
two (or more) nodes (compared to its vCPU-s competing amongst
themselves for pCPU-s on a single node).

Jan


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21  8:17   ` Jan Beulich
@ 2009-09-21 14:04     ` Dan Magenheimer
  2009-09-21 14:18       ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 14:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

Hi Jan --

Thanks for the feedback!

> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 18.09.09 22:27 >>>
> >Guest app must have some capability of getting 64-bit
> >pvclock parameters directly from Xen without OS changes,
> >e.g. emulated userland wrmsr, userland hypercall,
> >or userland mapped shared page.  (This will be done
> >rarely so need not be fast! But it does create
> >a new userland<->Xen ABI that must be kept compatible.)
> 
> Are you sure this will indeed be infrequent enough? On my supposedly
> constant-TSC AMD box, I see Xen quite frequently apply small error
> correction factors to keep TSC from running ahead of HPET/PMTIMER.

I'd like to hear from Keir on this, but I'd
guess that this would be either a bug or a
remnant of or inaccuracy in an old algorithm.

Also if you could provide more information, I'd
like to see if I can reproduce it on my Intel
constant_tsc machines.
 
> >I think this will work even for a NUMA machine
> >provided Xen always schedules all the vcpus
> >for one guest on pcpus in the same NUMA node,
> >and increments the version number when
> >the guest is rescheduled from one NUMA node to
> >another (assuming TSC on each node is reliable).
> 
> I think this is an improper assumption: Any guest with more 
> vCPU-s than
> there are pCPU-s on a single node will likely benefit from 
> being run on
> two (or more) nodes (compared to its vCPU-s competing amongst
> themselves for pCPU-s on a single node).

Any guest that has some vcpus running on pcpus
in one "TSC domain" and other vcpus running on
pcpus in another "TSC domain" would
have to be handled the same as running on a
machine with tsc_NOT_constant.  This does raise
a challenge for multi-socket machines that Xen has
to be able to determine and record what pcpu's are
within a TSC domain boundary, which may or may
not be the same as a NUMA boundary.


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 14:04     ` Dan Magenheimer
@ 2009-09-21 14:18       ` Jan Beulich
  2009-09-21 15:25         ` Dan Magenheimer
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2009-09-21 14:18 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 21.09.09 16:04 >>>
>> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 18.09.09 22:27 >>>
>> >Guest app must have some capability of getting 64-bit
>> >pvclock parameters directly from Xen without OS changes,
>> >e.g. emulated userland wrmsr, userland hypercall,
>> >or userland mapped shared page.  (This will be done
>> >rarely so need not be fast! But it does create
>> >a new userland<->Xen ABI that must be kept compatible.)
>> 
>> Are you sure this will indeed be infrequent enough? On my supposedly
>> constant-TSC AMD box, I see Xen quite frequently apply small error
>> correction factors to keep TSC from running ahead of HPET/PMTIMER.
>
>I'd like to hear from Keir on this, but I'd
>guess that this would be either a bug or a
>remnant of or inaccuracy in an old algorithm.
>
>Also if you could provide more information, I'd
>like to see if I can reproduce it on my Intel
>constant_tsc machines.
 
Not sure what further detail you mean.  All you would want to look
for are cases where error_factor is non-zero in
local_time_calibration() (or local time getting warped forward in the
same function).  I can only say for sure that the former happens not
infrequently, as a percentage of executions of
local_time_calibration(); of course, that function itself doesn't run
very frequently.

Jan


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-19 15:34     ` Dan Magenheimer
@ 2009-09-21 14:47       ` Dan Magenheimer
  2009-09-21 18:36       ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 14:47 UTC (permalink / raw)
  To: dan.magenheimer, Jeremy Fitzhardinge
  Cc: Jan Beulich, kurt.hackel, Xen-Devel (E-mail), Keir Fraser

> > Why do you need to distinguish between the two emulated 
> rdtscp cases? 
> > Special-casing a version of '0' is awkward because it would arise
> > naturally from version wraparound (after 2^31 time 
> parameter updates,
> > but still).
> 
> You're right, I don't need to differentiate between
> the two emulated cases.  I was trying to overload
> an extra piece of information that I really don't
> need to overload.
> 
> However, I do need one special case to indicate
> emulation vs non-emulation, so wraparound is
> still a problem.
> 
> Fortunately, wraparound should only occur impossibly
> rarely (see below), probably less frequently than
> TSC wraparound.

I realized later that since Xen controls the values
placed in TSC_AUX, it can easily skip any special-cased
values.  Then wraparound is not a problem as long as
the app tests for "version number is different" rather
than "version number is greater."
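
A minimal sketch of that idea, assuming a 32-bit TSC_AUX and
the two reserved values from the proposal (0 and -1); the
function names are hypothetical:

```python
MASK32 = 0xFFFFFFFF
RESERVED = {0, MASK32}  # special-cased values: 0 and -1 as 32-bit


def next_version(version):
    """Xen side: bump the version, wrapping at 32 bits, skipping reserved values."""
    version = (version + 1) & MASK32
    while version in RESERVED:
        version = (version + 1) & MASK32
    return version


def params_stale(cached_version, current_aux):
    """App side: test for 'different', never for 'greater'."""
    return cached_version != current_aux
```

With an inequality test, the wrap from 0xfffffffe to 1 looks
like any other version change to the app.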


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 14:18       ` Jan Beulich
@ 2009-09-21 15:25         ` Dan Magenheimer
  2009-09-21 15:41           ` Keir Fraser
  2009-09-21 16:03           ` Jan Beulich
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 15:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

> >> Are you sure this will indeed be infrequent enough? On my 
> supposedly
> >> constant-TSC AMD box, I see Xen quite frequently apply small error
> >> correction factors to keep TSC from running ahead of HPET/PMTIMER.
> >
> >I'd like to hear from Keir on this, but I'd
> >guess that this would be either a bug or a
> >remnant of or inaccuracy in an old algorithm.
> >
> >Also if you could provide more information, I'd
> >like to see if I can reproduce it on my Intel
> >constant_tsc machines.
>  
> Not sure what further detail you mean - all that it is you 
> would want to
> look for are cases where error_factor is non-zero in
> local_time_calibration() (or local time getting warped forward in the
> same function; but I can only say for sure that the former does happen
> not infrequently in terms of the percentage of executions of
> local_time_calibration() - of course, that function itself doesn't run
> very frequently).

OK, I think I see the problem.

Since cs19506, "consistent_tscs" is a Xen boot parameter that
defaults to disabled.  If the boot parameter is enabled and
the boot cpu does NOT have X86_FEATURE_CONSTANT_TSC set,
consistent_tscs gets re-disabled.

For my pvrdtscp scheme to work, consistent_tscs would need to
be changed so that it defaults to enabled.

Jan, could you confirm that this solves the problem on your
constant-TSC AMD box?

Keir, is there any reason that consistent_tscs shouldn't
default to enabled?

Thanks,
Dan


* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 15:25         ` Dan Magenheimer
@ 2009-09-21 15:41           ` Keir Fraser
  2009-09-21 15:53             ` Keir Fraser
  2009-09-21 16:03           ` Jan Beulich
  1 sibling, 1 reply; 34+ messages in thread
From: Keir Fraser @ 2009-09-21 15:41 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail)

On 21/09/2009 16:25, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> OK, I think I see the problem.
> 
> Since cs19506 "consistent_tscs" is a Xen boot parameter that
> defaults to disabled.  If the boot parameter is enabled and
> the boot cpu does NOT have X86_FEATURE_CONSTANT_TSC set,
> consistent_tscs gets re-disabled.
> 
> For my pvrdtscp scheme to work, consistent_tscs would need to
> be changed so that it defaults to enabled.
> 
> Jan, could you confirm that this solves the problem on your
> constant-TSC AMD box?
> 
> Keir, is there any reason that consistent_tscs shouldn't
> default to enabled?

There was some question whether it means what you think it means across NUMA
nodes. If we are sure that it does guarantee consistency across NUMA nodes
-- or if we don't care about that -- then it can be enabled by default.

 -- Keir


* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 15:41           ` Keir Fraser
@ 2009-09-21 15:53             ` Keir Fraser
  2009-09-21 16:55               ` Dan Magenheimer
  0 siblings, 1 reply; 34+ messages in thread
From: Keir Fraser @ 2009-09-21 15:53 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail)

On 21/09/2009 16:41, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> Keir, is there any reason that consistent_tscs shouldn't
>> default to enabled?
> 
> There was some question whether it means what you think it means across NUMA
> nodes. If we are sure that it does guarantee consistency across NUMA nodes
> -- or if we don't care about that -- then it can be enabled by default.

There is a question mark over this, since it's not really clear what the
CONSTANT_TSC feature flag actually means. For example, it is set if
CPUID:0x80000007:EDX:8 is set, and that flag merely means that this
particular core's TSC rate is invariant across all Cx/Px/Tx power-saving
states. It doesn't directly say anything about TSC consistency across cores
or sockets unless we are prepared to assume a couple of things: primarily
that all packages run their TSCs at the same rate, and that they are clocked
from the same mainboard oscillator. Is that reasonable to assume? We at
least know the latter is not likely to be true for big-iron NUMA systems,
across NUMA nodes.
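
To make the bit in question concrete: the check is a single
bit test on the EDX value returned by CPUID leaf 0x80000007.
The sketch takes that EDX value as an argument, since how it
is obtained is outside what can be shown here; the function
name is made up.

```python
INVARIANT_TSC_BIT = 8  # CPUID:0x80000007:EDX:8


def tsc_rate_invariant(edx_80000007):
    """True if this core's TSC rate is invariant across power-saving states.

    Says nothing about consistency across cores or sockets.
    """
    return bool((edx_80000007 >> INVARIANT_TSC_BIT) & 1)
```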

 -- Keir


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 15:25         ` Dan Magenheimer
  2009-09-21 15:41           ` Keir Fraser
@ 2009-09-21 16:03           ` Jan Beulich
  1 sibling, 0 replies; 34+ messages in thread
From: Jan Beulich @ 2009-09-21 16:03 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 21.09.09 17:25 >>>
>Jan, could you confirm that this solves the problem on your
>constant-TSC AMD box?

Based on Keir's responses I don't think there's a point trying.

Jan


* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 15:53             ` Keir Fraser
@ 2009-09-21 16:55               ` Dan Magenheimer
  2009-09-21 17:02                 ` Keir Fraser
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 16:55 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail),
	kurt.hackel, Langsdorf, Mark, Nakajima, Jun, Alex Williamson

> >> Keir, is there any reason that consistent_tscs shouldn't
> >> default to enabled?
>
> There is a question mark over this, since it's not really 
> clear what the
> CONSTANT_TSC feature flag actually means. For example, it is set if
> CPUID:0x80000007:EDX:8 is set, and that flag merely means that this
> particular core's TSC rate is invariant across all Cx/Px/Tx 
> power-saving
> states. It doesn't directly say anything about TSC 
> consistency across cores
> or sockets unless we are prepared to assume a couple of 
> things: primarily
> that all packages run their TSCs at the same rate, and that 
> they are clocked
> from the same mainboard oscillator. Is that reasonable to 
> assume? We at
> least know the latter is not likely to be true for big-iron 
> NUMA systems,
> across NUMA nodes.

Both Intel and AMD have confirmed that constant_tsc means
that TSC is consistent across all cores and even across
multiple sockets; and at least one major system vendor (HP)
with multi-enclosure "big iron" AMD-based NUMA systems has
confirmed that TSC is consistent across all nodes.   So
by applying the Xen rendezvous-sync algorithm (that writes
tsc every second) on such machines, Xen has actually been
creating a tsc-sync problem, not alleviating one!

I've cc'ed key AMD/Intel/HP experts who can confirm or
correct/clarify any misassumptions I might have.

I *think* "CPU reports tsc_is_constant but it's not
really constant across all sockets/enclosures/nodes" does
exist, but may be limited to a few older exceptions such
as IBM Summit systems.  Upstream Linux now assumes that
constant_tsc applies across all CPUs unless the kernel
is compiled with CONFIG_X86_NUMAQ (note NOT CONFIG_X86_NUMA),
so Linux has now embraced constant_tsc.

So I'm thinking we should treat consistent_tscs as the
rule rather than the exception, and place the onus on
"broken" systems to disable consistent_tscs with the
boot option when necessary.  To be extremely safe,
we could also add some code in
time_calibration_std_rendezvous() to check
for "significant" tsc differences and report it (and
maybe even auto-disable consistent_tscs).
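
A rough sketch of what such a check might look like, assuming one
TSC sample per CPU has already been captured at the rendezvous
point.  The function names, the sample array and the threshold are
all hypothetical; this is not existing Xen code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given one TSC sample per CPU, all taken at the
 * rendezvous point, return the gap between the largest and smallest.
 */
static uint64_t tsc_spread(const uint64_t *samples, size_t ncpus)
{
    uint64_t lo, hi;
    size_t i;

    if (ncpus == 0)
        return 0;

    lo = hi = samples[0];
    for (i = 1; i < ncpus; i++) {
        if (samples[i] < lo)
            lo = samples[i];
        if (samples[i] > hi)
            hi = samples[i];
    }
    return hi - lo;
}

/*
 * Returns 1 if the spread exceeds max_cycles, in which case the caller
 * could log it and fall back to the periodic re-sync behaviour.
 */
static int tsc_diverged(const uint64_t *samples, size_t ncpus,
                        uint64_t max_cycles)
{
    return tsc_spread(samples, ncpus) > max_cycles;
}
```

What counts as "significant" (max_cycles) is left open here; it
would have to allow for the skew of the sampling itself.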

(One minor correction also:  constant_tsc does NOT
guarantee tsc continues to increment across deep-C-
states... that requires nonstop_tsc.  But Xen already
has the logic to correct deep-C-states in
cstate_restore_tsc().)

Dan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 16:55               ` Dan Magenheimer
@ 2009-09-21 17:02                 ` Keir Fraser
  2009-09-21 17:56                   ` Dan Magenheimer
  0 siblings, 1 reply; 34+ messages in thread
From: Keir Fraser @ 2009-09-21 17:02 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: JeremyFitzhardinge, Xen-Devel (E-mail),
	kurt.hackel, Langsdorf, Mark, Nakajima, Jun, Alex Williamson

On 21/09/2009 17:55, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Both Intel and AMD have confirmed that constant_tsc means
> that TSC is consistent across all cores and even across
> multiple sockets; and at least one major system vendor (HP)
> with multi-enclosure "big iron" AMD-based NUMA systems has
> confirmed that TSC is consistent across all nodes.   So
> by applying the Xen rendezvous-sync algorithm (that writes
> tsc every second) on such machines, Xen has actually been
> creating a tsc-sync problem, not alleviating one!

Constant_tsc is not even directly a hardware flag. It's a synthetic value
that Linux derives for itself and we inherited. Are vendors really making
guarantees about a flag which they do not directly provide?

 -- Keir

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 17:02                 ` Keir Fraser
@ 2009-09-21 17:56                   ` Dan Magenheimer
  2009-09-21 18:17                     ` Keir Fraser
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 17:56 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: JeremyFitzhardinge, Xen-Devel (E-mail),
	kurt.hackel, Langsdorf, Mark, Nakajima, Jun, Alex Williamson

> Are vendors really making guarantees about a flag
> which they do not directly provide?

Sorry, I was overly terse and had lost some of my
context due to a machine crash over the weekend.

By constant_tsc I mean that CPUID:0x80000007:EDX:8
is set.  Upstream Linux (2.6.30) now uses the term
X86_FEATURE_TSC_RELIABLE to indicate that tsc is
consistent across cores and sockets and
X86_FEATURE_NONSTOP_TSC to indicate that it
doesn't stop in deep C-states (which Xen compensates
for) and X86_FEATURE_CONSTANT_TSC to indicate that
it stays running across P/T state transitions.
On Intel systems, CPUID:0x80000007:EDX:8 enables
all of these feature flags.  (Interestingly, on
AMD systems, X86_FEATURE_TSC_RELIABLE is *not*
set by this bit... so my information from AMD is
not represented in Linux (yet)).  Note also that
in linux-2.6.30/arch/x86/kernel/cpu/vmware.c, both
X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE
get set.
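
For reference, a small sketch of decoding the bit in question.  It
assumes the EDX value from CPUID leaf 0x80000007 has already been
obtained; the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* CPUID leaf 0x80000007, EDX bit 8: the bit called constant_tsc above. */
#define CPUID_APM_LEAF      0x80000007u
#define CPUID_APM_EDX_ITSC  (1u << 8)

/* Takes the EDX value returned by CPUID leaf 0x80000007. */
static int edx_has_constant_tsc(uint32_t edx)
{
    return (edx & CPUID_APM_EDX_ITSC) != 0;
}
```

With GCC or clang, the register values can be read with
__get_cpuid() from <cpuid.h>.  Whether the bit also implies
cross-socket consistency is exactly the open question in this
thread; the code only reports what the CPU advertises.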

Some of this is explained nicely here:
http://lkml.indiana.edu/hypermail/linux/kernel/0811.2/00837.html
https://lists.ubuntu.com/archives/kernel-team/2008-October/004279.html
https://lists.ubuntu.com/archives/kernel-team/2008-October/004282.html

(This last one also reinforces my answer to Jeremy as
to why users of the proposed pvrdtscp interface would
still need to post-filter rdtscp values to guarantee no
time-going-backwards problems.)
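
The kind of post-filter meant here can be as small as the sketch
below.  It is per-thread only, and the struct and function names
are hypothetical; a filter shared between threads would need an
atomic compare-and-swap instead of a plain store.

```c
#include <stdint.h>

/* Hypothetical per-thread filter state; initialise last to 0. */
struct mono_filter {
    uint64_t last;   /* largest value handed out so far */
};

/*
 * Return raw if it moves time forward; otherwise repeat the last value
 * so the caller never observes time going backwards.
 */
static uint64_t mono_filter_apply(struct mono_filter *f, uint64_t raw)
{
    if (raw > f->last)
        f->last = raw;
    return f->last;
}
```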

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 17:56                   ` Dan Magenheimer
@ 2009-09-21 18:17                     ` Keir Fraser
  2009-09-21 21:47                       ` Dan Magenheimer
  0 siblings, 1 reply; 34+ messages in thread
From: Keir Fraser @ 2009-09-21 18:17 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich
  Cc: JeremyFitzhardinge, Xen-Devel (E-mail),
	kurt.hackel, Langsdorf, Mark, Nakajima, Jun, Alex Williamson

On 21/09/2009 18:56, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> By constant_tsc I mean that CPUID:0x80000007:EDX:8
> is set.

Well, if it is at least true for 99% of systems, then it might be worth
enabling constant_tsc support by default, and detect TSC divergence at
runtime and disable dynamically. I think that's what Linux does (i.e., it
has a fallback at runtime if its TSC assumptions turn out to be wrong).

 -- Keir

>  Upstream Linux (2.6.30) now uses the term
> X86_FEATURE_TSC_RELIABLE to indicate that tsc is
> consistent across cores and sockets and
> X86_FEATURE_NONSTOP_TSC to indicate that it
> doesn't stop in deep C-states (which Xen compensates
> for) and X86_FEATURE_CONSTANT_TSC to indicate that
> it stays running across P/T state transitions.
> On Intel systems, CPUID:0x80000007:EDX:8 enables
> all of these feature flags.  (Interestingly, on
> AMD systems, X86_FEATURE_TSC_RELIABLE is *not*
> set by this bit... so my information from AMD is
> not represented in Linux (yet)).  Note also that
> in linux-2.6.30/arch/x86/kernel/cpu/vmware.c, both
> X86_FEATURE_CONSTANT_TSC and X86_FEATURE_TSC_RELIABLE
> get set.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-19 15:34     ` Dan Magenheimer
  2009-09-21 14:47       ` Dan Magenheimer
@ 2009-09-21 18:36       ` Jeremy Fitzhardinge
  2009-09-21 22:20         ` Dan Magenheimer
  2009-09-22  7:39         ` Jan Beulich
  1 sibling, 2 replies; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-21 18:36 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/19/09 08:34, Dan Magenheimer wrote:
> You're right, I don't need to differentiate between
> the two emulated cases.  I was trying to overload
> an extra piece of information that I really don't
> need to overload.
>
> However, I do need one special case to indicate
> emulation vs non-emulation, so wraparound is
> still a problem.
>   

I was assuming you'd just repurpose the existing version number scheme
which is always even, and therefore can never equal -1.

>> > If the hardware doesn't support rdtscp, how should an app know whether
>> > or not to use it?  Should it just try running rdtscp being prepared to
>> > handle a SIGILL?
>>     
> Yes, that's the plan.  I think this scheme always
> works, but only works fast if the hardware supports
> rdtscp and constant_tsc

What's the full algorithm for detecting this feature?  Usermode has to
establish:

   1. It is running under Xen (or not, if you expect this to be
      implemented on multiple hypervisors)
   2. rdtscp is available
   3. the ABI is actually being implemented, ie:
         1. the tsc_aux value actually has the correct meaning
         2. it has a working mechanism for getting the tsc scaling
            parameters
         3. (accommodate ways to evolve the ABI in a back-compatible way)

before it can do anything else.
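
Steps 1 and 2 could be sketched as below, assuming the CPUID
register values have already been obtained by whatever means works
in the guest.  The helper names are hypothetical.  Step 3 has no
defined mechanism yet, so it is not shown.

```c
#include <stdint.h>
#include <string.h>

/* CPUID leaf 0x80000001, EDX bit 27 advertises RDTSCP. */
#define CPUID_EXT_EDX_RDTSCP (1u << 27)

/* Takes the EDX value returned by CPUID leaf 0x80000001. */
static int edx_has_rdtscp(uint32_t edx)
{
    return (edx & CPUID_EXT_EDX_RDTSCP) != 0;
}

/*
 * Xen reports the 12-byte signature "XenVMMXenVMM" in EBX, ECX, EDX of
 * its hypervisor CPUID leaf (commonly at base leaf 0x40000000).
 */
static int regs_are_xen_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[12];

    /* Lay the three registers out in memory in EBX, ECX, EDX order. */
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    return memcmp(sig, "XenVMMXenVMM", 12) == 0;
}
```

Note that if the rdtscp bit is hidden from the guest cpuid, as
suggested below, edx_has_rdtscp() is no longer a usable test and
the ABI itself has to advertise the feature.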

If nothing else, it's probably worth removing the rdtscp feature from the
logical guest cpuid, so that nothing else tries to use it for its own
purposes; in other words, you're exclusively claiming rdtscp for this
ABI.  Or you could disable this ABI if a guest kernel tries to set TSC_AUX.

> I've restricted the scheme to constant_tsc as I think
> it breaks down due to nasty races if running on a
> machine where the pvclock parameters differ across
> different pcpus.  I think the races can only be
> avoided if Xen sets the TSC_AUX for all of the
> pcpus running a pvrdtscp domain while all are idle.
>
> Is there a scheme that avoids the races?
>   

rdtscp makes it quite easy to avoid races because you get the tsc and
metadata about the tsc atomically.  You just need to encode enough info
in the metadata to do the conversion.

The obvious thing to do is to pack a version number and pcpu number into
TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
for each pcpu.  If the version number matches, then it uses the
parameters it has; if not it fetches new parameters and repeats the
rdtscp.  There's no need to worry about either thread or vcpu context
switches because you get the (tsc,params) tuple atomically, which is the
tricky bit without rdtscp.

(The version number would be truncated wrt the normal pvclock version
number, but it just needs to be large enough to avoid aliasing from
wrapping; I'm assuming something like 24 bits version and 8 bits cpu
number.)
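
A sketch of that packing, assuming the 24/8 split mentioned above.
Putting the version in the high bits and the cpu number in the low
bits is an arbitrary choice made for this example, and the names
are hypothetical.

```c
#include <stdint.h>

#define AUX_CPU_BITS     8
#define AUX_CPU_MASK     ((1u << AUX_CPU_BITS) - 1)
#define AUX_VERSION_MASK 0x00ffffffu   /* 24-bit truncated version */

/* Build the 32-bit TSC_AUX value from a version and a pcpu number. */
static uint32_t aux_pack(uint32_t version, uint32_t pcpu)
{
    return ((version & AUX_VERSION_MASK) << AUX_CPU_BITS) |
           (pcpu & AUX_CPU_MASK);
}

/* Extract the truncated version from a TSC_AUX value. */
static uint32_t aux_version(uint32_t aux)
{
    return aux >> AUX_CPU_BITS;
}

/* Extract the pcpu number from a TSC_AUX value. */
static uint32_t aux_pcpu(uint32_t aux)
{
    return aux & AUX_CPU_MASK;
}
```

Usermode would compare aux_version() against the low 24 bits of
the version it has cached for aux_pcpu().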

> Fortunately, this also has the effect of greatly
> reducing the version increase frequency.
>   

I don't think that's going to be a huge issue; fetching time parameters
with a syscall/hypercall would be on the same order as doing an emulated
rdtsc, and would only need to happen, say, once per timeslice (100Hz?)
at the outside.

> The rate is synced but the values may not be.  Since
> software (BIOS or Xen) sets tsc on each processor
> it is essentially impossible to ensure they are
> identical.  The rendezvous algorithm should be able
> to set them so that they are "unobservably" different,
> but I keep hearing "within 2usec".  (It would be
> interesting to measure this across a broad set
> of machines.)  So it's probably prudent to recommend
> that apps be prepared for the possibility even if
> it never happens.
>   

You don't need to guarantee anything stronger than they'd see on bare
hardware.  You also need to be more precise about exactly what you're
guaranteeing.

Are you saying that a single thread will never see regressing tscs? 
That just requires making sure that Xen gets the tscs synced closer than
the context switch time of a thread between cpus, which should be possible.

Or are you making the stronger guarantee that two threads running
concurrently on different cpus doing rdtsc will see monotonically
increasing tscs with respect to the ordering of all their operations? 
That would require arbitrarily close syncing (well, within the time it
takes a cacheline to bounce I guess).

    J

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 18:17                     ` Keir Fraser
@ 2009-09-21 21:47                       ` Dan Magenheimer
  0 siblings, 0 replies; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 21:47 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: JeremyFitzhardinge, Xen-Devel (E-mail),
	kurt.hackel, Langsdorf, Mark, Nakajima, Jun, Alex Williamson

> > By constant_tsc I mean that CPUID:0x80000007:EDX:8
> > is set.
> 
> Well, if it is at least true for 99% of systems, then it 
> might be worth

Well I'm not sure how to count, but I'd venture to guess
that close to 99% of servers out there (that are new enough
to have this CPUID bit set) are single socket.  So
as long as constant_tsc applies across all cores in
a socket, your 99% test applies.  But according to
Intel and AMD, it should also apply across multiple
sockets, and according to HP, it applies on one big
NUMA machine even across enclosures.

> enabling constant_tsc support by default, and detect TSC divergence at
> runtime and disable dynamically. I think that's what Linux 
> does (i.e., it
> has a fallback at runtime if its TSC assumptions turn out to 
> be wrong).

Indeed Linux does, and the code looks easy to leverage.
See arch/x86/kernel/tsc_sync.c where check_tsc_sync_*
is defined, used by start_secondary() and native_cpu_up()
in arch/x86/kernel/smpboot.c.  It may actually test too
aggressively for TSC reliability as it can fail
if TSCs differ by more than "a cacheline bounce",
which is a lot more restrictive than Xen cares about
(or any userland algorithm that post-processes for
monotonicity).

In fact, Linux no longer does any kind of write_tsc(0)
at processor boot but apparently instead assumes
that the BIOS has done the synchronization.  I
don't know if/how the BIOS could do a better job than
Xen's current rendezvous algorithm, but if it does,
Xen's code may not only be superfluous but also
making problems worse.  We should probably test
for divergence and only write_tsc if the
test fails?

P.S. I was looking at 2.6.30 and 2.6.31 though it
looks like check_tsc_sync has been around since at least
2.6.24:  http://lwn.net/Articles/211051/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 18:36       ` Jeremy Fitzhardinge
@ 2009-09-21 22:20         ` Dan Magenheimer
  2009-09-21 22:50           ` Jeremy Fitzhardinge
  2009-09-22  7:39         ` Jan Beulich
  1 sibling, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 22:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

> > However, I do need one special case to indicate
> > emulation vs non-emulation, so wraparound is
> > still a problem.
> 
> I was assuming you'd just repurpose the existing version number scheme
> which is always even, and therefore can never equal -1.

That wasn't my plan but if it can be made to work (see
below), it probably saves code in Xen.

> What's the full algorithm for detecting this feature?  Usermode has to
> establish:
> 
>    1. It is running under Xen (or not, if you expect this to be
>       implemented on multiple hypervisors)
>    2. rdtscp is available
>    3. the ABI is actually being implemented, ie:
>          1. the tsc_aux value actually has the correct meaning
>          2. it has a working mechanism for getting the tsc scaling
>             parameters
>          3. (accommodate ways to evolve the ABI in a 
> back-compatible way)
> before it can do anything else.

Yes, that's what I was thinking.  I was planning on prototyping
these checks with "userland-rdmsr" but userland-hypercall or
userland-shared-page could work also.

> If nothing else, it's probably worth removing the rdtscp 
> feature from the
> logical guest cpuid, so that nothing else tries to use it for its own
> purposes; in other words, you're exclusively claiming rdtscp for this
> ABI.  Or you could disable this ABI if a guest kernel tries 
> to set TSC_AUX.

I was thinking that setting pvrdtscp=1 would override
any kernel use of rdtscp/TSC_AUX, but disabling the
cpuid has_rdtscp flag and using a different userland
detection mechanism (than checking cpuid for has_rdtscp)
would be a better way to avoid possible conflict.

> > I've restricted the scheme to constant_tsc as I think
> > it breaks down due to nasty races if running on a
> > machine where the pvclock parameters differ across
> > different pcpus.  I think the races can only be
> > avoided if Xen sets the TSC_AUX for all of the
> > pcpus running a pvrdtscp domain while all are idle.
> >
> > Is there a scheme that avoids the races? 
> 
> rdtscp makes it quite easy to avoid races because you get the tsc and
> metadata about the tsc atomically.  You just need to encode 
> enough info
> in the metadata to do the conversion.

Yes, but I don't think there are enough bits for encoding
it all (32-bits in TSC_AUX, right?).

> The obvious thing to do is to pack a version number and pcpu 
> number into
> TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
> for each pcpu.  If the version number matches, then it uses the
> parameters it has; if not it fetches new parameters and repeats the
> rdtscp.  There's no need to worry about either thread or vcpu context
> switches because you get the (tsc,params) tuple atomically, 
> which is the
> tricky bit without rdtscp.
> 
> (The version number would be truncated wrt the normal pvclock version
> number, but it just needs to be large enough to avoid aliasing from
> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
> number.)

I think a race occurs if the vcpu switches pcpu TWICE
from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
each time on pcpu-A but reads one or more pvclock parameters
(that are too big to be encoded in TSC_AUX) on pcpu-B.
If Xen can atomically bump/change
TSC_AUX on *all* pcpus running a guest vcpu, the race
can be avoided.  But I suspect that is too expensive (some
kind of rendezvous required for each bump on any processor).

> > Fortunately, this also has the effect of greatly
> > reducing the version increase frequency.
> 
> I don't think that's going to be a huge issue; fetching time 
> parameters
> with a syscall/hypercall would be on the same order as doing 
> an emulated
> rdtsc, and would only need to happen, say, once per timeslice (100Hz?)
> at the outside.

Even if my assumption of the race (above) is incorrect,
32-bits is not very much time at 100Hz.  But the version
bump needs to occur synchronously with every P/C-state
transition for pvclock to work on non_constant_tsc machines
doesn't it?  How frequent can those transitions occur?
 
> > The rate is synced but the values may not be.  Since
> > software (BIOS or Xen) sets tsc on each processor
> > it is essentially impossible to ensure they are
> > identical.  The rendezvous algorithm should be able
> > to set them so that they are "unobservably" different,
> > but I keep hearing "within 2usec".  (It would be
> > interesting to measure this across a broad set
> > of machines.)  So it's probably prudent to recommend
> > that apps be prepared for the possibility even if
> > it never happens.
> 
> You don't need to guarantee anything stronger than they'd see on bare
> hardware.  You also need to be more precise about exactly what you're
> guaranteeing.
> 
> Are you saying that a single thread will never see regressing tscs? 
> That just requires making sure that Xen gets the tscs synced 
> closer than
> the context switch time of a thread between cpus, which 
> should be possible.
> 
> Or are you making the stronger guarantee that two threads running
> concurrently on different cpus doing rdtsc will see monotonically
> increasing tscs with respect to the ordering of all their operations? 
> That would require arbitrarily close syncing (well, within 
> the time it
> takes a cacheline to bounce I guess).

I guess this all depends on what Xen is capable of
guaranteeing.  If Xen can provide a "cacheline
bounce guarantee", the app shouldn't have to care.

Linux now seems to provide a cacheline bounce guarantee for
itself, but afaik has no way to communicate that to an app
using raw rdtsc{,p} and all the relevant syscalls have a
monotonicity option and/or have insufficient resolution
to matter.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 22:20         ` Dan Magenheimer
@ 2009-09-21 22:50           ` Jeremy Fitzhardinge
  2009-09-21 23:29             ` Dan Magenheimer
  0 siblings, 1 reply; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-21 22:50 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/21/09 15:20, Dan Magenheimer wrote:
>>> However, I do need one special case to indicate
>>> emulation vs non-emulation, so wraparound is
>>> still a problem.
>>>       
>> I was assuming you'd just repurpose the existing version number scheme
>> which is always even, and therefore can never equal -1.
>>     
> That wasn't my plan but if it can be made to work (see
> below), it probably saves code in Xen.
>
>   
>> What's the full algorithm for detecting this feature?  Usermode has to
>> establish:
>>
>>    1. It is running under Xen (or not, if you expect this to be
>>       implemented on multiple hypervisors)
>>    2. rdtscp is available
>>    3. the ABI is actually being implemented, ie:
>>          1. the tsc_aux value actually has the correct meaning
>>          2. it has a working mechanism for getting the tsc scaling
>>             parameters
>>          3. (accommodate ways to evolve the ABI in a 
>> back-compatible way)
>> before it can do anything else.
>>     
> Yes, that's what I was thinking.  I was planning on prototyping
> these checks with "userland-rdmsr" but userland-hypercall or
> userland-shared-page could work also.
>
>   
>> If nothing else, it's probably worth removing the rdtscp 
>> feature from the
>> logical guest cpuid, so that nothing else tries to use it for its own
>> purposes; in other words, you're exclusively claiming rdtscp for this
>> ABI.  Or you could disable this ABI if a guest kernel tries 
>> to set TSC_AUX.
>>     
> I was thinking that setting pvrdtscp=1 would override
> any kernel use of rdtscp/TSC_AUX, but disabling the
> cpuid has_rdtscp flag and using a different userland
> detection mechanism (than checking cpuid for has_rdtscp)
> would be a better way to avoid possible conflict.
>
>   
>>> I've restricted the scheme to constant_tsc as I think
>>> it breaks down due to nasty races if running on a
>>> machine where the pvclock parameters differ across
>>> different pcpus.  I think the races can only be
>>> avoided if Xen sets the TSC_AUX for all of the
>>> pcpus running a pvrdtscp domain while all are idle.
>>>
>>> Is there a scheme that avoids the races? 
>>>       
>> rdtscp makes it quite easy to avoid races because you get the tsc and
>> metadata about the tsc atomically.  You just need to encode 
>> enough info
>> in the metadata to do the conversion.
>>     
> Yes but I don't think there is enough bits for encoding
> it all (32-bits in TSC_AUX, right?).
>
>   
>> The obvious thing to do is to pack a version number and pcpu 
>> number into
>> TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
>> for each pcpu.  If the version number matches, then it uses the
>> parameters it has; if not it fetches new parameters and repeats the
>> rdtscp.  There's no need to worry about either thread or vcpu context
>> switches because you get the (tsc,params) tuple atomically, 
>> which is the
>> tricky bit without rdtscp.
>>
>> (The version number would be truncated wrt the normal pvclock version
>> number, but it just needs to be large enough to avoid aliasing from
>> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>> number.)
>>     
> I think a race occurs if the vcpu switches pcpu TWICE
> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> each time on pcpu-A but reads one or more pvclock parameters
> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>   

That shouldn't matter.  Once the process has (tsc,cpu,version) it can
use its own local copy of cpu's pvclock parameters to compute the
tsc->ns conversion.  Once it has that triple, it doesn't matter if it
gets context-switched; the time computation doesn't depend on what CPU
is currently running. 
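
As a sketch, the conversion is just the usual pvclock scaling
applied to the cached parameters for that cpu.  The struct below is
a hypothetical usermode copy whose field names follow the pvclock
time info; it is not the real shared structure.

```c
#include <stdint.h>

/* Hypothetical usermode copy of one pcpu's pvclock parameters. */
struct clock_params {
    uint64_t tsc_timestamp;      /* TSC value when params were computed */
    uint64_t system_time;        /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point ns-per-tick factor */
    int8_t   tsc_shift;          /* pre-shift applied to the TSC delta */
};

/* Scale a TSC delta to nanoseconds: shift, then (delta * mul) >> 32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* Split delta so the product never needs more than 64 bits. */
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Uses only the sampled tsc and cached params, not the current cpu. */
static uint64_t tsc_to_ns(const struct clock_params *p, uint64_t tsc)
{
    return p->system_time +
           scale_delta(tsc - p->tsc_timestamp, p->tsc_to_system_mul,
                       p->tsc_shift);
}
```

For example, a 2GHz tsc would use tsc_to_system_mul = 0x80000000
(0.5 ns per tick) with tsc_shift = 0.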

It only needs to iterate if it gets a version mismatch.  You can
potentially get a livelock if the version is constantly changing between
the rdtscp and the get-pvclock-params, and exacerbated if the process
keeps bouncing between cpus between the two.  But given that the
rdtsc+get-params should take no more than a couple of microseconds, it
seems very unlikely the process is sustaining a megahertz CPU migration
rate.

And even if it fails, the process always has to be prepared to go to
some other time source.

> If Xen can atomically bump/change
> TSC_AUX on *all* pcpus runniing a guest vcpu, the race
> can be avoided.  But I suspect that is too expensive (some
> kind of rendezvous required for each bump on any processor).
>   

Right.  Any synchronized cross-cpu call is going to be very expensive,
and can't be done atomically without some kind of stop-the-world which
is even worse.

> Even if my assumption of the race (above) is incorrect,
> 32-bits is not very much time at 100Hz.  But the version
> bump needs to occur synchronously with every P/C-state
> transition for pvclock to work on non_constant_tsc machines
> doesn't it?  How frequent can those transitions occur?
>   

24 bits at 100Hz is 46ish hours.  So there's a potential alias problem
if the program reads the tsc at precisely 46.603 (ish) hours after its
previous read.  One workaround would be to force a re-read of the timing
parameters every X secs/mins/hours to guarantee that there's no wrap for
some expected rate of param updates.

That said, the standard pvclock algorithm is only 128 times better than
that, and I don't think it has ever been considered to be a problem.  I've
never seen an update rate of more than once every few seconds.
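
The arithmetic, as a sketch.  The "128 times" figure is read here
as 31 effective bits for the normal pvclock version (32 bits, but
always even) against 24 truncated bits; that reading is an
assumption.  The function name is made up.

```c
#include <stdint.h>

/*
 * Seconds until a version counter with `bits` effective bits wraps,
 * given a steady rate of updates per second.  Assumes bits < 64.
 */
static double wrap_seconds(unsigned bits, double updates_per_sec)
{
    return (double)((uint64_t)1 << bits) / updates_per_sec;
}
```

wrap_seconds(24, 100.0) is 2^24 / 100 = 167772.16 seconds, or
about 46.6 hours.  wrap_seconds(31, 100.0) is 128 times that,
roughly 248 days.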

Also Xen need only update the version number if something has actually
read that version; if nobody had read the current parameters, there's no
need to update the version when updating them to a new value.  That
would help mitigate the case of rapid param updates and a low rate of
reading.

> I guess this all depends on what Xen is capable of
> guaranteeing.  If Xen can provide a "cacheline
> bounce guarantee", the app shouldn't have to care.
>   

It can't, in principle, sync the tscs at a finer grain than the app can
measure.  It only has the same resources to play with, and there'll
always be some error margin.

> Linux now seems to provide a cacheline bounce guarantee for
> itself, but afaik has no way to communicate that to an app
> using raw rdtsc{,p} and all the relevant syscalls have a
> monotonicity option and/or have insufficient resolution
> to matter.
>   

It's a detail that a usermode app can't rely on anyway.

    J

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 22:50           ` Jeremy Fitzhardinge
@ 2009-09-21 23:29             ` Dan Magenheimer
  2009-09-21 23:55               ` Jeremy Fitzhardinge
  2009-09-22  7:44               ` Jan Beulich
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-21 23:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

> > I think a race occurs if the vcpu switches pcpu TWICE
> > from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> > each time on pcpu-A but reads one or more pvclock parameters
> > (that are too big to be encoded in TSC_AUX) on pcpu-B.  
> 
> That shouldn't matter.  Once the process has (tsc,cpu,version) it can
> use its own local copy of cpu's pvclock parameters to compute the
> tsc->ns conversion.  Once it has that triple, it doesn't matter if it
> gets context-switched; the time computation doesn't depend on what CPU
> is currently running. 
> 
> It only needs to iterate if it gets a version mismatch.  You can
> potentially get a livelock if the version is constantly 
> changing between
> the rdtscp and the get-pvclock-params, and exacerbated if the process
> keeps bouncing between cpus between the two.  But given that the
> rdtsc+get-params should take no more than a couple of microseconds, it
> seems very unlikely the process is sustaining a megahertz CPU 
> migration
> rate.

Yes, I neglected an important pre-condition.  ASSUME the first
rdtscp on pcpu-A gets a version mismatch so that it must fetch
the parameters again.  Then: the vcpu switches pcpu TWICE
from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
each time on pcpu-A but reads one or more pvclock parameters
(that are too big to be encoded in TSC_AUX) on pcpu-B.

I agree that this is vanishingly low probability but on
a pcpu-oversubscribed machine I think it only takes one
vcpu-to-pcpu reschedule and then a poorly timed interrupt that
causes the vcpu to be unscheduled, and then later rescheduled
on the original processor.

> And even if it fails, the process always has to be prepared to go to
> some other time source.

And the issue is that there's no way to recognize
failure.  Unless... wait... are you assuming that
every unscheduled period results in an adjustment
of the pvclock offset parameter?  That results in
"nanoseconds since guest boot during which any
vcpu is running" rather than "nanoseconds since
guest boot even when all vcpus are idle", right?
That's different than what I had in mind, but I
suppose it works.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 23:29             ` Dan Magenheimer
@ 2009-09-21 23:55               ` Jeremy Fitzhardinge
  2009-09-22  0:11                 ` Dan Magenheimer
  2009-09-22 19:36                 ` Dan Magenheimer
  2009-09-22  7:44               ` Jan Beulich
  1 sibling, 2 replies; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-21 23:55 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/21/09 16:29, Dan Magenheimer wrote:
>>> I think a race occurs if the vcpu switches pcpu TWICE
>>> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>>> each time on pcpu-A but reads one or more pvclock parameters
>>> (that are too big to be encoded in TSC_AUX) on pcpu-B.  
>>>       
>> That shouldn't matter.  Once the process has (tsc,cpu,version) it can
>> use its own local copy of cpu's pvclock parameters to compute the
>> tsc->ns conversion.  Once it has that triple, it doesn't matter if it
>> gets context-switched; the time computation doesn't depend on what CPU
>> is currently running. 
>>
>> It only needs to iterate if it gets a version mismatch.  You can
>> potentially get a livelock if the version is constantly 
>> changing between
>> the rdtscp and the get-pvclock-params, and exacerbated if the process
>> keeps bouncing between cpus between the two.  But given that the
>> rdtsc+get-params should take no more than a couple of microseconds, it
>> seems very unlikely the process is sustaining a megahertz CPU 
>> migration
>> rate.
>>     
> Yes, I neglected an important pre-condition.  ASSUME the first
> rdtscp on pcpu-A gets a version mismatch so that it must fetch
> the parameters again.  Then: the vcpu switches pcpu TWICE
> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> each time on pcpu-A but reads one or more pvclock parameters
> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>
> I agree that this is vanishingly low probability but on
> a pcpu-oversubscribed machine I think it only takes one
> vcpu-to-pcpu reschedule and then a poorly timed interrupt that
> causes the vcpu to be unscheduled, and then later rescheduled
> on the original processor.
>   

Sure.  It just has to keep iterating until it gets consistency.  If it
iterates too long (10 times?  100? 1000?) it should give up and assume
something is inherently broken.

>> And even if it fails, the process always has to be prepared to go to
>> some other time source.
>>     
> And the issue is that there's no way to recognize
> failure.

Yeah, that's a basic problem with using naked tsc as a timebase.  Any
app using it needs to be prepared to test the tsc sanity against some
other time reference regularly.

On the other hand, using the tsc as part of a larger ABI works reliably.

This rdtscp proposal is basically the latter, as a variant of the
pvclock algorithm.  I'm mostly interested in it as an implementation for
vsyscall etc, rather than something that apps would use directly.

>  Unless... wait... are you assuming that
> every unscheduled period results in an adjustment
> of the pvclock offset parameter?  That results in
> "nanoseconds since guest boot during which any
> vcpu is running" rather than "nanoseconds since
> guest boot even when all vcpus are idle", right?
> That's different than what I had in mind, but I
> suppose it works.
>   

Not following you here.

    J

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 23:55               ` Jeremy Fitzhardinge
@ 2009-09-22  0:11                 ` Dan Magenheimer
  2009-09-22  0:42                   ` Jeremy Fitzhardinge
  2009-09-22 19:36                 ` Dan Magenheimer
  1 sibling, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-22  0:11 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

> > Yes, I neglected an important pre-condition.  ASSUME the first
> > rdtscp on pcpu-A gets a version mismatch so that it must fetch
> > the parameters again.  Then: the vcpu switches pcpu TWICE
> > from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> > each time on pcpu-A but reads one or more pvclock parameters
> > (that are too big to be encoded in TSC_AUX) on pcpu-B.
> >
> > I agree that this is vanishingly low probability but on
> > a pcpu-oversubscribed machine I think it only takes one
> > vcpu-to-pcpu reschedule and then a poorly timed interrupt that
> > causes the vcpu to be unscheduled, and then later rescheduled
> > on the original processor.
> >   
> 
> Sure.  It just has to keep iterating until it gets consistency.  If it
> iterates too long (10 times?  100? 1000?) it should give up and assume
> something is inherently broken.

No, I'm not talking about iteration.  In the scenario I'm
trying to describe, the version number hasn't changed on
pcpu-A so the algorithm doesn't iterate.

> On the other hand, using the tsc as part of a larger ABI 
> works reliably.
> 
> This rdtscp proposal is basically the latter, as a variant of the
> pvclock algorithm.  I'm mostly interested in it as an 
> implementation for
> vsyscall etc, rather than something that apps would use directly.
> 
> >  Unless... wait... are you assuming that
> > every unscheduled period results in an adjustment
> > of the pvclock offset parameter?  That results in
> > "nanoseconds since guest boot during which any
> > vcpu is running" rather than "nanoseconds since
> > guest boot even when all vcpus are idle", right?
> > That's different than what I had in mind, but I
> > suppose it works.
> >   
> 
> Not following you here.

I realized after I sent this that I'm not really sure
I understand the pvclock implementation, particularly
under what circumstances the version number changes
or doesn't.  And if this is different in any way
than the versions you are proposing that the app
would see.  So I'm not positive we are considering
the same cases.

Dan

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22  0:11                 ` Dan Magenheimer
@ 2009-09-22  0:42                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-22  0:42 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/21/09 17:11, Dan Magenheimer wrote:
>>> Yes, I neglected an important pre-condition.  ASSUME the first
>>> rdtscp on pcpu-A gets a version mismatch so that it must fetch
>>> the parameters again.  Then: the vcpu switches pcpu TWICE
>>> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>>> each time on pcpu-A but reads one or more pvclock parameters
>>> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>>>
>>> I agree that this is vanishingly low probability but on
>>> a pcpu-oversubscribed machine I think it only takes one
>>> vcpu-to-pcpu reschedule and then a poorly timed interrupt that
>>> causes the vcpu to be unscheduled, and then later rescheduled
>>> on the original processor.
>>>   
>>>       
>> Sure.  It just has to keep iterating until it gets consistency.  If it
>> iterates too long (10 times?  100? 1000?) it should give up and assume
>> something is inherently broken.
>>     
> No, I'm not talking about iteration.  In the scenario I'm
> trying to describe, the version number hasn't changed on
> pcpu-A so the algorithm doesn't iterate.
>   

Well, not "change" so much as "not updated".  If the program keeps doing
a rdtscp which shows that its local copy of the parameters is out of
date, but its attempts to get up-to-date parameters keep failing
(because it keeps migrating cpus), then it will keep iterating without
converging.  Specifically, the algorithm would be:

	u64 tsc, time_ns;
	u32 aux;
	unsigned int version, cpu;
again:
	rdtscp(&tsc, &aux);
	cpu = aux >> 24;		/* physical cpu */
	version = aux & ((1 << 24) - 1);

	/* At this point tsc and cpu+version are all fetched
	   atomically and consistent, so context switch doesn't
	   matter here; apply_fixup is not dependent on currently
	   executing cpu. */

	/* note that this prob. needs some local synchronization if
	   the usermode program is multithreaded... */
	if (unlikely(version != pvclockinfo[cpu].version)) {
		struct pvclock info;
		int curcpu;		/* again, physical cpu */

		/* Always fetches current cpu parameters,
		   and tells us which cpu it is for.  If we
		   switched cpus since the rdtscp we won't end
		   up updating the out-of-date info we detected
		   but that doesn't matter because... */
		curcpu = get_new_pvclock_info(&info);
		pvclockinfo[curcpu] = info;

		/* ...we repeat assuming that we're almost certainly
		   still on the same cpu when we do rdtscp again */
		goto again;
	}

	time_ns = apply_fixup(tsc, &pvclockinfo[cpu]);

get_new_pvclock_info() can be a syscall, a hypercall, or some other
mechanism which can get a good atomic snapshot of the params, along
with the cpu number, from a shared memory region.
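
apply_fixup() is not spelled out above.  A minimal sketch of what it
might look like follows, assuming the per-cpu parameters mirror the
fields of Xen's struct vcpu_time_info (tsc_timestamp, system_time,
tsc_to_system_mul, tsc_shift).  The struct name follows the pseudocode;
the exact layout here is an assumption for illustration, and the 128-bit
multiply relies on a gcc/clang extension.

```c
#include <stdint.h>

/* Hypothetical per-cpu parameter record, modelled on the fields of
   Xen's struct vcpu_time_info. */
struct pvclock {
	uint32_t version;
	uint64_t tsc_timestamp;     /* tsc value when params were sampled */
	uint64_t system_time;       /* nanoseconds at tsc_timestamp */
	uint32_t tsc_to_system_mul; /* 32-bit fixed-point fraction */
	int8_t   tsc_shift;         /* pre-multiply shift, may be negative */
};

/* Convert a raw tsc reading to nanoseconds using one cpu's params.
   Depends only on its arguments, not on the cpu it runs on. */
static uint64_t apply_fixup(uint64_t tsc, const struct pvclock *p)
{
	uint64_t delta = tsc - p->tsc_timestamp;

	if (p->tsc_shift < 0)
		delta >>= -p->tsc_shift;
	else
		delta <<= p->tsc_shift;

	/* Scale by mul / 2^32, keeping the high bits of the product. */
	return p->system_time +
	       (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
}
```

Because the result depends only on the (tsc, params) pair, a context
switch after the rdtscp does not affect the computed time.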

> I realized after I sent this that I'm not really sure
> I understand the pvclock implementation, particularly
> under what circumstances the version number changes
> or doesn't.  And if this is different in any way
> than the versions you are proposing that the app
> would see.  So I'm not positive we are considering
> the same cases.
>   

The pvclock algorithm only changes the version if either the tsc
offset or scale has changed.  In the standard pvclock algorithm, a vcpu
sees its own pvclock version change if either the pcpu undergoes some
change which affects the tsc, *or* if the vcpu gets scheduled on a new
pcpu (which could have different offset/scale).

In the case we're talking about above, the code isn't pinned to a
particular pcpu or vcpu (as it is usermode code with no real control
over the kernel or xen schedulers), so it has to cope with preempt at
any point.  That's simplified by having the tsc and metadata fetch
atomic, so it can revalidate its parameters every time it fetches the
tsc.  In that case, Xen need only update its internal version numbers
when there's an actual change to the tsc's offset/scale without regard
to vcpu scheduling.  (And of course if the offset/scale end up being
constant, then it will never need to update the offset, and usermode
will only ever end up fetching it once per cpu.)

    J

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 18:36       ` Jeremy Fitzhardinge
  2009-09-21 22:20         ` Dan Magenheimer
@ 2009-09-22  7:39         ` Jan Beulich
  2009-09-22 17:26           ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2009-09-22  7:39 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser

>>> Jeremy Fitzhardinge <jeremy@goop.org> 21.09.09 20:36 >>>
>What's the full algorithm for detecting this feature?  Usermode has to
>establish:
>
>   1. It is running under Xen (or not, if you expect this to be
>      implemented on multiple hypervisors)
>   2. rdtscp is available
>   3. the ABI is actually being implemented, ie:
>         1. the tsc_aux value actually has the correct meaning
>         2. it has a working mechanism for getting the tsc scaling
>            parameters

This sub-2 can certainly be assumed to imply the respective sub-1.

>         3. (accommodate ways to evolve the ABI in a back-compatible way)
>
>before it can do anything else.

>The obvious thing to do is to pack a version number and pcpu number into
>TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
>for each pcpu.  If the version number matches, then it uses the
>parameters it has; if not it fetches new parameters and repeats the
>rdtscp.  There's no need to worry about either thread or vcpu context
>switches because you get the (tsc,params) tuple atomically, which is the
>tricky bit without rdtscp.
>
>(The version number would be truncated wrt the normal pvclock version
>number, but it just needs to be large enough to avoid aliasing from
>wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>number.)

I continue to think that it would be fundamentally wrong to use pCPU
numbers here: Not only do you share information with the app that it
shouldn't really care about, but you also push scalability issues to it
that the kernel is supposed to abstract out for apps.

In particular,
- the interface must not imply an upper bound for the number of
pCPU-s (i.e. a fixed 8-/24-bit separation won't work, but reducing the
version to significantly below 24 bits may cause issues),
- the app must not imply the number of pCPU-s is bounded in any way
(since, due to migration or CPU hotplug, it may grow).

While both can be addressed, this really isn't something an app should
(have to) care about.

Jan

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 23:29             ` Dan Magenheimer
  2009-09-21 23:55               ` Jeremy Fitzhardinge
@ 2009-09-22  7:44               ` Jan Beulich
  2009-09-22 15:00                 ` Dan Magenheimer
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2009-09-22  7:44 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 01:29 >>>
>Yes, I neglected an important pre-condition.  ASSUME the first
>rdtscp on pcpu-A gets a version mismatch so that it must fetch
>the parameters again.  Then: the vcpu switches pcpu TWICE
>from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>each time on pcpu-A but reads one or more pvclock parameters
>(that are too big to be encoded in TSC_AUX) on pcpu-B.

This fundamentally depends on how the pvclock parameters are being
read: While app-accessible MSRs inherently require each of the necessary
RDMSRs to be executed on the correct {p,v}CPU (unless you encode the
CPU number in the RDMSR input), an app accessible shared memory region
wouldn't have that property.

Jan

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22  7:44               ` Jan Beulich
@ 2009-09-22 15:00                 ` Dan Magenheimer
  2009-09-22 15:16                   ` Jan Beulich
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-22 15:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 01:29 >>>
> >Yes, I neglected an important pre-condition.  ASSUME the first
> >rdtscp on pcpu-A gets a version mismatch so that it must fetch
> >the parameters again.  Then: the vcpu switches pcpu TWICE
> >from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> >each time on pcpu-A but reads one or more pvclock parameters
> >(that are too big to be encoded in TSC_AUX) on pcpu-B.
> 
> This fundamentally depends on how the pvclock parameters are being
> read: While app-accessible MSRs inherently require each of 
> the necessary
> RDMSRs to be executed on the correct {p,v}CPU (unless you encode the
> CPU number in the RDMSR input), an app accessible shared memory region
> wouldn't have that property.

Hmmm...  I think a shared memory region still does have that property.
To avoid any possibility of a race, there must be a way to atomically
fetch the full set of values:

{ tsc, tsc_aux, pvclock parameters }.

(How many bits total in pvclock parameters?)

Jeremy's proposal of a userland hypercall ("get_new_pvclock_info")
can do that, but I don't see how a shared memory region can.
But a userland hypercall that writes to userland memory seems
risky.  An app can mmap memory, but if it fails to do so (either
accidentally or maliciously), bad things can happen, correct?

Pardon my x86 ignorance again:  If we define a userland rdmsr,
it could overwrite more than just EDX:EAX.  If it overwrites
all registers that can safely be changed by the calling
convention, which registers (how many bits) can it "return"?
I suspect this isn't enough for 32-bit guests, but maybe
it is for 64-bit guests?

Dan

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22 15:00                 ` Dan Magenheimer
@ 2009-09-22 15:16                   ` Jan Beulich
  2009-09-22 17:15                     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Beulich @ 2009-09-22 15:16 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: kurt.hackel, Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 17:00 >>>
>> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 22.09.09 01:29 >>>
>> >Yes, I neglected an important pre-condition.  ASSUME the first
>> >rdtscp on pcpu-A gets a version mismatch so that it must fetch
>> >the parameters again.  Then: the vcpu switches pcpu TWICE
>> >from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
>> >each time on pcpu-A but reads one or more pvclock parameters
>> >(that are too big to be encoded in TSC_AUX) on pcpu-B.
>> 
>> This fundamentally depends on how the pvclock parameters are being
>> read: While app-accessible MSRs inherently require each of 
>> the necessary
>> RDMSRs to be executed on the correct {p,v}CPU (unless you encode the
>> CPU number in the RDMSR input), an app accessible shared memory region
>> wouldn't have that property.
>
>Hmmm...  I think a shared memory region still does have that property.
>To avoid any possibility of a race, there must be a way to atomically
>fetch the full set of values:
>
>{ tsc, tsc_aux, pvclock parameters }.
>
>(How many bits total in pvclock parameters?)

Of course the expectation would be that the in-memory values are also
tagged with a version.
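
A version-tagged shared memory read could be sketched roughly as
follows.  This assumes the usual pvclock convention that the version is
odd while an update is in progress; the struct and function names are
hypothetical, and the plain field reads stand in for whatever access
primitives a real implementation would need.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout of the region the hypervisor/kernel updates. */
struct shared_pvclock {
	_Atomic uint32_t version;   /* odd while being updated */
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* Private copy taken by the reader. */
struct pvclock_snapshot {
	uint32_t version;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* Returns 0 with a consistent snapshot, or -1 if none was obtained
   within max_tries (the caller then falls back to another source). */
static int read_snapshot(struct shared_pvclock *s,
			 struct pvclock_snapshot *out, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		uint32_t v1 = atomic_load_explicit(&s->version,
						   memory_order_acquire);
		if (v1 & 1)
			continue;	/* writer is mid-update */

		out->tsc_timestamp = s->tsc_timestamp;
		out->system_time = s->system_time;
		out->tsc_to_system_mul = s->tsc_to_system_mul;
		out->tsc_shift = s->tsc_shift;

		/* Order the field reads before the version re-check. */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&s->version,
					 memory_order_relaxed) == v1) {
			out->version = v1;
			return 0;
		}
	}
	return -1;
}
```

The snapshot is only accepted when the version is the same before and
after the copy, so no single atomic fetch of all fields is needed.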

>Jeremy's proposal of a userland hypercall ("get_new_pvclock_info")
>can do that, but I don't see how a shared memory region can.
>But a userland hypercall that writes to userland memory seems
>risky.  An app can mmap memory, but if it fails to do so (either
>accidentally or maliciously), bad things can happen, correct?

No, I don't think that's more risky than writing to kernel memory - Xen
would get a page fault, and skip the write (and return -EFAULT).

>Pardon my x86 ignorance again:  If we define a userland rdmsr,
>it could overwrite more than just EDX:EAX.  If it overwrites
>all registers that can safely be changed by the calling
>convention, which registers (how many bits) can it "return"?
>I suspect this isn't enough for 32-bit guests, but maybe
>it is for 64-bit guests?

On 32-bit you have 3 registers if you don't want to touch callee
saved ones.
On 64-bit you have 7 of them (considering the differences between
Unix and Windows calling conventions, and hoping there's no third
set in use somewhere).

Jan

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22 15:16                   ` Jan Beulich
@ 2009-09-22 17:15                     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-22 17:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: kurt.hackel, Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser

On 09/22/09 08:16, Jan Beulich wrote:
>> Pardon my x86 ignorance again:  If we define a userland rdmsr,
>> it could overwrite more than just EDX:EAX.  If it overwrites
>> all registers that can safely be changed by the calling
>> convention, which registers (how many bits) can it "return"?
>> I suspect this isn't enough for 32-bit guests, but maybe
>> it is for 64-bit guests?
>>     
> On 32-bit you have 3 registers if you don't want to touch callee
> saved ones.
> On 64-bit you have 7 of them (considering the differences between
> Unix and Windows calling conventions, and hoping there's no third
> set in use somewhere).
>   

It doesn't really matter what registers you choose (but 3 is not enough;
you need around 200 bits of state for the pvclock params).  This special
rdtsc (presumably done in the same way as the Xen cpuid, with the
XEN_EMULATE_PREFIX) would need to be carefully emitted in an inline
asm, which can do whatever other fixups are required to save registers
and move values into the right place (gcc inline asm will pretty much
automate this).
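
For reference, the "around 200 bits" figure can be reproduced with a
little arithmetic, assuming the state is the non-padding fields of
Xen's struct vcpu_time_info (32-bit version, 64-bit tsc_timestamp,
64-bit system_time, 32-bit tsc_to_system_mul, 8-bit tsc_shift).  The
names below are made up for illustration.

```c
#include <stdint.h>

enum {
	/* version + tsc_timestamp + system_time + mul + shift */
	PVCLOCK_STATE_BITS = 32 + 64 + 64 + 32 + 8
};

/* Number of registers of reg_bits each needed to carry state_bits. */
static unsigned int regs_needed(unsigned int state_bits,
				unsigned int reg_bits)
{
	return (state_bits + reg_bits - 1) / reg_bits;
}
```

Under that assumption the state is 200 bits, which needs 7 registers
on 32-bit (more than the 3 caller-clobbered ones) and 4 on 64-bit
(fewer than the 7 available).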

But I think doing this direct from usermode is a bad idea; interactions
with Xen should be mediated by the kernel, even if just via a
/dev/xen/pvclock driver.

    J

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22  7:39         ` Jan Beulich
@ 2009-09-22 17:26           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-22 17:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: kurt.hackel, Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser

On 09/22/09 00:39, Jan Beulich wrote:
>>   1. It is running under Xen (or not, if you expect this to be
>>      implemented on multiple hypervisors)
>>   2. rdtscp is available
>>   3. the ABI is actually being implemented, ie:
>>         1. the tsc_aux value actually has the correct meaning
>>         2. it has a working mechanism for getting the tsc scaling
>>            parameters
>>     
> This sub-2 can certainly be assumed to imply the respective sub-1.
>   

Yeah, they're the minimum requirements of a "working ABI".  But I think
we should also have something workable if only rdtsc is available.

>> The obvious thing to do is to pack a version number and pcpu number into
>> TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
>> for each pcpu.  If the version number matches, then it uses the
>> parameters it has; if not it fetches new parameters and repeats the
>> rdtscp.  There's no need to worry about either thread or vcpu context
>> switches because you get the (tsc,params) tuple atomically, which is the
>> tricky bit without rdtscp.
>>
>> (The version number would be truncated wrt the normal pvclock version
>> number, but it just needs to be large enough to avoid aliasing from
>> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>> number.)
>>     
> I continue to think that it would be fundamentally wrong to use pCPU
> numbers here: Not only do you share information with the app that it
> shouldn't really care about, but you also push scalability issues to it
> that the kernel is supposed to abstract out for apps.
>   
As far as usermode is concerned, they're just tags to distinguish
distinct sets of parameters.  We could remap them from actual pcpu
numbers to some other key space, but I don't see much point in doing
so.  The numbers have no inherent meaning to usermode.

(Of course we could add some inherent structure to them, like adding
node numbers for NUMA systems, so that usermode has at least some idea
of how it is being mapped to hardware, at least at that instant.  But
that's a whole other discussion.)

> In particular,
> - the interface must not imply an upper bound for the number of
> pCPU-s (i.e. a fixed 8-/24-bit separation won't work, but reducing the
> version to significantly below 24 bits may cause issues),
>   

Yeah.  I was considering a mechanism whereby the version/cpu split was a
runtime option fetched from Xen.  Running out of space for CPU numbers
would be a disaster, but a smaller version space can be dealt with by
making sure that there's a new pvclock param update before the version
wraps (which you can achieve by requiring an update every X units of
wallclock time, where X is less than the expected minimum time of a wrap).
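
A runtime-selected split might be decoded along these lines.  This is
only a sketch: struct aux_layout and the helper names are hypothetical,
and how the bit count would actually be fetched from Xen is left open.

```c
#include <stdint.h>

/* Hypothetical layout descriptor: the low version_bits of TSC_AUX
   hold the (truncated) version, the remaining high bits the cpu tag.
   version_bits is assumed to be in the range 1..31. */
struct aux_layout {
	unsigned int version_bits;
};

/* Build a TSC_AUX value; the version is truncated to fit. */
static uint32_t aux_pack(const struct aux_layout *l,
			 uint32_t cpu, uint32_t version)
{
	uint32_t vmask = (UINT32_C(1) << l->version_bits) - 1;

	return (cpu << l->version_bits) | (version & vmask);
}

/* Extract the cpu tag from a TSC_AUX value. */
static uint32_t aux_cpu(const struct aux_layout *l, uint32_t aux)
{
	return aux >> l->version_bits;
}

/* Extract the truncated version from a TSC_AUX value. */
static uint32_t aux_version(const struct aux_layout *l, uint32_t aux)
{
	return aux & ((UINT32_C(1) << l->version_bits) - 1);
}
```

With version_bits set to 24 this reduces to the fixed 8/24 split used
in the earlier pseudocode; a smaller value trades version space for
more cpu tags.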

> - the app must not imply the number of pCPU-s is bounded in any way
> (since, due to migration or CPU hotplug, it may grow).
>   

Usermode might have to use a more flexible structure than a simple array
to handle arbitrary parameter keys (aka pcpu numbers).
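
One possible shape for such a structure is a small growable table
searched by key, sketched below.  The names are hypothetical and the
entry carries only a version for brevity; a real one would hold the
full parameter set, and a hash table may suit large cpu counts better.

```c
#include <stdint.h>
#include <stdlib.h>

/* One cached parameter set, identified by an opaque key. */
struct param_entry {
	uint32_t key;       /* tag taken from TSC_AUX */
	uint32_t version;   /* last version seen for this key */
};

/* Growable table; a zeroed struct is a valid empty table. */
struct param_table {
	struct param_entry *e;
	size_t n, cap;
};

/* Return the entry for key, or NULL if it is not cached yet. */
static struct param_entry *param_lookup(struct param_table *t, uint32_t key)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->e[i].key == key)
			return &t->e[i];
	return NULL;
}

/* Return the entry for key, creating it if needed.  Returns NULL on
   allocation failure.  Pointers returned earlier may be invalidated
   when the table grows. */
static struct param_entry *param_get_or_add(struct param_table *t,
					    uint32_t key)
{
	struct param_entry *p = param_lookup(t, key);

	if (p)
		return p;
	if (t->n == t->cap) {
		size_t ncap = t->cap ? t->cap * 2 : 8;
		struct param_entry *ne =
			realloc(t->e, ncap * sizeof(struct param_entry));

		if (!ne)
			return NULL;
		t->e = ne;
		t->cap = ncap;
	}
	p = &t->e[t->n++];
	p->key = key;
	p->version = 0;
	return p;
}
```

Keys are not assumed to be dense or bounded, so a tag that appears
after migration or CPU hotplug simply becomes a new entry.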

> While both can be addressed, this really isn't something an app should
> (have to) care about.
>   

I agree.  All this machinery should be wrapped up in the form of
vsyscall.  That would simplify many aspects of this discussion.

    J

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-21 23:55               ` Jeremy Fitzhardinge
  2009-09-22  0:11                 ` Dan Magenheimer
@ 2009-09-22 19:36                 ` Dan Magenheimer
  2009-09-22 19:52                   ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-22 19:36 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> This rdtscp proposal is basically the latter, as a variant of the
> pvclock algorithm.  I'm mostly interested in it as an 
> implementation for
> vsyscall etc, rather than something that apps would use directly.

> From: Jan Beulich [mailto:JBeulich@novell.com]
> I continue to think that it would be fundamentally wrong to use pCPU
> numbers here: Not only do you share information with the app that it
> shouldn't really care about, but you also push scalability 
> issues to it
> that the kernel is supposed to abstract out for apps.

While I have been hopeful that we can identify a solution that
can solve both problems (vsyscall+pvclock and pvrdtscp),
I have been concerned we might run into a fundamental conflict
since both of us may be attempting to use TSC_AUX
for somewhat different purposes.  Then in taking
a step back to think about this, I realized we may
be farther apart in our objectives than I first thought.
So I thought it would be a good idea to revisit
some assumptions.

I am assuming that rdtsc and rdtscp are always emulated, but for
some "high frequency timestamp apps" (HFTSAs) I am trying to define
a mechanism where rdtsc/rdtscp are always correct AND, in certain
constrained environments, also fast (non-emulated).

Any userland pvclock algorithm still requires a rdtsc
(or rdtscp) instruction which -- EXCEPT in those
certain constrained environments -- is emulated.
But the whole point of pvclock is to be faster than
entering the hypervisor, right?

Are you (Jeremy) still assuming that rdtsc/rdtscp are NOT
emulated?  Or are you trying to define a vsyscall+pvclock
mechanism for the same constrained environments
so that HFTSAs have a choice of using clock_gettime
instead of pvrdtsc, either of which will be fast?
Or am I missing another option altogether?

Dan

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22 19:36                 ` Dan Magenheimer
@ 2009-09-22 19:52                   ` Jeremy Fitzhardinge
  2009-09-22 20:22                     ` Dan Magenheimer
  0 siblings, 1 reply; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-22 19:52 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/22/09 12:36, Dan Magenheimer wrote:
> Are you (Jeremy) still assuming that rdtsc/rdtscp are NOT
> emulated?  Or are you trying to define a vsyscall+pvclock
> mechanism for the same constrained environments
> so that HFTSAs have a choice of using clock_gettime
> instead of pvrdtsc, either of which will be fast?
>   

Yes, I'm assuming they're not emulated.  If you're emulating them
there's no reason to add any extra complexity to usermode by adding any
other ABI: rdtsc can be rdtsc and rdtscp can be rdtscp with no
Xen/ABI-imposed constraints on TSC_AUX.

Once you're talking about layering another ABI onto the tsc, then
there's no need to consider emulation because you can do all the
necessary correction to get a canonical timestamp without it.

    J

* RE: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22 19:52                   ` Jeremy Fitzhardinge
@ 2009-09-22 20:22                     ` Dan Magenheimer
  2009-09-22 22:18                       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Magenheimer @ 2009-09-22 20:22 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

> On 09/22/09 12:36, Dan Magenheimer wrote:
> > Are you (Jeremy) still assuming that rdtsc/rdtscp are NOT
> > emulated?  Or are you trying to define a vsyscall+pvclock
> > mechanism for the same constrained environments
> > so that HFTSAs have a choice of using clock_gettime
> > instead of pvrdtsc, either of which will be fast?  
> 
> Yes, I'm assuming they're not emulated.

OK, that's what I feared.

I don't know how this decision will be made, but any pvclock
and pvrdtsc design work is very dependent on the decision.

> If you're emulating them
> there's no reason to add any extra complexity to usermode by 
> adding any
> other ABI: rdtsc can be rdtsc and rdtscp can be rdtscp with no
> Xen/ABI-imposed constraints on TSC_AUX.

The reason is to improve performance while preserving
correctness for applications that need to do tens-to-hundreds
of thousands of "timestamp reads" without changing the underlying
OS.  Whether this is a GOOD reason is subject to interpretation,
but it is a reason.

> Once you're talking about layering another ABI onto the tsc, then
> there's no need to consider emulation because you can do all the
> necessary correction to get a canonical timestamp without it.

But only at the cost of losing correctness for (whether
you consider them fundamentally broken or not) apps that
depend on the rdtsc instruction to deliver the
architecturally-defined functionality and may silently
fail or corrupt data if rdtsc silently doesn't behave as
defined.

Dan

* Re: rdtscP and xen (and maybe the app-tsc answer I've been looking for)
  2009-09-22 20:22                     ` Dan Magenheimer
@ 2009-09-22 22:18                       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 34+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-22 22:18 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: kurt.hackel, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/22/09 13:22, Dan Magenheimer wrote:
> The reason is to improve performance while preserving
> correctness for applications that need to do tens-to-hundreds
> of thousands of "timestamp reads" without changing the underlying
> OS.  Whether this is a GOOD reason is subject to interpretation,
> but it is a reason.
>   

I don't think there's anything new to add to this line of discussion.

    J

end of thread, other threads:[~2009-09-22 22:18 UTC | newest]

Thread overview: 34+ messages
2009-09-18 16:30 rdtscP and xen (and maybe the app-tsc answer I've been looking for) Dan Magenheimer
2009-09-18 20:27 ` Dan Magenheimer
2009-09-18 22:55   ` Jeremy Fitzhardinge
2009-09-19 15:34     ` Dan Magenheimer
2009-09-21 14:47       ` Dan Magenheimer
2009-09-21 18:36       ` Jeremy Fitzhardinge
2009-09-21 22:20         ` Dan Magenheimer
2009-09-21 22:50           ` Jeremy Fitzhardinge
2009-09-21 23:29             ` Dan Magenheimer
2009-09-21 23:55               ` Jeremy Fitzhardinge
2009-09-22  0:11                 ` Dan Magenheimer
2009-09-22  0:42                   ` Jeremy Fitzhardinge
2009-09-22 19:36                 ` Dan Magenheimer
2009-09-22 19:52                   ` Jeremy Fitzhardinge
2009-09-22 20:22                     ` Dan Magenheimer
2009-09-22 22:18                       ` Jeremy Fitzhardinge
2009-09-22  7:44               ` Jan Beulich
2009-09-22 15:00                 ` Dan Magenheimer
2009-09-22 15:16                   ` Jan Beulich
2009-09-22 17:15                     ` Jeremy Fitzhardinge
2009-09-22  7:39         ` Jan Beulich
2009-09-22 17:26           ` Jeremy Fitzhardinge
2009-09-21  8:17   ` Jan Beulich
2009-09-21 14:04     ` Dan Magenheimer
2009-09-21 14:18       ` Jan Beulich
2009-09-21 15:25         ` Dan Magenheimer
2009-09-21 15:41           ` Keir Fraser
2009-09-21 15:53             ` Keir Fraser
2009-09-21 16:55               ` Dan Magenheimer
2009-09-21 17:02                 ` Keir Fraser
2009-09-21 17:56                   ` Dan Magenheimer
2009-09-21 18:17                     ` Keir Fraser
2009-09-21 21:47                       ` Dan Magenheimer
2009-09-21 16:03           ` Jan Beulich
