* write_tsc in a PV domain?
@ 2009-08-25 21:54 Dan Magenheimer
  2009-08-25 22:28 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-25 21:54 UTC (permalink / raw)
  To: Xen-Devel (E-mail)

Is it "legal" to write to the TSC, e.g. via wrmsr(0x10,x,y),
in a PV kernel?  Assuming this were executed and would cause
a GPF, I can't find the code in Xen that would handle it, or
even ignore it.  There are uses of write_tsc in
linux-2.6.18-xen... perhaps that code never gets executed?

Thanks,
Dan
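
For reference, a minimal sketch of what a write_tsc(value) boils down to
before the privileged instruction is issued: WRMSR takes the MSR index in
ECX and the 64-bit value split across EDX:EAX, and 0x10 is the TSC MSR.
The struct and function names here are hypothetical, chosen only for
illustration; nothing below executes WRMSR itself.

```c
#include <stdint.h>

/* Index of the time-stamp counter MSR, as used in wrmsr(0x10, x, y). */
#define MSR_TSC 0x10u

/* Hypothetical container for the three WRMSR register operands. */
struct wrmsr_args {
    uint32_t msr;  /* loaded into ECX */
    uint32_t lo;   /* loaded into EAX (low 32 bits of the value) */
    uint32_t hi;   /* loaded into EDX (high 32 bits of the value) */
};

/* Build the operands that a write_tsc(value) would hand to WRMSR. */
static struct wrmsr_args make_write_tsc_args(uint64_t value)
{
    struct wrmsr_args a;

    a.msr = MSR_TSC;
    a.lo  = (uint32_t)(value & 0xffffffffu);
    a.hi  = (uint32_t)(value >> 32);
    return a;
}
```

In a PV kernel the WRMSR that consumes these operands runs outside ring 0,
which is why it would fault into Xen rather than reach the hardware.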

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: write_tsc in a PV domain?
  2009-08-25 21:54 write_tsc in a PV domain? Dan Magenheimer
@ 2009-08-25 22:28 ` Jeremy Fitzhardinge
  2009-08-25 23:09   ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-25 22:28 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail)

On 08/25/09 14:54, Dan Magenheimer wrote:
> Is it "legal" to write to the TSC, e.g. via wrmsr(0x10,x,y),
> in a PV kernel?  Assuming this were executed and would cause
> a GPF, I can't find the code in Xen that would handle it, or
> even ignore it.
>   

arch/x86/traps.c:emulate_privileged_op(), case 0x30.  It looks like
writing to 0x10 would be silently ignored.  Allowing it would require
careful handling to avoid screwing up timekeeping (you'd need to update
the timekeeping parameters), but also fairly pointless because it would
only affect the pcpu that the vcpu happens to be running on at that moment.

    J
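
As a rough illustration of the "case 0x30" reference: WRMSR is the
two-byte opcode 0F 30, so an emulator for faulting privileged
instructions can dispatch on the second byte. The enum and function
below are hypothetical names for a sketch, not the code in traps.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a few two-byte (0F xx) opcodes. */
enum priv_op { OP_UNKNOWN, OP_WRMSR, OP_RDTSC, OP_RDMSR };

/* Classify the instruction bytes found at the faulting IP. */
static enum priv_op classify_two_byte_opcode(const uint8_t *insn, size_t len)
{
    if (len < 2 || insn[0] != 0x0f)
        return OP_UNKNOWN;

    switch (insn[1]) {
    case 0x30: return OP_WRMSR;  /* MSR index in ECX, value in EDX:EAX */
    case 0x31: return OP_RDTSC;
    case 0x32: return OP_RDMSR;
    default:   return OP_UNKNOWN;
    }
}
```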

* RE: write_tsc in a PV domain?
  2009-08-25 22:28 ` Jeremy Fitzhardinge
@ 2009-08-25 23:09   ` Dan Magenheimer
  2009-08-26  6:23     ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-25 23:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail)

> > Is it "legal" to write to the TSC, e.g. via wrmsr(0x10,x,y),
> > in a PV kernel?  Assuming this were executed and would cause
> > a GPF, I can't find the code in Xen that would handle it, or
> > even ignore it.
> 
> arch/x86/traps.c:emulate_privileged_op(), case 0x30.  It looks like
> writing to 0x10 would be silently ignored.

Hmmm... maybe I am misreading the code but it looks like the
default case will end up with "goto fail" which will not
update IP and so will infinite loop trapping on that instruction.

It appears that write_tsc calls are made in linux-2.6.18 (though
apparently never get executed) but disappear somewhere before
2.6.24 and don't exist in 2.6.30 either.  So perhaps write_tsc
has never been executed in a PV guest and just doesn't work.

> Allowing it would require careful handling to avoid screwing up
> timekeeping (you'd need to update the timekeeping parameters), but
> also fairly pointless because it would only affect the pcpu that the
> vcpu happens to be running on at that moment.

I'm still working on TSC emulation which will return
Xen system time.  The physical TSC won't get changed,
but maintaining an offset is necessary if it's
possible for TSC to be "written".  I guess I will
ignore that possibility for now.

Hmmm... what about save/restore/migration?  For pvclock
to work properly across save/restore/migration, a Xen system
time offset must already be handled, so I'm thinking I
don't need to worry about that case.
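
To make the offset idea concrete, here is a minimal sketch assuming the
emulated TSC simply returns Xen system time plus a per-domain offset.
All names are hypothetical; this is not the patch under discussion, and
it only shows what honouring a "write" would involve if it were ever
supported.

```c
#include <stdint.h>

/* Hypothetical per-domain state for an emulated TSC. */
struct vtsc {
    uint64_t offset;  /* added to system time, modulo 2^64 */
};

/* Emulated read: system time adjusted by the stored offset. */
static uint64_t vtsc_read(const struct vtsc *v, uint64_t system_time)
{
    return system_time + v->offset;
}

/*
 * Emulated write: the physical TSC is untouched; only the offset is
 * recomputed so that a read at the same instant returns new_value.
 */
static void vtsc_write(struct vtsc *v, uint64_t system_time,
                       uint64_t new_value)
{
    v->offset = new_value - system_time;
}
```

Unsigned arithmetic is used so that writing a value below the current
system time wraps cleanly instead of needing a signed offset.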

* Re: write_tsc in a PV domain?
  2009-08-25 23:09   ` Dan Magenheimer
@ 2009-08-26  6:23     ` Keir Fraser
  2009-08-26 15:42       ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-08-26  6:23 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail)

On 26/08/2009 00:09, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> arch/x86/traps.c:emulate_privileged_op(), case 0x30.  It looks like
>> writing to 0x10 would be silently ignored.
> 
> Hmmm... maybe I am misreading the code but it looks like the
> default case will end up with "goto fail" which will not
> update IP and so will infinite loop trapping on that instruction.
> 
> It appears that write_tsc calls are made in linux-2.6.18 (though
> apparently never get executed) but disappear somewhere before
> 2.6.24 and don't exist in 2.6.30 either.  So perhaps write_tsc
> has never been executed in a PV guest and just doesn't work.

Jeremy is correct. The TSC MSR cannot be written. The most that will happen is
that Xen will print a warning message, but the WRMSR instruction will always
be skipped over.

 -- Keir

* RE: write_tsc in a PV domain?
  2009-08-26  6:23     ` Keir Fraser
@ 2009-08-26 15:42       ` Dan Magenheimer
  2009-08-26 15:58         ` Keir Fraser
  2009-08-26 19:45         ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-26 15:42 UTC (permalink / raw)
  To: Keir Fraser, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail)

> >> arch/x86/traps.c:emulate_privileged_op(), case 0x30.  It looks like
> >> writing to 0x10 would be silently ignored.
> > 
> > Hmmm... maybe I am misreading the code but it looks like the
> > default case will end up with "goto fail" which will not
> > update IP and so will infinite loop trapping on that instruction.
> > 
> > It appears that write_tsc calls are made in linux-2.6.18 (though
> > apparently never get executed) but disappear somewhere before
> > 2.6.24 and don't exist in 2.6.30 either.  So perhaps write_tsc
> > has never been executed in a PV guest and just doesn't work.
> 
> Jeremy is correct. The TSC MSR cannot be written. Most that 
> will happen is
> that Xen will print a warning message, but the WRMSR 
> instruction will always
> be skipped over.

OK, I see, wrmsr_hypervisor_regs(0x10) and mce_wrmsr(0x10) and
rdmsr_safe(0x10) all return 0, so the code at "invalid:" is
executed and a warning is printk'd.  So in the current
implementation, write_tsc is skipped over.

But ARCHITECTURALLY does Xen consider write_tsc to be a no-op
for PV domains, or is this just a case that's never been
encountered before?  In other words, if a future PV OS had a
good reason to write_tsc, would we implement it (and make
the necessary adjustments to Xen's usages of tsc) or just say,
sorry, not allowed?
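
The behaviour described above (no handler claims the write, a warning is
printed, the instruction is skipped) can be modelled roughly as follows.
This is a simplified sketch, not Xen's code: the three real checks are
collapsed into one hypothetical predicate, and the register struct is
invented for the example.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WRMSR_INSN_LEN 2u  /* WRMSR is the two-byte opcode 0F 30 */

/* Hypothetical subset of guest register state. */
struct guest_regs {
    uint64_t rip;
    uint32_t ecx, eax, edx;
};

/* Stand-in for the handlers that might accept an MSR write. */
static bool msr_write_claimed(uint32_t msr)
{
    (void)msr;
    return false;  /* in this model nothing claims any MSR, including 0x10 */
}

/* Model of "warn and skip": the write is dropped, the IP still advances. */
static void emulate_wrmsr(struct guest_regs *regs)
{
    if (!msr_write_claimed(regs->ecx))
        fprintf(stderr,
                "ignoring WRMSR to %08" PRIx32 ": %08" PRIx32 ":%08" PRIx32 "\n",
                regs->ecx, regs->edx, regs->eax);

    regs->rip += WRMSR_INSN_LEN;
}
```

The point of the model is the last line: because the IP is advanced
whether or not the write was accepted, the guest does not re-trap on the
same instruction, which is the difference from a "goto fail" path.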

* Re: write_tsc in a PV domain?
  2009-08-26 15:42       ` Dan Magenheimer
@ 2009-08-26 15:58         ` Keir Fraser
  2009-08-26 19:45         ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-08-26 15:58 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail)

On 26/08/2009 16:42, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> OK, I see, wrmsr_hypervisor_regs(0x10) and mce_wrmsr(0x10) and
> rdmsr_safe(0x10) all return 0, so the code at "invalid:" is
> executed and a warning is printk'd.  So in the current
> implementation, write_tsc is skipped over.
> 
> But ARCHITECTURALLY does Xen consider write_tsc to be a no-op
> for PV domains, or is this just a case that's never been
> encountered before?  In other words, if a future PV OS had a
> good reason to write_tsc, would we implement it (and make
> the necessary adjustments to Xen's usages of tsc) or just say,
> sorry, not allowed?

There'd have to be a good argument for supporting it. I don't think we ever
will.

 -- Keir

* Re: write_tsc in a PV domain?
  2009-08-26 15:42       ` Dan Magenheimer
  2009-08-26 15:58         ` Keir Fraser
@ 2009-08-26 19:45         ` Jeremy Fitzhardinge
  2009-08-26 20:23           ` Dan Magenheimer
  1 sibling, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-26 19:45 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Keir Fraser

On 08/26/09 08:42, Dan Magenheimer wrote:
> But ARCHITECTURALLY does Xen consider write_tsc to be a no-op
> for PV domains, or is this just a case that's never been
> encountered before?  In other words, if a future PV OS had a
> good reason to write_tsc, would we implement it (and make
> the necessary adjustments to Xen's usages of tsc) or just say,
> sorry, not allowed?
>   

You can think of it this way:  a Xen PV VCPU has no tsc.  There is a
register that can be read with "rdtsc", but that's purely part of Xen's
time ABI and is not independently useful.  The ABI includes no notion of
writing to that register.  Usermode code can execute "rdtsc", but
without access to the rest of the time parameters it just returns some
undefined bits with no relationship to time.

    J

* RE: write_tsc in a PV domain?
  2009-08-26 19:45         ` Jeremy Fitzhardinge
@ 2009-08-26 20:23           ` Dan Magenheimer
  2009-08-26 22:30             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-26 20:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Keir Fraser

> On 08/26/09 08:42, Dan Magenheimer wrote:
> > But ARCHITECTURALLY does Xen consider write_tsc to be a no-op
> > for PV domains, or is this just a case that's never been
> > encountered before?  In other words, if a future PV OS had a
> > good reason to write_tsc, would we implement it (and make
> > the necessary adjustments to Xen's usages of tsc) or just say,
> > sorry, not allowed?   
> 
> You can think of it this way:  a Xen PV VCPU has no tsc.  There is a
> register that can be read with "rdtsc", but that're purely 
> part of Xen's
> time ABI and is not independently useful.  The ABI includes 
> no notion of
> writing to that register.  Usermode code can execute "rdtsc", but
> without access to the rest of the time parameters it just returns some
> undefined bits with no relationship to time.

While I think I understand entirely why you would want to
think of it that way, there are thousands (millions?) of applications
out there that would beg to differ.  They DO assume that
rdtsc bears "some" relationship to time.  Indeed Linux itself
does.  Exactly what that relationship to time is defined to be is
open to debate, and whether Xen supports whatever relationship
is defined is also debatable (especially in the presence of
migration).  But defining rdtsc as returning random bits
is not an acceptable solution for Xen.  Dom0 won't even
boot if rdtsc returns random bits so Xen must already be
guaranteeing that rdtsc has "some" relationship to time.
We've been lucky so far with allowing rdtsc to execute directly
in hardware, but we really do need to fix it properly.

But since applications cannot WRITE to tsc and Xen has some
control over the OS->Xen PV API, it might be safe to define that
write_tsc is a no-op.

* Re: write_tsc in a PV domain?
  2009-08-26 20:23           ` Dan Magenheimer
@ 2009-08-26 22:30             ` Jeremy Fitzhardinge
  2009-08-26 23:10               ` Dan Magenheimer
  2009-08-27  8:48               ` Alan Cox
  0 siblings, 2 replies; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-26 22:30 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Keir Fraser

On 08/26/09 13:23, Dan Magenheimer wrote:
>> You can think of it this way: a Xen PV VCPU has no tsc. There is a
>> register that can be read with "rdtsc", but that're purely part of Xen's
>> time ABI and is not independently useful.  The ABI includes no notion of
>> writing to that register.  Usermode code can execute "rdtsc", but
>> without access to the rest of the time parameters it just returns some
>> undefined bits with no relationship to time.
>>     
> While I think I understand entirely why you would want to
> think of it that way, there's thousands (millions?) of applications
> out there that would beg to differ.  They DO assume that
> rdtsc bears "some" relationship to time.

They are wrong.  Linux doesn't offer the tsc to usermode for its use. 
The closest it gets is vgettimeofday, which we could implement better.

>   Indeed Linux itself
> does. 

A pv linux guest doesn't have a TSC in the same way that it doesn't have
a TSS or any number of other CPU features.  It would be a grave error
for the kernel to use a tsc-based clocksource rather than the Xen pv
clocksource.  A Xen PV VCPU bears a passing resemblance to an Intel x86
CPU, but should not be confused with one.

>  Exactly what that relationship to time is defined to be is
> open to debate, and whether Xen supports whatever relationship
> is defined is also debatable (especially in the presence of
> migration).  But defining rdtsc as returning random bits
> is not an acceptable solution for Xen.  Dom0 won't even
> boot if rdtsc returns random bits so Xen must already be
> guaranteeing that rdtsc has "some" relationship to time.
>   

No, it really doesn't.  It provides a PV clock, which includes "rdtsc"
as part of its ABI.  It is not a general tsc.  You can't meaningfully
execute "rdtsc" without also being (indirectly) aware of what pcpu it's
running on and applying the appropriate corrections to turn it into
system monotonic time.  Executing rdtsc willy-nilly gets you useless
results; fortunately no PV Xen kernel does that.
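
The "appropriate corrections" can be sketched as below. The field names
follow my understanding of Xen's public vcpu_time_info layout; the
struct is renamed here and the helper names are hypothetical. A real
reader must also retry on a changing version field and must sample the
parameters on the same pcpu as the rdtsc, both of which are omitted.

```c
#include <stdint.h>

/* Per-vcpu time parameters, modelled on Xen's vcpu_time_info. */
struct time_info {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;      /* TSC value at the last update */
    uint64_t system_time;        /* ns of system time at that update */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* pre-shift applied to the TSC delta */
    int8_t   pad1[3];
};

/* Compute (delta shifted by 'shift') * mul / 2^32 in 64-bit arithmetic. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* Split the multiply so the intermediate fits in 64 bits. */
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Turn a raw rdtsc sample into system time in nanoseconds. */
static uint64_t tsc_to_system_time(const struct time_info *t, uint64_t tsc)
{
    return t->system_time +
           scale_delta(tsc - t->tsc_timestamp,
                       t->tsc_to_system_mul, t->tsc_shift);
}
```

Without tsc_timestamp, system_time and the scaling factors, the raw
counter value on its own says nothing about elapsed nanoseconds.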

> We've been lucky so far with allowing rdtsc to execute directly
> in hardware, but we really do need to fix it properly.
No, that's false.  The current Xen time model works fine for all guests
using it correctly.

Emulating rdtsc for hvm guests is another question entirely.

> But since applications cannot WRITE to tsc and Xen has some
> control over the OS->Xen PV API, it might be safe to define that
> write_tsc is a no-op.
>   

No, write_tsc is meaningless, and anyone trying to execute it is not
even wrong.

    J

* RE: write_tsc in a PV domain?
  2009-08-26 22:30             ` Jeremy Fitzhardinge
@ 2009-08-26 23:10               ` Dan Magenheimer
  2009-08-27  8:39                 ` Chris Lalancette
  2009-08-27  8:48               ` Alan Cox
  1 sibling, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-26 23:10 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Keir Fraser

> > While I think I understand entirely why you would want to
> > think of it that way, there's thousands (millions?) of applications
> > out there that would beg to differ.  They DO assume that
> > rdtsc bears "some" relationship to time.
> 
> They are wrong.  Linux doesn't offer the tsc to usermode for its use. 
> The closest it gets is vgettimeofday, which we could implement better.

Linux doesn't have to offer it.  The Intel x86 CPU does.  It's
a legal instruction for an app to use and (quote from Intel
SDM) "is guaranteed to return a monotonically increasing
unique value whenever executed except for 64-bit wraparound."
While that's not precisely a "relationship" to time, mere
mortals programming are likely to interpret it that way.

(Keir, please note that it says monotonically-increasing,
not monotonically-non-decreasing, so the current softtsc
implementation for HVM I think is incorrect.)
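
To illustrate the distinction, a minimal sketch of forcing strictly
increasing results from a source that may repeat or step backwards.
The names are hypothetical, this is not the softtsc code, and it
ignores both locking and 64-bit wraparound.

```c
#include <stdint.h>

/* Hypothetical state: the last value handed back to the guest. */
struct soft_tsc {
    uint64_t last;
};

/*
 * Return a value strictly greater than every value returned before.
 * A non-decreasing variant would use "<" here and could repeat values.
 */
static uint64_t soft_tsc_next(struct soft_tsc *s, uint64_t raw)
{
    if (raw <= s->last)
        raw = s->last + 1;

    s->last = raw;
    return raw;
}
```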

> >   Indeed Linux itself does. 
> 
> A pv linux guest doesn't have a TSC in the same way that it 
> doesn't have
> a TSS or any number of other CPU features.  It would be a grave error
> for the kernel to use a tsc-based clocksource rather than the Xen pv
> clocksource.  A Xen PV VCPU bears a passing resemblance to an 
> Intel x86
> CPU, but should not be confused with one.

So are you going to guarantee that 2.6.31 Linux when running
on Xen has no uses or dependencies on rdtsc delivering anything
other than a random value?

> >  Exactly what that relationship to time is defined to be is
> > open to debate, and whether Xen supports whatever relationship
> > is defined is also debatable (especially in the presence of
> > migration).  But defining rdtsc as returning random bits
> > is not an acceptable solution for Xen.  Dom0 won't even
> > boot if rdtsc returns random bits so Xen must already be
> > guaranteeing that rdtsc has "some" relationship to time.
> 
> No, it really doesn't.  It provides a PV clock, which includes "rdtsc"
> as part of its ABI.  It is not a general tsc.  You can't meaningfully
> execute "rdtsc" without also being (indirectly) aware of what pcpu its
> running on and applying the appropriate corrections to turn it into
> system monotonic time.  Executing rdtsc willy-nilly gets you useless
> results; fortunately no PV Xen kernel does that.

While what you are saying may seem reasonable, I think you
will find by looking at linux-2.6.18-xen that it is not true
in reality.  If you trap kernel uses of rdtsc and return random
values, dom0 will not boot.

> > We've been lucky so far with allowing rdtsc to execute directly
> > in hardware, but we really do need to fix it properly.
> No, that's false.  The current Xen time model works fine for all guests
> using it correctly.
> 
> Emulating rdtsc for hvm guests is another question entirely.

In the end, I don't care if rdtsc's in the kernel are emulated
(and the patch I submitted earlier doesn't emulate them other
than to do a "slow" rdtsc).  But apps don't care if they are
running on an HVM or a PVM, so if they use rdtsc, even if you
believe that usage of rdtsc is incorrect, rdtsc must deliver
what the Intel ABI guarantees.

> > But since applications cannot WRITE to tsc and Xen has some
> > control over the OS->Xen PV API, it might be safe to define that
> > write_tsc is a no-op.
> 
> No, write_tsc is meaningless, and anyone trying to execute it is not
> even wrong.

In that case, are you saying it is an illegal instruction for a PV
guest to execute?  If so, we should not ignore it, we should fail
the guest.  But that would be unfortunate for the RHEL5-64bit
PV guests that actually DO use it.

Dan

* Re: write_tsc in a PV domain?
  2009-08-26 23:10               ` Dan Magenheimer
@ 2009-08-27  8:39                 ` Chris Lalancette
  2009-08-27 13:00                   ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Chris Lalancette @ 2009-08-27  8:39 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

Dan Magenheimer wrote:
> In that case, are you saying it is an illegal instruction for a PV
> guest to execute?  If so, we should not ignore it, we should fail
> the guest.  But that would be unfortunate for the RHEL5-64bit
> PV guests that actually DO use it.

Wait, what?  Could you point out where this is in RHEL-5 64-bit PV?  The only
case of write_tsc() I see in the code is in arch/i386/kernel/smpboot.c, which is
not used by the Xen PV implementation in RHEL-5.  Where else in the PV
implementation does a write_tsc?

-- 
Chris Lalancette

* Re: write_tsc in a PV domain?
  2009-08-26 22:30             ` Jeremy Fitzhardinge
  2009-08-26 23:10               ` Dan Magenheimer
@ 2009-08-27  8:48               ` Alan Cox
  2009-08-27 19:10                 ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 61+ messages in thread
From: Alan Cox @ 2009-08-27  8:48 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser

> as part of its ABI.  It is not a general tsc.  You can't meaningfully
> execute "rdtsc" without also being (indirectly) aware of what pcpu its
> running on and applying the appropriate corrections to turn it into
> system monotonic time.  Executing rdtsc willy-nilly gets you useless
> results; fortunately no PV Xen kernel does that.

Actually for user space this isn't at all true. You can use rdtsc
directly and sample the data for things like profiling then correct for
things like spikes and skews from processor switches by filtering.
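
One very simple form of such filtering, as a sketch only: take the
deltas between consecutive samples and discard any that go backwards or
exceed a threshold. The function name and the threshold parameter are
assumptions for illustration; real profilers use more careful
statistics.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Mean of the deltas between consecutive TSC samples, skipping deltas
 * that are zero or negative (e.g. a switch to a processor whose counter
 * is behind) or larger than max_delta (spikes). Returns 0 if nothing
 * survives the filter.
 */
static uint64_t filtered_mean_delta(const uint64_t *samples, size_t n,
                                    uint64_t max_delta)
{
    uint64_t sum = 0;
    size_t kept = 0;

    for (size_t i = 1; i < n; i++) {
        if (samples[i] <= samples[i - 1])
            continue;  /* counter did not advance: discard */

        uint64_t d = samples[i] - samples[i - 1];
        if (d > max_delta)
            continue;  /* implausibly large jump: discard */

        sum += d;
        kept++;
    }

    return kept ? sum / kept : 0;
}
```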

> No, write_tsc is meaningless, and anyone trying to execute it is not
> even wrong.

Writing to the tsc is perfectly reasonable providing the tsc is an
advertised feature. Being able to use the tsc becomes much more relevant
with newer processors which have sane tsc implementations in the
architecture however.

Unfortunately if you hide the tsc and hide the tsc flag in the cpu info
lots of stuff doesn't run due to crap coding 8(

* RE: write_tsc in a PV domain?
  2009-08-27  8:39                 ` Chris Lalancette
@ 2009-08-27 13:00                   ` Dan Magenheimer
  2009-08-27 13:17                     ` Chris Lalancette
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-27 13:00 UTC (permalink / raw)
  To: Chris Lalancette; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

> Dan Magenheimer wrote:
> > In that case, are you saying it is an illegal instruction for a PV
> > guest to execute?  If so, we should not ignore it, we should fail
> > the guest.  But that would be unfortunate for the RHEL5-64bit
> > PV guests that actually DO use it.
> 
> Wait, what?  Could you point out where this is in RHEL-5 
> 64-bit PV?  The only
> case of write_tsc() I see in the code is in 
> arch/i386/kernel/smpboot.c, which is
> not used by the Xen PV implementation in RHEL-5.  Where else in the PV
> implementation does a write_tsc?

Hi Chris --

I was surprised also, and digging deeper it looks like I was mistaken.

I instrumented a hypervisor so that Xen would printk a console
message if it was ignoring a wrmsr and was getting output
when I launched a RHEL-5 PV guest.  But I refined the
printk and it is NOT wrmsr(0x10) so you're right, it is
NOT a write_tsc.

Thanks for pointing out my error.
Dan

* Re: write_tsc in a PV domain?
  2009-08-27 13:00                   ` Dan Magenheimer
@ 2009-08-27 13:17                     ` Chris Lalancette
  0 siblings, 0 replies; 61+ messages in thread
From: Chris Lalancette @ 2009-08-27 13:17 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

Dan Magenheimer wrote:
>> Dan Magenheimer wrote:
>>> In that case, are you saying it is an illegal instruction for a PV
>>> guest to execute?  If so, we should not ignore it, we should fail
>>> the guest.  But that would be unfortunate for the RHEL5-64bit
>>> PV guests that actually DO use it.
>> Wait, what?  Could you point out where this is in RHEL-5 64-bit PV?  The only
>> case of write_tsc() I see in the code is in arch/i386/kernel/smpboot.c, which is
>> not used by the Xen PV implementation in RHEL-5.  Where else in the PV
>> implementation does a write_tsc?
> 
> Hi Chris --
> 
> I was surprised also, and digging deeper it looks like I was mistaken.
> 
> I instrumented a hypervisor so that Xen would printk a console
> message if it was ignoring a wrmsr and was getting output
> when I launched a RHEL-5 PV guest.  But I refined the
> printk and it is NOT wrmsr(0x10) so you're right, it is
> NOT a write_tsc.
> 
> Thanks for pointing out my error.

OK, cool, no problem.  I just wanted to make sure I wasn't missing something.

Thanks,
-- 
Chris Lalancette

* Re: write_tsc in a PV domain?
  2009-08-27  8:48               ` Alan Cox
@ 2009-08-27 19:10                 ` Jeremy Fitzhardinge
  2009-08-28  3:29                   ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-27 19:10 UTC (permalink / raw)
  To: Alan Cox; +Cc: Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser

On 08/27/09 01:48, Alan Cox wrote:
>> as part of its ABI.  It is not a general tsc.  You can't meaningfully
>> execute "rdtsc" without also being (indirectly) aware of what pcpu its
>> running on and applying the appropriate corrections to turn it into
>> system monotonic time.  Executing rdtsc willy-nilly gets you useless
>> results; fortunately no PV Xen kernel does that.
>>     
> Actually for user space this isn't at all true. You can use rdtsc
> directly and sample the data for things like profiling then correct for
> things like spikes and skews from processor switches by filtering.
>   

If an app is sophisticated enough to do this correctly then it doesn't need any
special assistance from a hypervisor to make the tsc well-behaved.  It
should continue to work even in a Xen guest where both the process can
skip between VCPUs and the VCPUs can skip between PCPUs.

>> No, write_tsc is meaningless, and anyone trying to execute it is not
>> even wrong.
>>     
> Writing to the tsc is perfectly reasonable providing the tsc is an
> advertised feature. Being able to use the tsc becomes much more relevant
> with newer processors which have sane tsc implementations in the
> architecture however.
>   

Apparently on some large servers the tsc is only synced and sane within
a NUMA node, and not globally across all processors, so any app which
assumed sane tsc behaviour would break when the hardware gets scaled up.

But in this case I'm talking specifically about a Xen PV guest, where
the tsc is claimed for use by the Xen clocksource ABI.

> Unfortunately if you hide the tsc and hide the tsc flag in the cpu info
> lots of stuff doesn't run due to crap coding 8(
>   

And you can't actually hide the TSC flag in cpuid without virtualization
extensions.
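
For context, the flag in question is bit 4 of EDX from CPUID leaf 1.
CPUID is not a privileged instruction, so without hardware
virtualization support it does not fault when a PV guest executes it and
the hypervisor gets no chance to alter the result. A minimal sketch of
the test applications typically make (the names are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, EDX bit 4: processor supports the time-stamp counter. */
#define CPUID_1_EDX_TSC (1u << 4)

/* Check the TSC feature bit in an EDX value obtained from CPUID leaf 1. */
static bool edx_advertises_tsc(uint32_t cpuid_1_edx)
{
    return (cpuid_1_edx & CPUID_1_EDX_TSC) != 0;
}
```

On GCC or Clang for x86, the EDX value could be obtained with
__get_cpuid(1, &eax, &ebx, &ecx, &edx) from <cpuid.h>; that call is left
out of the sketch to keep it architecture-neutral.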

    J

* RE: write_tsc in a PV domain?
  2009-08-27 19:10                 ` Jeremy Fitzhardinge
@ 2009-08-28  3:29                   ` Dan Magenheimer
  2009-08-28  9:49                     ` Alan Cox
  2009-08-28 17:02                     ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-28  3:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Alan Cox; +Cc: Xen-Devel (E-mail), Keir Fraser

> On 08/27/09 01:48, Alan Cox wrote:
> >> as part of its ABI.  It is not a general tsc.  You can't 
> meaningfully
> >> execute "rdtsc" without also being (indirectly) aware of 
> what pcpu its
> >> running on and applying the appropriate corrections to turn it into
> >> system monotonic time.  Executing rdtsc willy-nilly gets 
> you useless
> >> results; fortunately no PV Xen kernel does that.
> >>     
> > Actually for user space this isn't at all true. You can use rdtsc
> > directly and sample the data for things like profiling then 
> correct for
> > things like spikes and skews from processor switches by filtering. 
> 
> If an app is sophisticated to do this correctly then it 
> doesn't need any
> special assistance from a hypervisor to make the tsc well-behaved.  It
> should continue to work even in a Xen guest where both the process can
> skip between VCPUs and the VCPUs can skip between PCPUs.

No, I don't think this is true.  An enterprise app that binds processes
to fixed physical processors on a physical machine can make
assumptions about the results of rdtsc that aren't valid when
the vcpus can skip between pcpus.  Further, like Linux itself,
applications may test assumptions about tsc at startup that are
assumed to remain valid for the life of the app, which is
perfectly reasonable on a physical machine and a bad mistake
in a virtualized environment.

> >> No, write_tsc is meaningless, and anyone trying to execute it is not
> >> even wrong.
> >>     
> > Writing to the tsc is perfectly reasonable providing the tsc is an
> > advertised feature. Being able to use the tsc becomes much more relevant
> > with newer processors which have sane tsc implementations in the
> > architecture however.
> 
> Apparently on some large servers the tsc is only synced and sane within
> a NUMA node, and not globally across all processors, so any app which
> assumed sane tsc behaviour would break when the hardware gets scaled up.

True, but any app that tries to run on a NUMA machine without
being aware of the idiosyncrasies of a NUMA machine probably
has worse problems to deal with than tsc sync.  Further, there
are many many apps that will likely never ever run on those
machines.  Are we going to penalize all apps all the time
because some might run some of the time on a machine where
tsc is not synced?

> But in this case I'm talking specifically about a Xen PV guest, where
> the tsc is claimed for use by the Xen clocksource ABI.

I just don't understand how you can say that a valid userland
instruction is "claimed for use" by Xen (or Linux or both).

* Re: write_tsc in a PV domain?
  2009-08-28  3:29                   ` Dan Magenheimer
@ 2009-08-28  9:49                     ` Alan Cox
  2009-08-28 15:16                       ` Dan Magenheimer
  2009-08-28 17:02                     ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 61+ messages in thread
From: Alan Cox @ 2009-08-28  9:49 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

> No, I don't think this is true.  An enterprise app that binds processes
> to fixed physical processors on a physical machine can make
> assumptions about the results of rdtsc that aren't valid when
> the vcpus can skip between pcpus.  Further, like Linux itself,

They rarely make the right assumptions.

> applications may test assumptions about tsc at startup that are
> assumed to remain valid for the life of the app, which is
> perfectly reasonable on a physical machine

No it isn't because of things like suspend/resume.

> True, but any app that tries to run on a NUMA machine without
> being aware of the idiosyncracies of a NUMA machine probably
> has worse problems to deal with than tsc sync.  Further, there

Disagree - this is true if your NUMA factor is large but quite a few
machines today are "vaguely NUMA" - the NUMA factor is low enough the app
doesn't need to care. Anyway you don't need NUMA to see TSC skew between
cores.

* RE: write_tsc in a PV domain?
  2009-08-28  9:49                     ` Alan Cox
@ 2009-08-28 15:16                       ` Dan Magenheimer
  2009-08-28 15:30                         ` Alan Cox
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-28 15:16 UTC (permalink / raw)
  To: Alan Cox; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

Hi Alan --

> > No, I don't think this is true.  An enterprise app that binds processes
> > to fixed physical processors on a physical machine can make
> > assumptions about the results of rdtsc that aren't valid when
> > the vcpus can skip between pcpus.  Further, like Linux itself,
> 
> They rarely make the right assumptions

I freely admit that there is a high percentage of
apps-that-use-rdtsc that are at risk of being buggy
if moved from a "tsc safe" machine to a "tsc unsafe"
machine.  But, echoing your earlier reply, there are
some that are careful and smart about using rdtsc.

Jeremy's claim is that because some apps-that-use-
rdtsc risk bugginess, Xen can claim rdtsc for its own
use and effectively disallow all uses of rdtsc in any
app by breaking the existing, sometimes-useful semantics
of the instruction.

> > True, but any app that tries to run on a NUMA machine without
> > being aware of the idiosyncracies of a NUMA machine probably
> > has worse problems to deal with than tsc sync.  Further, there
> 
> Disagree - this is true if your NUMA factor is large but quite a few
> machines today are "vaguely NUMA" - the NUMA factor is low enough the app
> doesn't need to care. Anyway you don't need NUMA to see TSC skew between
> cores.

Yes, but I think we are agreeing here.  My point, poorly
made I admit, is that there are a lot of different machine
topologies and we can't force all applications to
conform to the lowest common denominator.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: write_tsc in a PV domain?
  2009-08-28 15:16                       ` Dan Magenheimer
@ 2009-08-28 15:30                         ` Alan Cox
  2009-08-28 17:49                           ` rdtsc: correctness vs performance on Xen (and KVM?) Dan Magenheimer
  2009-08-28 17:49                           ` write_tsc in a PV domain? Dan Magenheimer
  0 siblings, 2 replies; 61+ messages in thread
From: Alan Cox @ 2009-08-28 15:30 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

> Jeremy's claim is that because some apps-that-use-
> rdtsc risk bugginess, Xen can claim rdtsc for its own
> use and effectively disallow all uses of rdtsc in any
> app by breaking the existing, sometimes-useful semantics
> of the instruction.

If Xen is hiding the tsc cpu feature from the kernel/apps it can. One
problem there is a lot of grotty code simply explodes without rdtsc
working.

The alternative is to virtualise the TSC as some other hypervisors do but
that has other impacts.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: write_tsc in a PV domain?
  2009-08-28  3:29                   ` Dan Magenheimer
  2009-08-28  9:49                     ` Alan Cox
@ 2009-08-28 17:02                     ` Jeremy Fitzhardinge
  2009-08-28 17:49                       ` Dan Magenheimer
  1 sibling, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-28 17:02 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

On 08/27/09 20:29, Dan Magenheimer wrote:
>> If an app is sophisticated enough to do this correctly then it 
>> doesn't need any
>> special assistance from a hypervisor to make the tsc well-behaved.  It
>> should continue to work even in a Xen guest where both the process can
>> skip between VCPUs and the VCPUs can skip between PCPUs.
>>     
> No, I don't think this is true.  An enterprise app that binds processes
> to fixed physical processors on a physical machine can make
> assumptions about the results of rdtsc that aren't valid when
> the vcpus can skip between pcpus. 

You can bind a vcpu to a pcpu or group of pcpus with the right tsc
properties.  At this point you're talking about a specialized
non-portable app with very sensitive dependencies on the system software
and underlying hardware, so requiring some special effort to virtualize
it doesn't seem like a big problem.

>  Further, like Linux itself,
> applications may test assumptions about tsc at startup that are
> assumed to remain valid for the life of the app, which is
> perfectly reasonable on a physical machine and a bad mistake
> in a virtualized environment.
>   

Not really.  An app can't tell whether its initial test happened to be
in a stable period that will be later upset by a power event,
suspend/resume, migration via some other mechanism (like
vserver/containers), etc, etc.  An app making such assumptions will be
very machine and system dependent, and not at all portable.

> True, but any app that tries to run on a NUMA machine without
> being aware of the idiosyncrasies of a NUMA machine probably
> has worse problems to deal with than tsc sync.  Further, there
> are many many apps that will likely never ever run on those
> machines.

Who can say?  Effects caused by locality issues will only result in
performance problems rather than outright correctness problems.

>   Are we going to penalize all apps all the time
> because some might run some of the time on a machine where
> tsc is not synced?
>   
They're already penalized.  The population of machines with a tsc which
can be used in the manner you're suggesting is very small, and even then
there are strong caveats.

>> But in this case I'm talking specifically about a Xen PV guest, where
>> the tsc is claimed for use by the Xen clocksource ABI.
>>     
> I just don't understand how you can say that a valid userland
> instruction is "claimed for use" by Xen (or Linux or both).
>   

Apps are free to try and use the tsc in any way they feel like, but it
has never had any guaranteed properties.  Some uses are completely
reasonable (like using it as some entropy to seed an RNG, for example). 
At one point the kernel did disable the tsc for usermode use, but that
was quickly reverted (or perhaps it never made it to mainline) because
it's not for the kernel to break backwards compatibility for the sake of
second-guessing usermode.

I think this is getting a bit repetitive.

    J

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: write_tsc in a PV domain?
  2009-08-28 17:02                     ` Jeremy Fitzhardinge
@ 2009-08-28 17:49                       ` Dan Magenheimer
  2009-08-28 23:01                         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-28 17:49 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

> I think this is getting a bit repetitive.

True, and we are going down some unfortunate
ratholes.  So let's see if we can focus on
the core of the disagreement.

> Apps are free to try and use the tsc in any way they
> feel like, but it has never had any
> GUARANTEED [djm's emphasis] properties.

I think this is the key difference of opinion which
must be resolved.  If what you say is true, your
other positions make sense.  If it is false,
they make much less sense.  (And unfortunately
it is not a black and white issue.)

There ARE guaranteed properties specified by
the Intel SDM for any _single_ processor,
namely that rdtsc is "guaranteed to return
a monotonically increasing unique value whenever
executed, except for 64-bit counter wraparound.
Intel guarantees that the time-stamp counter
will not wrap-around within 10 years after being
reset."  Both uses of the word "guarantee"
are quoted from the Intel SDM.

What is NOT guaranteed, but is widely and
incorrectly assumed to be implied and has
gotten us into this mess, is that
the same properties apply across multiple
processors.  And there are notable examples
of systems where the properties do NOT apply.
So it is true that an app that
does not know conclusively that certain threads
are running on certain processors cannot
always safely use rdtsc to obtain the
single-processor-guaranteed results.

BUT some software systems (including VMware) do
provide this guarantee across multiple processors.
And recent families of both Intel and AMD
multi-core have advanced to the point where
the properties apply across all cores, so
on the vast majority (but admittedly not all)
of future physical systems, apps can and will
use rdtsc and expect the properties to apply
(whether guaranteed or not).

So in your opinion, some systems are broken
so Xen should assume all future systems are
broken.  In my opinion, the problem is being
fixed in hardware and has always been fixed
in VMware, so Xen should look to the future
not the past.

Does that sound like a good summary of this
disagreement?

P.S. Summarizing the broader discussion on a
new thread.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* rdtsc: correctness vs performance on Xen (and KVM?)
  2009-08-28 15:30                         ` Alan Cox
@ 2009-08-28 17:49                           ` Dan Magenheimer
  2009-08-31 23:52                             ` Dan Magenheimer
  2009-08-28 17:49                           ` write_tsc in a PV domain? Dan Magenheimer
  1 sibling, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-28 17:49 UTC (permalink / raw)
  To: Xen-Devel (E-mail); +Cc: Jeremy Fitzhardinge, Keir Fraser, Alan Cox

To summarize:

Xen and KVM currently allow rdtsc to be executed
directly by userland.  As a result, apps that
use rdtsc smartly and effectively on (some) physical
machines may break badly in Xen or KVM because of
the disassociation of physical and virtual cpus.
(Readers not familiar with why rdtsc is a problem
can read e.g. http://en.wikipedia.org/wiki/Rdtsc.)

VMware always emulates rdtsc, both for kernel and
userland rdtsc's. (I don't know what HyperV does.)

Xen currently has a boot option to always emulate
rdtsc in HVM guests and just added code such that
the same boot option will always emulate rdtsc for
userland-only in PVM guests.  There is some agreement
in the Xen community that rdtsc emulation should
always be the default though the default is currently
off.  KVM is having a similar discussion and, I'm
told, has also come to the conclusion that emulating
rdtsc is a necessary evil.

The problem is that emulating rdtsc is slow.  On
my dual-core Conroe, rdtsc is about 72 cycles and
emulating rdtsc (returning a fixed frequency 1GHz
Xen monotonic system time) is over 15x slower.
This is a big hit for apps that do tens to hundreds
of thousands of rdtsc's per processor per second.
(And yes these apps are more common than one
might think.)

VMware has the advantage of binary translation;
rdtsc can be translated to return a "conforming"
value in ~200 cycles (on an older processor so
probably faster if you are comparing against my
dual-core Conroe numbers above).  This value
is "stale" (not linear with wallclock time).
For VMs that need rdtsc to more accurately reflect
wallclock time, full emulation can be optionally
enabled for a VM.

I'm searching for alternatives that provide the
correctness of emulation, but better performance
than emulation.  Jeremy points out that the
pvclock mechanism in upstream Linux works well,
but the pvclock data is currently only exposed
to kernel... and exposing it to userland still
requires apps-using-rdtsc to be rewritten.
But Jeremy claims that all apps-that-use-rdtsc
MUST be rewritten because using rdtsc is unsafe,
and that they should be rewritten to use
gettimeofday (or actually vgettimeofday).
But on older OS's (including the vast majority
of installed units) and machines where tsc is
"unsafe", gettimeofday can be MUCH slower than
emulating rdtsc.  So telling app writers to
convert all uses of rdtsc to gettimeofday is
not an acceptable solution for these apps in
the short term.

My current thinking is that we (the Linux and
Xen and KVM community) should architect a
userland API using the pvclock mechanism.
The underlying implementation of this API would
utilize Linux only to "register" the mechanism,
preferably via a module so that it, like
disk and network frontends, could easily be
bolted on to shipping OS's.  Individual uses
of "pvclock_read" would need no syscall... like
the kernel pvclock mechanism, they need only
access memory to get the necessary scaling
and offset data.  Once instantiated, rdtsc
is executed directly by the app as part of the
pvclock protocol.  If never registered,
rdtsc would always be trapped and emulated.
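
For concreteness, a userland read in the style of the kernel pvclock
mechanism might look roughly like the C sketch below.  It is only an
illustration under assumptions: the struct mirrors the
pvclock_vcpu_time_info layout shared by Xen and KVM, the names
pvclock_info, scale_delta and pvclock_read_ns are made up here, and
the tsc value is passed in as a parameter (a real reader would execute
rdtsc inside the retry loop) so that the arithmetic can be checked
without the instruction.

```c
#include <stdint.h>
#include <stdatomic.h>

/* Assumed to mirror pvclock_vcpu_time_info; the struct name is hypothetical. */
struct pvclock_info {
    uint32_t version;            /* odd while the hypervisor is updating */
    uint32_t pad0;
    uint64_t tsc_timestamp;      /* tsc value at the last update */
    uint64_t system_time;        /* nanoseconds at the last update */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point ns-per-tick multiplier */
    int8_t   tsc_shift;          /* pre-shift applied to the tsc delta */
    uint8_t  pad[3];
};

/* Scale a tsc delta to nanoseconds: shift, then multiply by mul / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    uint64_t hi, lo;

    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* (delta * mul) >> 32, computed in two halves to stay within 64 bits */
    hi = (delta >> 32) * mul;
    lo = (delta & 0xffffffffu) * mul;
    return hi + (lo >> 32);
}

/*
 * Hypothetical helper: retry until a consistent snapshot is seen (the
 * version is even and unchanged), then return nanoseconds for `tsc`.
 */
static uint64_t pvclock_read_ns(const volatile struct pvclock_info *info,
                                uint64_t tsc)
{
    uint32_t version;
    uint64_t ns;

    do {
        version = info->version;
        atomic_thread_fence(memory_order_acquire);
        ns = info->system_time +
             scale_delta(tsc - info->tsc_timestamp,
                         info->tsc_to_system_mul, info->tsc_shift);
        atomic_thread_fence(memory_order_acquire);
    } while ((version & 1) || version != info->version);

    return ns;
}
```

The point of the sketch is that the fast path is only memory reads
plus one multiply, with no syscall, which is why the approach could in
principle get close to raw rdtsc cost.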

I realize this idea is half-baked, but would like
to invite other TSC/time experts to determine
if some or all of the idea might be used to
achieve a fully-baked solution.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: write_tsc in a PV domain?
  2009-08-28 15:30                         ` Alan Cox
  2009-08-28 17:49                           ` rdtsc: correctness vs performance on Xen (and KVM?) Dan Magenheimer
@ 2009-08-28 17:49                           ` Dan Magenheimer
  1 sibling, 0 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-28 17:49 UTC (permalink / raw)
  To: Alan Cox; +Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser

> > Jeremy's claim is that because some apps-that-use-
> > rdtsc risk bugginess, Xen can claim rdtsc for its own
> > use and effectively disallow all uses of rdtsc in any
> > app by breaking the existing, sometimes-useful semantics
> > of the instruction.
> 
> If Xen is hiding the tsc cpu feature from the
> kernel/apps it can.

True, it can, but Xen does not currently do so and there
has been no proposal for Xen to do so.  And given Xen's
policy of supporting all existing applications, I don't
expect that a proposal to hide the tsc cpu feature
will fly.

> One problem there is a lot of grotty code simply
> explodes without rdtsc working.

Indeed.  While it might be satisfying to legislate
against stupidity, it rarely works. :-)

> The alternative is to virtualise the TSC as some other 
> hypervisors do but that has other impacts.

Yes, this is where this whole discussion started.
Let me summarize, but start a separate thread to do so.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: write_tsc in a PV domain?
  2009-08-28 17:49                       ` Dan Magenheimer
@ 2009-08-28 23:01                         ` Jeremy Fitzhardinge
  2009-08-29 17:51                           ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-28 23:01 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

On 08/28/09 10:49, Dan Magenheimer wrote:
>> Apps are free to try and use the tsc in any way they
>> feel like, but it has never had any
>> GUARANTEED [djm's emphasis] properties.
>>     
> I think this is the key difference of opinion which
> must be resolved.  If what you say is true, your
> other positions make sense.  If it is false,
> they make much less sense.  (And unfortunately
> it is not a black and white issue.)
>
> There ARE guaranteed properties specified by
> the Intel SDM for any _single_ processor,
> namely that rdtsc is "guaranteed to return
> a monotonically increasing unique value whenever
> executed, except for 64-bit counter wraparound.
> Intel guarantees that the time-stamp counter
> will not wrap-around within 10 years after being
> reset."  Both uses of the word "guarantee"
> are quoted from the Intel SDM.
>   

Yes, but those are fairly weak guarantees.  It does not guarantee that
the tsc won't change rate arbitrarily, or stop outright between reads.

> What is NOT guaranteed, but is widely and
> incorrectly assumed to be implied and has
> gotten us into this mess, is that
> the same properties apply across multiple
> processors.

Yes, Linux offers even weaker guarantees than Intel.  Aside from the
processor migration issue, the tsc can jump arbitrarily as a result of
suspend/resume (ie, it can be non-monotonic).

>   And there are notable examples
> of systems where the properties do NOT apply.
> So it is true that an app that
> does not know conclusively that certain threads
> are running on certain processors cannot
> always safely use rdtsc to obtain the
> single-processor-guaranteed results.
>
> BUT some software systems (including VMware) do
> provide this guarantee across multiple processors.
> And recent families of both Intel and AMD
> multi-core have advanced to the point where
> the properties apply across all cores, so
> on the vast majority (but admittedly not all)
> of future physical systems, apps can and will
> use rdtsc and expect the properties to apply
> (whether guaranteed or not).
>   

Even very recent processors with "constant" tscs (ie, they don't change
rate with the core frequency) stop in certain power states.  Any
motherboard design which runs packages in different clock-domains will
lose tsc-sync between those packages, regardless of what's in the packages.

The "sane tsc" properties are primarily for the benefit of kernels, to
allow them to make better use of the tsc.  They will have enough
knowledge of the overall system architecture to know how and when the
tsc can be trusted.  Usermode apps can try to piggyback onto this if
they like, but they're in much more treacherous territory.  They can
never know what the underlying system design is, or whether it's really
safe to trust the tsc's sanity.  And without some explicit guarantees on
Linux's part, the tsc will still be non-monotonic over suspend/resume
(in all its many forms).

> So in your opinion, some systems are broken
> so Xen should assume all future systems are
> broken.  In my opinion, the problem is being
> fixed in hardware and has always been fixed
> in VMware, so Xen should look to the future
> not the past.
>
> Does that sound like a good summary of this
> disagreement?
>
>   

Not quite.

You are talking about three different cases:

   1. the reliability of the tsc in a PV guest in kernel mode
   2. the reliability of the tsc in a PV guest in user mode
   3. the reliability of the tsc in an HVM guest

I don't think 1. needs any attention.  The current scheme works fine.

The only option for 3 is to make a best-effort attempt at tsc quality, which
ranges from trapping every rdtsc to make them all give globally
monotonic results, to using the other VT/SVM features to apply an offset
from the raw tsc to a guest tsc, etc.  Either way the situation isn't
much different from running native (ie, apps will see basically the same
tsc behaviour as in the native case, to some degree of approximation).

So, there's case 2: pv usermode.  There are four classes of apps worth
considering here:

   1. Old apps which make unwarranted assumptions about the behaviour of
      the tsc.  They assume they're basically running on some equivalent
      of a P54, and so will get junk on any modernish system with SMP
      and/or power management.  If people are still using such apps, it
      probably means their performance isn't critically dependent on the
      tsc.
   2. More sophisticated apps which know the tsc has some limitations
      and try to mitigate them by filtering discontinuities, using
      rdtscp, etc.  They're best-effort, but they inherently lack enough
      information to do a complete job (they have to guess at where
      power transitions occurred, etc).
   3. New apps which know about modern processor capabilities, and
      attempt to rely on constant_tsc, forgoing all the best-effort
      filtering, etc.
   4. Apps which use gettimeofday() and/or clock_gettime() for all time
      measurement.  They're guaranteed to get consistent time results,
      perhaps at the cost of a syscall.  On systems which support it,
      they'll get vsyscall implementations which avoid the syscall while
      still using the best-possible clocksource.  Even if they don't, a
      syscall will outperform an emulated rdtsc.

Class 1 apps are just broken.  We can try to emulate a UP, no-PM
processor for them, and that's probably best done in an HVM domain. 
There's no need to go to extraordinary efforts for them because the
native hardware certainly won't.

Class 2 apps will work as well as ever in a Xen PV domain as-is.  If
they use rdtscp then they will be able to correlate the tsc to the
underlying pcpu and manage consistency that way.  If they pin threads to
VCPUs, then they may also require VCPUs to be pinned to PCPUs.  But
there's no need to make deep changes to Xen's tsc handling to
accommodate them.
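
As an illustration of the kind of best-effort filtering a class 2 app
might do, here is a minimal C sketch.  It assumes each sample arrives
as a (cpu, raw tsc) pair, for example from rdtscp on a system where
the OS programs TSC_AUX with a cpu identifier; the tsc_filter type and
tsc_filter_update function are hypothetical names, not taken from any
real application.  Intervals that span a cpu change or a backwards
jump are simply dropped, so the accumulated count never decreases but
may under-count.

```c
#include <stdint.h>

/* Hypothetical per-thread filter state; zero-initialise before use. */
struct tsc_filter {
    int      have_sample;  /* 0 until the first sample is recorded */
    uint32_t last_cpu;     /* cpu id of the previous sample */
    uint64_t last_raw;     /* raw tsc of the previous sample */
    uint64_t ticks;        /* accumulated ticks, never decreases */
};

/*
 * Feed one (cpu, raw tsc) sample and return the accumulated tick count.
 * Only intervals measured on a single cpu with a forward-moving tsc are
 * counted; anything else is treated as a discontinuity and skipped.
 */
static uint64_t tsc_filter_update(struct tsc_filter *f,
                                  uint32_t cpu, uint64_t raw)
{
    if (f->have_sample && cpu == f->last_cpu && raw >= f->last_raw)
        f->ticks += raw - f->last_raw;

    f->have_sample = 1;
    f->last_cpu = cpu;
    f->last_raw = raw;
    return f->ticks;
}
```

Such a filter cannot tell a stopped or rate-changed tsc from a
well-behaved one, which is the "inherently lack enough information"
limitation described for class 2 above.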

Class 3 apps will get a bit of a rude surprise in a PV Xen domain.  But
they're also new enough to use another mechanism to get time.  They're
new enough to "know" that gettimeofday can be very efficient, and should
not be going down the rathole of using rdtsc directly.  And unless
they're going to be restricted to a very narrow class of machines (for
example, not my relatively new Core2 laptop which stops the "constant"
tsc in deep sleep modes), they need to fall back to being a class 2 or 4
app anyway.
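
A class 3 app would typically decide at startup whether to trust the
tsc at all.  One hedged way to sketch that check in C, assuming the
app has already read the "flags" line from /proc/cpuinfo into a
string, is shown below.  The function names are hypothetical;
constant_tsc and nonstop_tsc are the Linux cpuinfo flag names, and
their presence is at best a necessary condition, not proof that the
tsc is safe (inside a PV guest they describe only the pcpu the vcpu
was running on when asked).

```c
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-delimited token. */
static int has_cpu_flag(const char *flags, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = flags;

    if (n == 0)
        return 0;

    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' ||
                   p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += n;  /* skip this partial match and keep looking */
    }
    return 0;
}

/*
 * Hypothetical policy: use rdtsc directly only if both flags are
 * advertised; otherwise fall back to class 2 or class 4 behaviour.
 */
static int tsc_looks_usable(const char *flags)
{
    return has_cpu_flag(flags, "constant_tsc") &&
           has_cpu_flag(flags, "nonstop_tsc");
}
```

Whatever the check returns, the fallback path still has to exist,
which is the argument made above for such apps ending up as class 2
or class 4 anyway.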

Class 4 apps are not well-served under Xen.  I think the vsyscall
mechanism will be disabled and they'll always end up doing a real
syscall.  However, I think it would be relatively easy to add a new
vgettimeofday implementation which directly uses the pvclock mechanism
from usermode (the same code would work equally well for Xen and KVM). 
There's no need to add a new usermode ABI to get quick, high-quality
time in usermode.  Performance-wise it would be more or less
indistinguishable from using a raw rdtsc, but it has the benefit of
getting full cooperation from the kernel and Xen, and can take into
account all tsc variations (if any).


So if you want to address these problems, it seems to me you'll get most
bang for the buck by fixing (v)gettimeofday to use pvclock, and
convincing app writers to trust in gettimeofday.

    J

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: write_tsc in a PV domain?
  2009-08-28 23:01                         ` Jeremy Fitzhardinge
@ 2009-08-29 17:51                           ` Dan Magenheimer
  2009-08-31 18:11                             ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-29 17:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

(Reordered with most important points first...)

> You are talking about three different cases:

I agree with your analysis for case 1 and case 3.

> So, there's case 2: pv usermode.  There are four
> classes of apps worth considering here:

I agree with your classification.  But a key point
is that VMware provides correctness for all
of these classes.  AND provides it at much better
performance than trap-and-emulate.  AND provides
correctness+performance regardless of the underlying
OS (e.g. even "old" OS's such as RHEL4 and RHEL5).
AND provides it regardless of whether the guest OS is
32-bit or 64-bit.  AND, by the way, provides it for
your case 1 (PV OS) and case 3 (HVM) as well.

> So if you want to address these problems, it seems to me 
> you'll get most
> bang for the buck by fixing (v)gettimeofday to use pvclock, and
> convincing app writers to trust in gettimeofday.

(Partially irrelevant point, but gettimeofday returns
microseconds which is not enough resolution for many
cases where rdtsc has been used in apps.  Clock_gettime
is the relevant API I think.)

If we can come up with a way for a kernel-loadable module
to handle some equivalent of clock_gettime so that
the most widely used shipping PV OS's can provide a
pvclock interface to apps, this might be workable.
If we tell app providers and customers: "You
can choose either performance OR correctness but
not both, unless you upgrade to a new OS (that is
not even available yet)", I don't think that will
be acceptable.

Any ideas on how pvclock might be provided through
a module that could be added to, eg. RHEL4 or RHEL5?

> > There ARE guaranteed properties specified by
> > the Intel SDM for any _single_ processor...
> 
> Yes, but those are fairly weak guarantees.  It does not guarantee that
> the tsc won't change rate arbitrarily, or stop outright between reads.

They are weak guarantees only if one uses rdtsc
to accurately track wallclock time.  They are
perfectly useful guarantees if one simply wants to
either timestamp data to record ordering (e.g.
for journaling or transaction replay), or
approximate the passing of time to provide
approximate execution metrics (e.g. for
performance tools).

> > What is NOT guaranteed, but is widely and
> > incorrectly assumed to be implied and has
> > gotten us into this mess, is that
> > the same properties apply across multiple
> > processors.
> 
> Yes, Linux offers even weaker guarantees than Intel.  Aside from the
> processor migration issue, the tsc can jump arbitrarily as a result of
> suspend/resume (ie, it can be non-monotonic).

Please explain.  Suspend/resume is an S state isn't
it?  Is it possible to suspend/resume one processor
in an SMP system and not another processor?  I think
not.  Your point is valid for C-states and P-states
but those are what Intel/AMD have fixed in the most
recent families of multi-core processors.

So I don't see how (in the most recent families of
processors) tsc can be non-monotonic.

> Even very recent processors with "constant" tscs (ie, they 
> don't change
> rate with the core frequency) stop in certain power states.

For the most recent families of processors, the TSC
continues to run at a fixed rate even for all the
P-states and C-states.  We should confirm this with
Intel and AMD.

> Any motherboard design which runs packages in different
> clock-domains will lose tsc-sync between those packages,
> regardless of what's in the packages.

I'm told this is not true for recent multi-socket systems
where the sockets are on the same motherboard.  And at
least one large vendor that ships a new one-socket-per-
motherboard NUMA-ish system claims that it is not even
true when the sockets are on different motherboards.

Dan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: write_tsc in a PV domain?
  2009-08-29 17:51                           ` Dan Magenheimer
@ 2009-08-31 18:11                             ` Dan Magenheimer
  2009-08-31 19:06                               ` Keir Fraser
  2009-08-31 19:18                               ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-31 18:11 UTC (permalink / raw)
  To: dan.magenheimer, Jeremy Fitzhardinge
  Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

I'm experimenting with clock_gettime(), gettimeofday(),
and rdtsc with a 2.6.30 64-bit pvguest.  I have tried both
with kernel.vsyscall64 equal to 0 and 1 (but haven't seen
any significant difference between the two).  I have
confirmed from sysfs that clocksource=xen.

I have yet to get a measurement of either syscall that
is better than 2.5x WORSE than emulating rdtsc. On
my dual-core Conroe (Intel E6850) with 64-bit Xen and
32-bit dom0, I get approximately:

rdtsc native: 22ns
softtsc (rdtsc emulated): 360ns
gettime syscall w/softtsc: 1400ns
gettime syscall native tsc: 980ns
gettimeofday w/softtsc: 1750ns
gettimeofday native tsc: 900ns

I'm hoping this is either a bug in the 2.6.30 xen
pvclock implementation or in my measurement methodology,
so would welcome others measuring this.
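
For anyone wanting to repeat the measurement, a loop of roughly the
following shape is one way to estimate the per-call cost of
clock_gettime.  This is an assumed methodology written for
illustration, not the harness that produced the numbers above; the
function names are hypothetical, and the result includes loop overhead
and is averaged over all iterations.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost in nanoseconds of one clock_gettime call. */
static double avg_clock_gettime_ns(unsigned iterations)
{
    struct timespec scratch;
    uint64_t start, end;
    unsigned i;

    if (iterations == 0)
        return 0.0;

    start = monotonic_ns();
    for (i = 0; i < iterations; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    end = monotonic_ns();

    return (double)(end - start) / iterations;
}
```

Calling avg_clock_gettime_ns with a large iteration count (a million,
say), once with rdtsc emulation on and once with it off, would give
figures comparable in kind to the "gettime syscall" rows above.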

A couple of other minor observations:
1) The syscalls seem to be somewhat slower when usermode
   rdtscs are being emulated, by approximately the cost
   of emulating an rdtsc.  I suppose this makes
   sense since vsyscalls are executed in userland
   and since vgettimeofday does a rdtsc.  However it
   complicates strategy if emulating rdtsc is the default.
2) The syscall clock_getres() does not seem to reflect
   the fact that 

> -----Original Message-----
> From: Dan Magenheimer 
> Sent: Saturday, August 29, 2009 11:52 AM
> To: Jeremy Fitzhardinge
> Cc: Alan Cox; Xen-Devel (E-mail); Keir Fraser
> Subject: RE: [Xen-devel] write_tsc in a PV domain?
> 
> 
> (Reordered with most important points first...)
> 
> > You are talking about three different cases:
> 
> I agree with your analysis for case 1 and case 3.
> 
> > So, there's case 2: pv usermode.  There are four
> > classes of apps worth considering here:
> 
> I agree with your classification.  But a key point
> is that VMware provides correctness for all
> of these classes.  AND provides it at much better
> performance than trap-and-emulate.  AND provides
> correctness+performance regardless of the underlying
> OS (e.g. even "old" OS's such as RHEL4 and RHEL5).
> AND provides it regardless of whether the guest OS is
> 32-bit or 64-bit.  AND, by the way, provides it for
> your case 1 (PV OS) and case 3 (HVM) as well.
> 
> > So if you want to address these problems, it seems to me 
> > you'll get most
> > bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
> 
> (Partially irrelevant point, but gettimeofday returns
> microseconds which is not enough resolution for many
> cases where rdtsc has been used in apps.  Clock_gettime
> is the relevant API I think.)
> 
> If we can come up with a way for a kernel-loadable module
> to handle some equivalent of clock_gettime so that
> the most widely used shipping PV OS's can provide a
> pvclock interface to apps, this might be workable.
> If we tell app providers and customers: "You
> can choose either performance OR correctness but
> not both, unless you upgrade to a new OS (that is
> not even available yet)", I don't think that will
> be acceptable.
> 
> Any ideas on how pvclock might be provided through
> a module that could be added to, eg. RHEL4 or RHEL5?
> 
> > > There ARE guaranteed properties specified by
> > > the Intel SDM for any _single_ processor...
> > 
> > Yes, but those are fairly weak guarantees.  It does not guarantee
> > that the tsc won't change rate arbitrarily, or stop outright
> > between reads.
> 
> They are weak guarantees only if one uses rdtsc
> to accurately track wallclock time.  They are
> perfectly useful guarantees if one simply wants to
> either timestamp data to record ordering (e.g.
> for journaling or transaction replay), or
> approximate the passing of time to provide
> approximate execution metrics (e.g. for
> performance tools).
> 
> > > What is NOT guaranteed, but is widely and
> > > incorrectly assumed to be implied and has
> > > gotten us into this mess, is that
> > > the same properties applies across multiple
> > > processors.
> > 
> > Yes, Linux offers even weaker guarantees than Intel.  Aside from the
> > processor migration issue, the tsc can jump arbitrarily as a result
> > of suspend/resume (ie, it can be non-monotonic).
> 
> Please explain.  Suspend/resume is an S state isn't
> it?  Is it possible to suspend/resume one processor
> in an SMP system and not another processor?  I think
> not.  Your point is valid for C-states and P-states
> but those are what Intel/AMD has fixed in the most
> recent families of multi-core processors.
> 
> So I don't see how (in the most recent families of
> processors) tsc can be non-monotonic.
> 
> > Even very recent processors with "constant" tscs (ie, they don't
> > change rate with the core frequency) stop in certain power states.
> 
> For the most recent families of processors, the TSC
> continues to run at a fixed rate even for all the
> P-states and C-states.  We should confirm this with
> Intel and AMD.
> 
> > Any motherboard design which runs packages in different
> > clock-domains will lose tsc-sync between those packages,
> > regardless of what's in the packages.
> 
> I'm told this is not true for recent multi-socket systems
> where the sockets are on the same motherboard.  And at
> least one large vendor that ships a new one-socket-per-
> motherboard NUMA-ish system claims that it is not even
> true when the sockets are on different motherboards.
> 
> Dan
> 
> (no further replies below, remaining original text retained
> for context)
> 
> > You are talking about three different cases:
> > 
> >    1. the reliability of the tsc in a PV guest in kernel mode
> >    2. the reliability of the tsc in a PV guest in user mode
> >    3. the reliability of the tsc in an HVM guest
> > 
> > I don't think 1. needs any attention.  The current scheme works fine.
> > 
> > The only option for 3 is to try to make a best-effort of tsc quality,
> > which ranges from trapping every rdtsc to make them all give globally
> > monotonic results, or use the other VT/SVM features to apply an offset
> > from the raw tsc to a guest tsc, etc.  Either way the situation isn't
> > much different from running native (ie, apps will see basically the
> > same tsc behaviour as in the native case, to some degree of
> > approximation).
> > 
> > So, there's case 2: pv usermode.  There are four classes of apps worth
> > considering here:
> > 
> >    1. Old apps which make unwarranted assumptions about the behaviour
> >       of the tsc.  They assume they're basically running on some
> >       equivalent of a P54, and so will get junk on any modernish
> >       system with SMP and/or power management.  If people are still
> >       using such apps, it probably means their performance isn't
> >       critically dependent on the tsc.
> >    2. More sophisticated apps which know the tsc has some limitations
> >       and try to mitigate them by filtering discontinuities, using
> >       rdtscp, etc.  They're best-effort, but they inherently lack
> >       enough information to do a complete job (they have to guess at
> >       where power transitions occurred, etc).
> >    3. New apps which know about modern processor capabilities, and
> >       attempt to rely on constant_tsc forgoing all the best-effort
> >       filtering, etc
> >    4. Apps which use gettimeofday() and/or clock_gettime() for all
> >       time measurement.  They're guaranteed to get consistent time
> >       results, perhaps at the cost of a syscall.  On systems which
> >       support it, they'll get vsyscall implementations which avoid
> >       the syscall while still using the best-possible clocksource.
> >       Even if they don't, a syscall will outperform an emulated rdtsc.
> > 
> > Class 1 apps are just broken.  We can try to emulate a UP, no-PM
> > processor for them, and that's probably best done in an HVM domain. 
> > There's no need to go to extraordinary efforts for them because the
> > native hardware certainly won't.
> > 
> > Class 2 apps will work as well as ever in a Xen PV domain as-is.  If
> > they use rdtscp then they will be able to correlate the tsc to the
> > underlying pcpu and manage consistency that way.  If they pin threads
> > to VCPUs, then they may also require VCPUs to be pinned to PCPUs.  But
> > there's no need to make deep changes to Xen's tsc handling to
> > accommodate them.
> > 
> > Class 3 apps will get a bit of a rude surprise in a PV Xen domain.
> > But they're also new enough to use another mechanism to get time.
> > They're new enough to "know" that gettimeofday can be very efficient,
> > and should not be going down the rathole of using rdtsc directly.  And
> > unless they're going to be restricted to a very narrow class of
> > machines (for example, not my relatively new Core2 laptop which stops
> > the "constant" tsc in deep sleep modes), they need to fall back to
> > being a class 2 or 4 app anyway.
> > 
> > Class 4 apps are not well-served under Xen.  I think the vsyscall
> > mechanism will be disabled and they'll always end up doing a real
> > syscall.  However, I think it would be relatively easy to add a new
> > vgettimeofday implementation which directly uses the pvclock mechanism
> > from usermode (the same code would work equally well for Xen and KVM).
> > There's no need to add a new usermode ABI to get quick, high-quality
> > time in usermode.  Performance-wise it would be more or less
> > indistinguishable from using a raw rdtsc, but it has the benefit of
> > getting full cooperation from the kernel and Xen, and can take into
> > account all tsc variations (if any).
> > 
> > 
> > So if you want to address these problems, it seems to me you'll get
> > most bang for the buck by fixing (v)gettimeofday to use pvclock, and
> > convincing app writers to trust in gettimeofday.
> > 
> >     J
> >

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: write_tsc in a PV domain?
  2009-08-31 18:11                             ` Dan Magenheimer
@ 2009-08-31 19:06                               ` Keir Fraser
  2009-08-31 21:06                                 ` Dan Magenheimer
  2009-08-31 19:18                               ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-08-31 19:06 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

On 31/08/2009 19:11, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I have yet to get a measurement of either syscall that
> is better than 2.5x WORSE than emulating rdtsc. On
> my dual-core Conroe (Intel E6850) with 64-bit Xen and
> 32-bit dom0, I get approximately:
> 
> rdtsc native: 22ns
> softtsc (rdtsc emulated): 360ns

Trap-and-emulate in 360ns seems astoundingly good. Perhaps too good to be
true?

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: write_tsc in a PV domain?
  2009-08-31 18:11                             ` Dan Magenheimer
  2009-08-31 19:06                               ` Keir Fraser
@ 2009-08-31 19:18                               ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-08-31 19:18 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

On 08/31/09 11:11, Dan Magenheimer wrote:
> I'm experimenting with clock_gettime(), gettimeofday(),
> and rdtsc with a 2.6.30 64-bit pvguest.  I have tried both
> with kernel.vsyscall64 equal to 0 and 1 (but haven't seen
> any significant difference between the two).  I have
> confirmed from sysfs that clocksource=xen
>   

Yeah, as I said, I wouldn't expect vsyscall to work under Xen at the
moment; the Xen clocksource will disable it.  Clocksources can implement
a "vread" method for use from a vsyscall, but from a quick look it
didn't appear we could use it as-is (because the pvclock info isn't
mapped into userspace, and the current vsyscall code assumes a single
set of parameters rather than percpu).

    J

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: write_tsc in a PV domain?
  2009-08-31 19:06                               ` Keir Fraser
@ 2009-08-31 21:06                                 ` Dan Magenheimer
  2009-09-01  7:16                                   ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-31 21:06 UTC (permalink / raw)
  To: Keir Fraser, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

> > I have yet to get a measurement of either syscall that
> > is better than 2.5x WORSE than emulating rdtsc. On
> > my dual-core Conroe (Intel E6850) with 64-bit Xen and
> > 32-bit dom0, I get approximately:
> > 
> > rdtsc native: 22ns
> > softtsc (rdtsc emulated): 360ns
> 
> Trap-and-emulate in 360ns seems astoundingly good. Perhaps 
> too good to be true?

I measured with the patch you checked in as 20128.

I tried a couple of tests, first changing pv_soft_rdtsc
to always return a value with the 4 LSB of the return
value cleared, second with the 4 LSB of the return value
set.  Both were properly reflected by a userland rdtsc.
So it looks like the correct emulation code is executing.

And get_s_time() always returns nanoseconds, correct?
So consecutive emulated rdtsc's should return values
that differ by the amount of nsec necessary to do
the emulation, right?  I ran 2 million rdtsc's in
a loop and took the average so, ignoring loop
and load/store overhead, the 360ns appears to be
an accurate measurement.
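
The averaging described above can be sketched roughly as follows.  This
is only an illustration of the method, not the test program actually
used; the names average_read_cost, demo_read_counter and read_tsc_x86
are made up for the sketch.  The counter is passed in as a callback so
the same loop works with a real rdtsc or with a stand-in.

```c
#include <stdint.h>

/* Average difference between consecutive counter reads over n reads.
 * Loop and load/store overhead is included in the result. */
uint64_t average_read_cost(uint64_t (*read_counter)(void), uint64_t n)
{
    uint64_t first = read_counter();
    uint64_t last = first;

    for (uint64_t i = 0; i < n; i++)
        last = read_counter();

    return n ? (last - first) / n : 0;
}

/* Hypothetical stand-in counter that advances 360 units per read. */
uint64_t demo_counter;

uint64_t demo_read_counter(void)
{
    demo_counter += 360;
    return demo_counter;
}

#if defined(__x86_64__) || defined(__i386__)
/* Real counter on x86 (GCC/Clang inline asm): rdtsc returns EDX:EAX. */
uint64_t read_tsc_x86(void)
{
    uint32_t lo, hi;

    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#endif
```

With emulation returning nanoseconds, calling
average_read_cost(read_tsc_x86, 2000000) would give the per-read cost
directly in ns; with a native rdtsc the same call gives cycles.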

A thousand cycles to trap, decode, call get_s_time,
and return seems astoundingly good?  Probably it's
faster than a vmexit because there's so much less state
to save.  But still it's 15x slower than a raw rdtsc.
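
As a rough check of those figures, assuming the E6850's nominal
3.0 GHz clock (the clock rate is an assumption here, not something
measured in this thread):

```latex
\begin{align*}
  360\,\text{ns} \times 3.0\,\tfrac{\text{cycles}}{\text{ns}} &= 1080\ \text{cycles}
    && \text{(emulated rdtsc)} \\
  22\,\text{ns} \times 3.0\,\tfrac{\text{cycles}}{\text{ns}} &= 66\ \text{cycles}
    && \text{(native rdtsc)} \\
  \frac{360\,\text{ns}}{22\,\text{ns}} &\approx 16.4
    && \text{(slowdown factor)}
\end{align*}
```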

If you have ideas on how to test the measurement further,
I'd be happy to give them a spin.

Dan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-08-28 17:49                           ` rdtsc: correctness vs performance on Xen (and KVM?) Dan Magenheimer
@ 2009-08-31 23:52                             ` Dan Magenheimer
  2009-09-01  0:22                               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-08-31 23:52 UTC (permalink / raw)
  To: dan.magenheimer, Xen-Devel (E-mail)
  Cc: Jeremy Fitzhardinge, Keir Fraser, Alan Cox

> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.

OK, here's a slightly refined proposal.  To
reiterate, the problem is that Xen's current
mechanism for handling the rdtsc instruction
may silently provide incorrect results while
alternative mechanisms are too slow (vs VMware
which is both fast and correct).  My goal is to
provide a paravirtualized tsc mechanism for apps
running on Xen that is reliably correct,
is not dependent on a particular OS or
processor family, is approximately as fast
as rdtsc (or at least much faster than emulated
rdtsc), provides adequate (e.g. nanosecond)
resolution, does not require recompilation to
work both on Xen and bare metal, and works properly
across: vcpu-to-pcpu rescheduling even on NUMA
machines; system sleep/hibernation; and 
save/restore/migration between machines with
dissimilar clock rates.  Implementation requires
changes in Xen and "the app" but no OS changes
thus making it still viable on legacy OS's
and possibly(?) HVM domains.  Note that
only apps that need to sample time on the
order of >5-100K/core/second would use this;
for other apps, rdtsc emulation overhead
is probably negligible (<0.2%).
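
The overhead estimate can be reproduced with a small calculation.  A
minimal sketch, assuming the roughly 360ns per emulated rdtsc measured
earlier in this thread; the function name clock_overhead_ppm is made up
for the example.

```c
#include <stdint.h>

/* Parts per million of one core's time spent reading the clock.
 * reads/s * ns/read gives ns of overhead per second of wall time;
 * one second is 1e9 ns, so dividing by 1000 yields ppm. */
uint64_t clock_overhead_ppm(uint64_t reads_per_sec, uint64_t ns_per_read)
{
    return reads_per_sec * ns_per_read / 1000;
}
```

At 5K reads/core/second and 360ns per read this gives 1800 ppm
(0.18%); at 100K reads/core/second it gives 36000 ppm (3.6%).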

0)  Xen implements rdtsc emulation by default
1)  Guest OS is launched with pvtsc=1 in vm.cfg
2)  App running on guest OS sets up a SIGILL handler
3)  App executes a special rdmsr instruction or
    hypercall.
4a) If SIGILL results, not running on Xen at all,
    or on old Xen; app uses rdtsc at own risk. Done.
4b) Else, rdmsr/hypercall returns virtual address of
    special pvclock page ("pvclock_va").
5)  App executes another special rdmsr instruction/
    hypercall to disable rdtsc emulation.  This
    affects ALL execution for all processes in this VM.
6)  Xen maintains mapping of pvclock_va to a
    different physical page for each processor
    and transparently handles TLB misses for
    pvclock_va
7)  App uses (unemulated) rdtsc and applies
    pvclock algorithm (using values in memory
    at pvclock_va) resulting in pvtsc, which
    is nanoseconds since VM start.  App can
    further apply local algorithms to enforce
    monotonicity or frequency scaling as desired.
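
Step 7 might look something like the sketch below.  The structure
layout follows the existing pvclock ABI; everything else (the names
pvtsc_read, scale_delta and the demo_* stand-ins, and passing the TSC
reader as a callback) is an assumption for illustration only.  It also
assumes GCC or Clang for __sync_synchronize(), and it assumes the app
has somehow been given the entry for the pcpu it is currently on.

```c
#include <stdint.h>

/* Per-vcpu time record, laid out as in the pvclock ABI. */
struct pvclock_vcpu_time_info {
    uint32_t version;           /* odd while the hypervisor is updating */
    uint32_t pad0;
    uint64_t tsc_timestamp;     /* TSC value when system_time was taken */
    uint64_t system_time;       /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;         /* shift applied to the delta first */
    uint8_t  flags;
    uint8_t  pad[2];
};

/* Convert a TSC delta to ns: shift, then multiply by mul / 2^32.
 * The product is split in two so no 128-bit arithmetic is needed;
 * very large deltas could still overflow the high part. */
uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;

    return ((delta >> 32) * mul) + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Read nanoseconds, retrying if the record changed underneath us. */
uint64_t pvtsc_read(const volatile struct pvclock_vcpu_time_info *ti,
                    uint64_t (*read_tsc)(void))
{
    uint32_t v;
    uint64_t ns;

    do {
        v = ti->version;
        __sync_synchronize();
        ns = ti->system_time +
             scale_delta(read_tsc() - ti->tsc_timestamp,
                         ti->tsc_to_system_mul, ti->tsc_shift);
        __sync_synchronize();
    } while ((v & 1) || v != ti->version);

    return ns;
}

/* Hypothetical stand-in for rdtsc so the sketch runs anywhere. */
uint64_t demo_tsc_value;

uint64_t demo_read_tsc(void)
{
    return demo_tsc_value;
}
```

The version check only protects against a concurrent update of one
record; it does not by itself tell the app that it has moved to a
different pcpu between picking the record and executing rdtsc.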

Comments appreciated.  I realize that this is hacky
and ugly... better alternatives gladly solicited.

Thanks,
Dan

P.S. While it would be nice if we could just tell
apps to use a fast vgettimeofday equivalent, this
does not exist today and, even if it did, would not
be widely available for years in the kernel running under
most enterprise app deployments (and, even then,
only on 64-bit Linux.)

> -----Original Message-----
> From: Dan Magenheimer 
> Sent: Friday, August 28, 2009 11:50 AM
> To: Xen-Devel (E-mail)
> Cc: Jeremy Fitzhardinge; Keir Fraser; Alan Cox
> Subject: rdtsc: correctness vs performance on Xen (and KVM?)
> 
> 
> To summarize:
> 
> Xen and KVM currently allow rdtsc to be executed
> directly by userland.  As a result, apps that
> use rdtsc smartly and effectively on (some) physical
> machines may break badly in Xen or KVM because of
> the disassociation of physical and virtual cpus.
> (Readers not familiar with why rdtsc is a problem,
> can read e.g. http://en.wikipedia.org/wiki/Rdtsc)
> 
> VMware always emulates rdtsc, both for kernel and
> userland rdtsc's. (I don't know what HyperV does.)
> 
> Xen currently has a boot option to always emulate
> rdtsc in HVM guests and just added code such that
> the same boot option will always emulate rdtsc for
> userland-only in PVM guests.  There is some agreement
> in the Xen community that rdtsc emulation should
> always be the default though the default is currently
> off.  KVM is having a similar discussion and, I'm
> told, has also come to the conclusion that emulating
> rdtsc is a necessary evil.
> 
> The problem is that emulating rdtsc is slow.  On
> my dual-core Conroe, rdtsc is about 72 cycles and
> emulating rdtsc (returning a fixed frequency 1GHz
> Xen monotonic system time) is over 15x slower.
> This is a big hit for apps that do tens to hundreds
> of thousands of rdtsc's per processor per second.
> (And yes these apps are more common than one
> might think.)
> 
> VMware has the advantage of binary translation;
> rdtsc can be translated to return a "conforming"
> value in ~200 cycles (on an older processor so
> probably faster if you are comparing against my
> dual-core Conroe numbers above).  This value
> is "stale" (not linear with wallclock time).
> For VMs that need rdtsc to more accurately reflect
> wallclock time, full emulation can be optionally
> enabled for a VM.
> 
> I'm searching for alternatives that provide the
> correctness of emulation, but better performance
> than emulation.  Jeremy points out that the
> pvclock mechanism in upstream Linux works well,
> but the pvclock data is currently only exposed
> to kernel... and exposing it to userland still
> requires apps-using-rdtsc to be rewritten.
> But Jeremy claims that all apps-that-use-rdtsc
> MUST be rewritten because using rdtsc is unsafe,
> and that they should be rewritten to use
> gettimeofday (or actually vgettimeofday).
> But on older OS's (including the vast majority
> of installed units) and machines where tsc is
> "unsafe", gettimeofday can be MUCH slower than
> emulating rdtsc.  So telling app writers to
> convert all uses of rdtsc to gettimeofday is
> not an acceptable solution for these apps in
> the shortterm.
> 
> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.
> The underlying implementation of this API would
> utilize Linux only to "register" the mechanism,
> preferably via a module so that it, like
> disk and network frontends, could easily be
> bolted on to shipping OS's.  Individual uses
> of "pvclock_read" would need no syscall... like
> the kernel pvclock mechanism, they need only
> access memory to get the necessary scaling
> and offset data.  Once instantiated, rdtsc
> is executed directly by the app as part of the
> pvclock protocol.  If never registered,
> rdtsc would always be trapped and emulated.
> 
> I realize this idea is half-baked, but would like
> to invite other TSC/time experts to determine
> if some or all of the idea might be used to
> achieve a fully-baked solution.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-08-31 23:52                             ` Dan Magenheimer
@ 2009-09-01  0:22                               ` Jeremy Fitzhardinge
  2009-09-01 13:54                                 ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-01  0:22 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

On 08/31/09 16:52, Dan Magenheimer wrote:
> work both on Xen and bare metal, and works properly
> across: vcpu-to-pcpu rescheduling even on NUMA
> machines; system sleep/hibernation; and 
> save/restore/migration between machines with
> dissimilar clock rates. 

But it will only do this when running under Xen.  If running on bare
metal, there will be nothing providing the correction info to the app,
and it will be no better than using raw rdtsc with all its limitations. 
In practice this means that the app will have to have some other code
path anyway.

>  Implementation requires
> changes in Xen and "the app" but no OS changes
> thus making it still viable on legacy OS's
> and possibly(?) HVM domains.  Note that
> only apps that need to sample time on the
> order of >5-100K/core/second would use this;
> for other apps, rdtsc emulation overhead
> is probably negligible (<0.2%).
>
> 0)  Xen implements rdtsc emulation by default
> 1)  Guest OS is launched with pvtsc=1 in vm.cfg
> 2)  App running on guest OS sets up a SIGILL handler
> 3)  App executes a special rdmsr instruction or
>     hypercall.
>   

No way to do direct hypercalls from usermode, so it would need to be an
illegal instruction (like cpuid).
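
For reference, the SIGILL-based detection in steps 2-4a could be
sketched along these lines.  The actual probe instruction is not
defined anywhere in this thread, so the sketch takes it as a callback,
and the demo callback just raises SIGILL to stand in for "instruction
not recognised".  All names here are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf probe_env;

static void on_sigill(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);   /* unwind out of the faulting probe */
}

/* Returns 1 if probe() ran to completion, 0 if it raised SIGILL. */
int probe_survives(void (*probe)(void))
{
    struct sigaction sa, old;
    int ok;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return 0;

    if (sigsetjmp(probe_env, 1) == 0) {
        probe();
        ok = 1;
    } else {
        ok = 0;                 /* arrived here via siglongjmp */
    }

    sigaction(SIGILL, &old, NULL);  /* restore the previous handler */
    return ok;
}

/* Hypothetical probes: one that "works", one that faults. */
void demo_probe_supported(void)
{
}

void demo_probe_unsupported(void)
{
    raise(SIGILL);
}
```

An app would fall back to raw rdtsc (at its own risk) when
probe_survives() returns 0.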

But really it should be a system-wide kernel setting, set via sysctl or
something.

> 4a) If SIGILL results, not running on Xen at all,
>     or on old Xen; app uses rdtsc at own risk. Done.
> 4b) Else, rdmsr/hypercall returns virtual address of
>     special pvclock page ("pvclock_va").
>   
This can't be done without changing the kernel; Xen can't just start
sticking stuff into usermode mappings (how does Xen even know where a
given OS's usermode is?).

And again, usermode can't do hypercalls and I don't think we should
start making fake rdmsrs start working in usermode.

> 5)  App executes another special rdmsr instruction/
>     hypercall to disable rdtsc emulation.  This
>     affects ALL execution for all processes in this VM.
>   

Once enabled, it should just stay enabled.  System-wide is very coarse
anyway (since there's no guarantee that all apps will use the mechanism).

> 6)  Xen maintains mapping of pvclock_va to a
>     different physical page for each processor
>     and transparently handles TLB misses for
>     pvclock_va
>   

If you mean that a given VA has a per-cpu mapping, it requires percpu
pagetables.  That's not possible in Linux with PV pagetables (since two
tasks/threads on different cpus sharing the same mm will use the same
pagetable).

> 7)  App uses (unemulated) rdtsc and applies
>     pvclock algorithm (using values in memory
>     at pvclock_va) resulting in pvtsc, which
>     is nanoseconds since VM start.  App can
>     further apply local algorithms to enforce
>     monotonicity or frequency scaling as desired.
>
> Comments appreciated.  I realize that this is hacky
> and ugly... better alternatives gladly solicited.
>   

In general even Linux's specialised APIs are entirely unused (sendfile,
vmsplice, etc).  Something as esoteric as this will be pretty much unused.

This can be entirely done within the vsyscall mechanism without any app
changes.  There's no reason not to.

> P.S. While it would be nice if we could just tell
> apps to use a fast vgettimeofday equivalent, this
> does not exist today and, even if it did, would not
> be widely available for years in the kernel running under
> most enterprise app deployments (and, even then,
> only on 64-bit Linux.)
>   

These rationales are very unconvincing:

Making vsyscall work on 32bit is just a matter of doing it; apparently
nobody has put the effort into it, but there's no fundamental reason why
it wouldn't work.  Besides, who runs enterprise apps on 32-bit these
days?  Anything requiring even moderate amounts of memory is better run
on 64-bit.

Your mechanism will require kernel changes anyway, so there's no getting
around that.

Once vsyscall does Xen/KVM properly, then every app will automatically
do the right thing without modification.  There's no need for
specialized APIs that nobody will end up using anyway.  It only makes
sense to go to this kind of effort if it ends up making a plain "rdtsc"
have the properties you want it to have.

    J

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: write_tsc in a PV domain?
  2009-08-31 21:06                                 ` Dan Magenheimer
@ 2009-09-01  7:16                                   ` Keir Fraser
  0 siblings, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-09-01  7:16 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

On 31/08/2009 22:06, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> A thousand cycles to trap, decode, call get_s_time,
> and return seems astoundingly good?  Probably it's
> faster than a vmexit because there's so much less state
> to save.  But still it's 15x slower than a raw rdtsc.

A kernel trap used to take about a microsecond. Maybe it has got faster on
new processors.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01  0:22                               ` Jeremy Fitzhardinge
@ 2009-09-01 13:54                                 ` Dan Magenheimer
  2009-09-01 14:34                                   ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 13:54 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Keir Fraser, Alan Cox

Hi Jeremy --

Thanks for the feedback!

> Making vsyscall work...

While I highly respect your opinion, and while vsyscall
may be a fine choice in the future, it just doesn't
solve the problem today and won't solve it ever for
currently shipping PV OS's.  If you can figure out a
way to allow vsyscall to be installed as a module and
still achieve its performance, it
might be a possible solution, but otherwise we have
to go around the OS to solve this problem.

The rdtsc instruction will be fully emulated by default
in Xen 4.0, and before that release I need to find
a fast alternative for those apps that are dependent
on BOTH its correct functionality AND high performance.

> > work both on Xen and bare metal, and works properly
> > across: vcpu-to-pcpu rescheduling even on NUMA
> > machines; system sleep/hibernation; and 
> > save/restore/migration between machines with
> > dissimilar clock rates. 
> 
> But it will only do this when running under Xen.  If running on bare
> metal, there will be nothing providing the correction info to the app,
> and it will be no better than using raw rdtsc with all its limitations.
> In practice this means that the app will have to have some other code
> path anyway.

Yes, that's true.  I'm not trying to legislate whether
an app can use rdtsc or not on a physical machine, just
trying to provide the same guarantees for a rdtsc executed
in a virtual environment as already provided for a
physical environment, but without significant performance
cost.

> > 3)  App executes a special rdmsr instruction or
> >     hypercall.
> 
> No way to do direct hypercalls from usermode, so it would 
> need to be an illegal instruction (like cpuid).
> ...and I don't think we should
> start making fake rdmsrs start working in usermode.

I'm told (by Keir) that it might be possible to allow certain
hypercalls to be executed from userland.  I haven't
investigated yet.  But a "fake rdmsr" might be a better
answer anyway; enlightened Windows and HyperV already use
a fake rdmsr, correct?

But I'm not keen on it either and am open to alternatives.

> But really it should be a system-wide kernel setting, set via 
> sysctl or something.

I'm not sure what you are suggesting here.

> > 4a) If SIGILL results, not running on Xen at all,
> >     or on old Xen; app uses rdtsc at own risk. Done.
> > 4b) Else, rdmsr/hypercall returns virtual address of
> >     special pvclock page ("pvclock_va").
> >   
> This can't be done without changing the kernel; Xen can't just start
> sticking stuff into usermode mappings (how does Xen even know where a
> given OS's usermode is?).

It doesn't have to be a usermode mapping, it just needs
to be a "magic" address; it can (for example) be in the
virtual address space Xen has reserved for itself.

> > 5)  App executes another special rdmsr instruction/
> >     hypercall to disable rdtsc emulation.  This
> >     affects ALL execution for all processes in this VM.
> 
> Once enabled, it should just stay enabled.  System-wide is very coarse
> anyway (since there's no guarantee that all apps will use the 
> mechanism).

Yes this is an ugly potential issue.  Fortunately, many
enterprise class apps essentially are the machine; and this
may be even more true in a virtualized world.

Again, I'm not keen on this either but I don't see an
alternative.

> > 6)  Xen maintains mapping of pvclock_va to a
> >     different physical page for each processor
> >     and transparently handles TLB misses for
> >     pvclock_va
> 
> If you mean that a given VA has a per-cpu mapping, it requires percpu
> pagetables.  That's not possible in Linux with PV pagetables (since
> two tasks/threads on different cpus sharing the same mm will use the
> same pagetable).

What the OS can do is completely irrelevant.  The mapping
is handled entirely by Xen so the OS will never even
see a page fault for this address.

Note also that one-page-per-cpu is not needed.  The page
is readonly and there is no sensitive information in
a pvclock data structure so many per-cpu-pvclock-structs
could be on the same page.
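
To illustrate the packing: with the existing 32-byte pvclock record, a
4K page has room for 128 per-cpu entries.  The sketch below assumes a
simple array indexed by cpu number; that layout and the names
PVCLOCK_PAGE_SIZE, PVCLOCK_ENTRIES_PER_PAGE and pvclock_entry are
assumptions for the example, not an existing interface.

```c
#include <stddef.h>
#include <stdint.h>

#define PVCLOCK_PAGE_SIZE 4096u

/* Per-vcpu time record, laid out as in the pvclock ABI. */
struct pvclock_vcpu_time_info {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
};

_Static_assert(sizeof(struct pvclock_vcpu_time_info) == 32,
               "pvclock record is expected to be 32 bytes");

enum {
    PVCLOCK_ENTRIES_PER_PAGE =
        PVCLOCK_PAGE_SIZE / sizeof(struct pvclock_vcpu_time_info)
};

/* Return the record for a given cpu within the read-only page,
 * or NULL if the cpu number does not fit in one page. */
const struct pvclock_vcpu_time_info *pvclock_entry(const void *page,
                                                   unsigned int cpu)
{
    if (cpu >= (unsigned int)PVCLOCK_ENTRIES_PER_PAGE)
        return NULL;

    return (const struct pvclock_vcpu_time_info *)page + cpu;
}
```

The app would still need a way to learn which cpu it is running on at
the moment it executes rdtsc, which this sketch does not address.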

> In general even Linux's specialised APIs are entirely unused
> (sendfile, vmsplice, etc).  Something as esoteric as this will be
> pretty much unused.

If apps are happy with the performance of emulated
rdtsc, there's no reason for them to use it, so I would
be happy if this pvtsc ABI never gets used.  However,
most enterprise apps are sensitive to a performance hit
of several percent and will be eager to try alternatives.

> This can be entirely done within the vsyscall mechanism without any
> app changes.  There's no reason not to.

Performance with app portability is the reason.
 
> > P.S. While it would be nice if we could just tell
> > apps to use a fast vgettimeofday equivalent, this
> > does not exist today and, even if it did, would not
> > be widely available for years in the kernel running under
> > most enterprise app deployments (and, even then,
> > only on 64-bit Linux.)
> 
> These rationales are very unconvincing:
> 
> Making vsyscall work on 32bit is just a matter of doing it; apparently
> nobody has put the effort into it, but there's no fundamental reason
> why it wouldn't work.  Besides, who runs enterprise apps on 32-bit
> these days?  Anything requiring even moderate amounts of memory is
> better run on 64-bit.

Many people run enterprise apps on 32-bit these days, and
I'm not planning on forcing them to switch.  But 32-bit
vs 64-bit is a small parenthetical objection, not
particularly relevant to the main issue.
 
> Your mechanism will require kernel changes anyway, so there's no
> getting around that.

I think that's exactly what the proposal does: gets around
requiring kernel changes.  If kernel changes are required
(other than bolting on a kernel loadable module),
pvtsc is also not an acceptable solution.

> Once vsyscall does Xen/KVM properly, then every app will automatically
> do the right thing without modification.  There's no need for
> specialized APIs that nobody will end up using anyway.

I fully agree that vsyscall is the right longterm answer
but telling the app providers to switch to something that
is non-existent in 100% of their deployments today, has not
yet been implemented sufficiently to be measured, and
probably won't exceed 50% of their deployments within
five years... well I don't expect them to be convinced.

> It only makes sense to go to this kind of effort if it ends up making
> a plain "rdtsc" have the properties you want it to have.

Intel and AMD are responsible for making a plain rdtsc have
the properties you want it to have in a physical environment
and apparently they've done a good enough job that apps are
using it today (albeit with an added layer of glue to handle
certain SMP systems).

Emulating rdtsc provides the same properties in a virtual
environment but at a significant performance cost.

pvtsc is only intended to retrieve some of that performance.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 13:54                                 ` Dan Magenheimer
@ 2009-09-01 14:34                                   ` Keir Fraser
  2009-09-01 14:53                                     ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-09-01 14:34 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

On 01/09/2009 14:54, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 
>> Making vsyscall work...
> 
> While I highly respect your opinion, and while vsyscall
> may be a fine choice in the future, it just doesn't
> solve the problem today and won't solve it ever for
> currently shipping PV OS's.  If you can figure out a
> way to allow vsyscall to be installed as a module and
> still achieve its performance, it
> might be a possible solution, but otherwise we have
> to go around the OS to solve this problem.

Do you believe there's a solution which doesn't involve PV kernel
modifications? I think the suggestions you've made so far would require such
modifications.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 14:34                                   ` Keir Fraser
@ 2009-09-01 14:53                                     ` Dan Magenheimer
  2009-09-01 15:08                                       ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 14:53 UTC (permalink / raw)
  To: Keir Fraser, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

> >> Making vsyscall work...
> > 
> > While I highly respect your opinion, and while vsyscall
> > may be a fine choice in the future, it just doesn't
> > solve the problem today and won't solve it ever for
> > currently shipping PV OS's.  If you can figure out a
> > way to allow vsyscall to be installed as a module and
> > still achieve its performance, it
> > might be a possible solution, but otherwise we have
> > to go around the OS to solve this problem.
> 
> Do you believe there's a solution which doesn't involve PV kernel
> modifications? I think the suggestions you've made so far 
> would require such modifications.

That is certainly my goal.  I *think* the proposal
does NOT require PV OS mods as the communication
is strictly between an app and Xen.  However, I'm
really not familiar with all the subtleties of the
x86 architecture so could be missing something.
I think these are the two key architectural dependencies
that I'm not certain of:

1) fake rdmsr (or hypercall if it works) returns a virtual
   address within a range of addresses that is not "owned by"
   the OS (e.g. maybe in Xen address space?).  The page is
   only readable outside of ring 0, but writeable in ring 0
   (by Xen).
2) All TLB misses on this page are handled directly by Xen
   so the OS never sees the address/page.

If these are OK, and you see other parts of the proposal
that require PV kernel mods, please point them out.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 14:53                                     ` Dan Magenheimer
@ 2009-09-01 15:08                                       ` Keir Fraser
  2009-09-01 15:26                                         ` Dan Magenheimer
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-09-01 15:08 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

On 01/09/2009 15:53, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 1) fake rdmsr (or hypercall if it works) returns a virtual
>    address within a range of addresses that is not "owned by"
>    the OS (e.g. maybe in Xen address space?).  The page is
>    only readable outside of ring 0, but writeable in ring 0
>    (by Xen).
> 2) All TLB misses on this page are handled directly by Xen
>    so the OS never sees the address/page.

I think these are probably possible, at least for a 64-bit hypervisor which
isn't playing segment limit tricks.

> If these are OK, and you see other parts of the proposal
> that require PV kernel mods, please point them out.

Won't the pvclock computation be per-cpu? How will you deal with that?

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 15:08                                       ` Keir Fraser
@ 2009-09-01 15:26                                         ` Dan Magenheimer
  2009-09-01 15:32                                           ` Jan Beulich
  2009-09-01 15:43                                           ` Keir Fraser
  0 siblings, 2 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 15:26 UTC (permalink / raw)
  To: Keir Fraser, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

> On 01/09/2009 15:53, "Dan Magenheimer" 
> <dan.magenheimer@oracle.com> wrote:
> 
> > 1) fake rdmsr (or hypercall if it works) returns a virtual
> >    address within a range of addresses that is not "owned by"
> >    the OS (e.g. maybe in Xen address space?).  The page is
> >    only readable outside of ring 0, but writeable in ring 0
> >    (by Xen).
> > 2) All TLB misses on this page are handled directly by Xen
> >    so the OS never sees the address/page.
> 
> I think these are probably possible, at least for a 64-bit 
> hypervisor which
> isn't playing segment limit tricks.

Will it work for pv32_on_64?  (I don't care much about
32-bit hypervisor.)
 
> > If these are OK, and you see other parts of the proposal
> > that require PV kernel mods, please point them out.
> 
> Won't the pvclock computation be per-cpu? How will you deal with
> that?

Hmmm... is it possible for the same virtual address/page
to map to a different physical address/page on each processor?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 15:26                                         ` Dan Magenheimer
@ 2009-09-01 15:32                                           ` Jan Beulich
  2009-09-01 15:56                                             ` Dan Magenheimer
  2009-09-01 15:43                                           ` Keir Fraser
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2009-09-01 15:32 UTC (permalink / raw)
  To: Keir Fraser, Jeremy Fitzhardinge, Dan Magenheimer
  Cc: Xen-Devel (E-mail), Alan Cox

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:26 >>>
>> On 01/09/2009 15:53, "Dan Magenheimer" 
>> <dan.magenheimer@oracle.com> wrote:
>> 
>> > 1) fake rdmsr (or hypercall if it works) returns a virtual
>> >    address within a range of addresses that is not "owned by"
>> >    the OS (e.g. maybe in Xen address space?).  The page is
>> >    only readable outside of ring 0, but writeable in ring 0
>> >    (by Xen).
>> > 2) All TLB misses on this page are handled directly by Xen
>> >    so the OS never sees the address/page.
>> 
>> I think these are probably possible, at least for a 64-bit 
>> hypervisor which
>> isn't playing segment limit tricks.
>
>Will it work for pv32_on_64?  (I don't care much about
>32-bit hypervisor.)

It can be made to work - you just need to properly arrange this and the
compatibility p2m table.
 
>> > If these are OK, and you see other parts of the proposal
>> > that require PV kernel mods, please point them out.
>> 
>> Won't the pvclock computation be per-cpu? How will you deal with
>> that?
>
>Hmmm... is it possible for the same virtual address/page
>to map to a different physical address/page on each processor?

Not within today's Xen or Linux (which both assume a global kernel
address space, in particular non-root page table entries mapping kernel
space to be the same in all address spaces - you'd need separate entries
at all levels for this).

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 15:26                                         ` Dan Magenheimer
  2009-09-01 15:32                                           ` Jan Beulich
@ 2009-09-01 15:43                                           ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2009-09-01 15:43 UTC (permalink / raw)
  To: Dan Magenheimer, Jeremy Fitzhardinge; +Cc: Xen-Devel (E-mail), Alan Cox

On 01/09/2009 16:26, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> I think these are probably possible, at least for a 64-bit
>> hypervisor which
>> isn't playing segment limit tricks.
> 
> Will it work for pv32_on_64?  (I don't care much about
> 32-bit hypervisor.)

It could do. Space is reserved at the top of 4GB for the M2P tables, and I
suppose such a mapping could go there.

>> Won't the pvclock computation be per-cpu? How will you deal with
>> that?
> 
> Hmmm... is it possible for the same virtual address/page
> to map to a different physical address/page on each processor?

Not without PV guest kernel support. The guest kernel manages the page
directories. And Linux runs threads on exactly the same pagetables across
different cpus. That would have to change.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 15:32                                           ` Jan Beulich
@ 2009-09-01 15:56                                             ` Dan Magenheimer
  2009-09-01 16:04                                               ` Jan Beulich
  2009-09-01 16:06                                               ` Keir Fraser
  0 siblings, 2 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 15:56 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser, Jeremy Fitzhardinge
  Cc: Xen-Devel (E-mail), Alan Cox

> >> > If these are OK, and you see other parts of the proposal
> >> > that require PV kernel mods, please point them out.
> >> 
> >> Won't the pvclock computation be per-cpu? How will you deal with
> >> that?
> >
> >Hmmm... is it possible for the same virtual address/page
> >to map to a different physical address/page on each processor?
> 
> Not within today's Xen or Linux (which both assume a global kernel
> address space, in particular non-root page table entries 
> mapping kernel
> space to be the same in all address spaces - you'd need 
> separate entries
> at all levels for this).

OK, I forgot: No software-accessible TLB.

Can you think of any trick (that doesn't require the cost of a
trap/hypercall) to allow an app to determine what pcpu
it is running on?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 15:56                                             ` Dan Magenheimer
@ 2009-09-01 16:04                                               ` Jan Beulich
  2009-09-01 16:41                                                 ` Dan Magenheimer
  2009-09-01 21:25                                                 ` Keir Fraser
  2009-09-01 16:06                                               ` Keir Fraser
  1 sibling, 2 replies; 61+ messages in thread
From: Jan Beulich @ 2009-09-01 16:04 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser, Alan Cox

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>>
>Can you think of any trick (that doesn't require the cost of a
>trap/hypercall) to allow an app to determine what pcpu
>it is running on?

Just like what is being used to allow apps to get the CPU number on native
kernels (or the vCPU one on Xen-ified ones): Have a GDT entry the limit of
which is the number you want, and have the app use the lsl instruction to
get at it.
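
A minimal C sketch of that trick, assuming the encoding Linux's x86-64
vgetcpu code used around this time (CPU number in the low 12 bits of the
segment limit, node number in the bits above). The helper names are
invented for illustration, the selector is left to the caller because it
depends on the GDT layout, and a descriptor readable from ring 3 is
assumed to exist already:

```c
#include <stdint.h>

/* Assumed layout of the limit: low 12 bits = CPU number, rest = node. */
#define CPU_BITS 12u
#define CPU_MASK ((1u << CPU_BITS) - 1u)

/* Value the kernel (or Xen) would store as the descriptor's limit. */
static inline uint32_t encode_limit(uint32_t cpu, uint32_t node)
{
    return (node << CPU_BITS) | (cpu & CPU_MASK);
}

/* Decoders the app applies to whatever LSL returned. */
static inline uint32_t limit_to_cpu(uint32_t limit)  { return limit & CPU_MASK; }
static inline uint32_t limit_to_node(uint32_t limit) { return limit >> CPU_BITS; }

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/*
 * Read the limit of the descriptor named by `selector` without trapping.
 * If the selector is not accessible at the current privilege level, LSL
 * leaves the destination untouched, so this sketch returns 0 in that
 * case (indistinguishable here from a genuine limit of 0).
 */
static inline uint32_t read_segment_limit(uint32_t selector)
{
    uint32_t limit = 0;
    __asm__ volatile("lsl %1, %0" : "+r"(limit) : "r"(selector) : "cc");
    return limit;
}
#endif
```

An app would call limit_to_cpu(read_segment_limit(sel)) with whatever
selector the kernel or hypervisor documents for this purpose.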

I am, however, always a little bit concerned when it comes to exposing
information that shouldn't really be exposed, due to the possibility of
overlooking potential misuses. In the specific case here, I can't see at all
why you'd want the pCPU number exposed - after all the kernel can do what
you want apps to do without having that information.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 15:56                                             ` Dan Magenheimer
  2009-09-01 16:04                                               ` Jan Beulich
@ 2009-09-01 16:06                                               ` Keir Fraser
  2009-09-01 16:55                                                 ` Dan Magenheimer
  1 sibling, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-09-01 16:06 UTC (permalink / raw)
  To: Dan Magenheimer, Jan Beulich, Jeremy Fitzhardinge
  Cc: Xen-Devel (E-mail), Alan Cox

On 01/09/2009 16:56, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> Not within today's Xen or Linux (which both assume a global kernel
>> address space, in particular non-root page table entries
>> mapping kernel
>> space to be the same in all address spaces - you'd need
>> separate entries
>> at all levels for this).
> 
> OK, I forgot: No software-accessible TLB.
> 
> Can you think of any trick (that doesn't require the cost of a
> trap/hypercall) to allow an app to determine what pcpu
> it is running on?

I can't think of any that don't require kernel modifications. Which takes us
back to considering vsyscall, perhaps.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 16:04                                               ` Jan Beulich
@ 2009-09-01 16:41                                                 ` Dan Magenheimer
  2009-09-02  7:05                                                   ` Jan Beulich
  2009-09-01 21:25                                                 ` Keir Fraser
  1 sibling, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 16:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser, Alan Cox

> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>>
> >Can you think of any trick (that doesn't require the cost of a
> >trap/hypercall) to allow an app to determine what pcpu
> >it is running on?
> 
> Just like what is being used to allow apps to get the CPU 
> number on native
> kernels (or the vCPU one on Xen-ified ones): Have a GDT entry 
> the limit of
> which is the number you want, and have the app use the lsl 
> instruction to
> get at it.

Can you explain more?  Will this work for a userland
process to get its current pcpu (not vcpu)?

> I am, however, always a little bit concerned when it comes to exposing
> information that shouldn't really be exposed, due to the 
> possibility of
> overlooking potential misuses. In the specific case here, I 
> can't see at all
> why you'd want the pCPU number exposed

There is one pvclock "struct" for each pcpu.  We want
an app to "see" the right one.  If that's not possible,
we want the app to see the whole array of them and be
able to properly index into the array.

If possible, I'd like to see if we can identify a solution
at all, and then discard it if the issues are too difficult
to overcome.

> after all the kernel can do what
> you want apps to do without having that information.

In the current Linux 2.6.30 implementation of pvclock
it can do it, but it can't do it fast.  In versions
of the kernel prior to 2.6.28(?), it can't do it at
all, correct?

Thanks,
Dan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 16:06                                               ` Keir Fraser
@ 2009-09-01 16:55                                                 ` Dan Magenheimer
  0 siblings, 0 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 16:55 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich, Jeremy Fitzhardinge
  Cc: Xen-Devel (E-mail), Alan Cox

> >> Not within today's Xen or Linux (which both assume a global kernel
> >> address space, in particular non-root page table entries
> >> mapping kernel
> >> space to be the same in all address spaces - you'd need
> >> separate entries
> >> at all levels for this).
> > 
> > OK, I forgot: No software-accessible TLB.
> > 
> > Can you think of any trick (that doesn't require the cost of a
> > trap/hypercall) to allow an app to determine what pcpu
> > it is running on?
> 
> I can't think of any that don't require kernel modifications. 
> Which takes us
> back to considering vsyscall, perhaps.
> 
>  -- Keir

If a solution that doesn't require kernel mods is not
possible, then I suspect apps will continue to use
rdtsc as-is and suffer the emulation overhead.
Requiring all customers to update the OS underlying
these apps is a non-starter.

Also, it has yet to be proven that pvclock can
work in a vsyscall.  Doesn't the same per-cpu in
userspace problem exist?   Pvclock without vsyscall
has been measured and is too slow, so until a vsyscall
version of pvclock
is implemented and measured (let alone upstream or
available in distros), it's hard to call it an
alternative to consider, even for the future.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-01 16:04                                               ` Jan Beulich
  2009-09-01 16:41                                                 ` Dan Magenheimer
@ 2009-09-01 21:25                                                 ` Keir Fraser
  2009-09-01 22:08                                                   ` Dan Magenheimer
  2009-09-02  7:01                                                   ` Jan Beulich
  1 sibling, 2 replies; 61+ messages in thread
From: Keir Fraser @ 2009-09-01 21:25 UTC (permalink / raw)
  To: Jan Beulich, Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Alan Cox

On 01/09/2009 17:04, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>>
>> Can you think of any trick (that doesn't require the cost of a
>> trap/hypercall) to allow an app to determine what pcpu
>> it is running on?
> 
> Just like what is being used to allow apps to get the CPU number on native
> kernels (or the vCPU one on Xen-ified ones): Have a GDT entry the limit of
> which is the number you want, and have the app use the lsl instruction to
> get at it.

Yes, that's true. Xen could provide such a segment descriptor in its private
area of the GDT. The issue then would be that, in a compound pvclock
operation spanning multiple machine instructions, the pCPU number revealed
by the LSL instruction can be stale by the time it is used later in the
compound operation.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-01 21:25                                                 ` Keir Fraser
@ 2009-09-01 22:08                                                   ` Dan Magenheimer
  2009-09-01 22:21                                                     ` Jeremy Fitzhardinge
  2009-09-02  7:16                                                     ` Jan Beulich
  2009-09-02  7:01                                                   ` Jan Beulich
  1 sibling, 2 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 22:08 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Alan Cox

> > Just like what is being used to allow apps to get the CPU
> > number on native kernels (or the vCPU one on Xen-ified ones):
> > Have a GDT entry the limit of which is the number you want,
> > and have the app use the lsl instruction to get at it.
> 
> Yes, that's true. Xen could provide such a segment descriptor 
> in its private
> area of the GDT. The issue then would be that, in a compound pvclock
> operation spanning multiple machine instructions, the pCPU 
> number revealed
> by the LSL instruction can be stale by the time it is used 
> later in the
> compound operation.

The algorithm could check the pCPU number before and after
reading the pvclock data and doing the rdtsc, and if they
don't match, start again.  (Doesn't the pvclock algorithm
already do that with some versioning number in the pvclock
data itself to ensure that the rest of the data didn't
change while it was being read?)
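
For reference, a minimal C sketch of the version-checked read being
alluded to. The struct mirrors the layout of the Xen/Linux per-vCPU time
record; the function names and the fake TSC source are assumptions made
only so the sketch runs anywhere, and the fences stand in for the
kernel's read barriers:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Mirrors the pvclock per-vCPU time record layout. */
struct pvclock_time_info {
    uint32_t version;           /* odd while an update is in progress */
    uint32_t pad0;
    uint64_t tsc_timestamp;     /* TSC value at the last update */
    uint64_t system_time;       /* nanoseconds at the last update */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;         /* shift applied to the TSC delta first */
    uint8_t  pad[3];
};

/* Hypothetical stand-in for rdtsc so the sketch is portable. */
static uint64_t fake_tsc;
static uint64_t read_tsc(void) { return fake_tsc; }

/* ((delta shifted by `shift`) * mul) >> 32, without 128-bit arithmetic. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Retry until a stable, even version brackets the whole snapshot. */
static uint64_t pvclock_read_ns(const volatile struct pvclock_time_info *ti)
{
    uint32_t version, mul;
    uint64_t stamp, base, tsc;
    int8_t shift;

    do {
        version = ti->version;
        atomic_thread_fence(memory_order_acquire);
        stamp = ti->tsc_timestamp;
        base  = ti->system_time;
        mul   = ti->tsc_to_system_mul;
        shift = ti->tsc_shift;
        tsc   = read_tsc();
        atomic_thread_fence(memory_order_acquire);
    } while ((version & 1) || version != ti->version);

    return base + scale_delta(tsc - stamp, mul, shift);
}
```

The version check only covers updates to this one record; it says
nothing about which CPU the TSC was read on.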

I'm clueless about GDTs and the LSL instruction so would
need some help prototyping this.

Dan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-01 22:08                                                   ` Dan Magenheimer
@ 2009-09-01 22:21                                                     ` Jeremy Fitzhardinge
  2009-09-01 22:41                                                       ` Dan Magenheimer
  2009-09-02  7:16                                                     ` Jan Beulich
  1 sibling, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-01 22:21 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Alan Cox, Xen-Devel (E-mail), Keir Fraser, Jan Beulich

On 09/01/09 15:08, Dan Magenheimer wrote:
>>> Just like what is being used to allow apps to get the CPU
>>> number on native kernels (or the vCPU one on Xen-ified ones):
>>> Have a GDT entry the limit of which is the number you want,
>>> and have the app use the lsl instruction to get at it.
>>>
>> Yes, that's true. Xen could provide such a segment descriptor 
>> in its private
>> area of the GDT. The issue then would be that, in a compound pvclock
>> operation spanning multiple machine instructions, the pCPU 
>> number revealed
>> by the LSL instruction can be stale by the time it is used 
>> later in the
>> compound operation.
>>     
> The algorithm could check the pCPU number before and after
> reading the pvclock data and doing the rdtsc, and if they
> don't match, start again.  (Doesn't the pvclock algorithm
> already do that with some versioning number in the pvclock
> data itself to ensure that the rest of the data didn't
> change while it was being read?)
>   
There's still a race there, if the thread switched PCPU twice during the
operation:

    <running on PCPU A>
    get CPU #
    <switch to PCPU B>
    read tsc
    apply corrections (from PCPU A)
    <switch to PCPU A>
    check CPU # is the same as we started with: all OK!

note that the <switch to PCPU B> could either be a result of the Xen
scheduler moving the VCPU *or* the Linux scheduler moving the thread to
a different VCPU.  In the former case, Xen could update a version
counter to help detect the discontinuity, but it doesn't really know
about guest scheduling decisions.  I guess the guest kernel could update
the pvclock version counter itself.

> I'm clueless about GDTs and the LSL instruction so would
> need some help prototyping this.
>   

It's what vsyscall already uses.  Your scheme is precisely analogous to
what's already there.

    J

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-01 22:21                                                     ` Jeremy Fitzhardinge
@ 2009-09-01 22:41                                                       ` Dan Magenheimer
  2009-09-01 23:26                                                         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-01 22:41 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Jan Beulich, Xen-Devel (E-mail), Keir Fraser, Alan Cox

> There's still a race there

Good point.  Essentially we need to ensure that
{{rdtsc and the pvclock struct values for PCPU-X}}
are obtained atomically and there's no way to guarantee
that (at least without incurring overhead that's likely
to exceed just emulating rdtsc to begin with).
 
> > I'm clueless about GDTs and the LSL instruction so would
> > need some help prototyping this.
> >   
> 
> It's what vsyscall already uses.  Your scheme is precisely 
> analogous to
> what's already there.

(...except if it can be done entirely in the app with no
OS dependencies)

Won't pvclock+vsyscall have the same race?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-01 22:41                                                       ` Dan Magenheimer
@ 2009-09-01 23:26                                                         ` Jeremy Fitzhardinge
  2009-09-02  7:20                                                           ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-01 23:26 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: Xen-Devel (E-mail), Jan Beulich, Keir Fraser, Alan Cox

On 09/01/09 15:41, Dan Magenheimer wrote:
> Won't pvclock+vsyscall have the same race?
>   

Yes, it would need to be resolved either way.

    J

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 21:25                                                 ` Keir Fraser
  2009-09-01 22:08                                                   ` Dan Magenheimer
@ 2009-09-02  7:01                                                   ` Jan Beulich
  1 sibling, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2009-09-02  7:01 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Alan Cox

>>> Keir Fraser <keir.fraser@eu.citrix.com> 01.09.09 23:25 >>>
>On 01/09/2009 17:04, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>>
>>> Can you think of any trick (that doesn't require the cost of a
>>> trap/hypercall) to allow an app to determine what pcpu
>>> it is running on?
>> 
>> Just like what is being used to allow apps to get the CPU number on native
>> kernels (or the vCPU one on Xen-ified ones): Have a GDT entry the limit of
>> which is the number you want, and have the app use the lsl instruction to
>> get at it.
>
>Yes, that's true. Xen could provide such a segment descriptor in its private
>area of the GDT. The issue then would be that, in a compound pvclock

And in fact there already is such a descriptor, just with DPL=0.

>operation spanning multiple machine instructions, the pCPU number revealed
>by the LSL instruction can be stale by the time it is used later in the
>compound operation.

Correct.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 16:41                                                 ` Dan Magenheimer
@ 2009-09-02  7:05                                                   ` Jan Beulich
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2009-09-02  7:05 UTC (permalink / raw)
  To: Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Keir Fraser, Alan Cox

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 18:41 >>>
>> >>> Dan Magenheimer <dan.magenheimer@oracle.com> 01.09.09 17:56 >>>
>> >Can you think of any trick (that doesn't require the cost of a
>> >trap/hypercall) to allow an app to determine what pcpu
>> >it is running on?
>> 
>> Just like what is being used to allow apps to get the CPU 
>> number on native
>> kernels (or the vCPU one on Xen-ified ones): Have a GDT entry 
>> the limit of
>> which is the number you want, and have the app use the lsl 
>> instruction to
>> get at it.
>
>Can you explain more?  Will this work for a userland
>process to get its current pcpu (not vcpu)?

Sure, if the descriptor's DPL is set to 3.

>> I am, however, always a little bit concerned when it comes to exposing
>> information that shouldn't really be exposed, due to the 
>> possibility of
>> overlooking potential misuses. In the specific case here, I 
>> can't see at all
>> why you'd want the pCPU number exposed
>
>There is one pvclock "struct" for each pcpu.  We want
>an app to "see" the right one.  If that's not possible,
>we want the app to see the whole array of them and be
>able to properly index into the array.

These pvclock structs should be per vCPU, shouldn't they? The
hypervisor ensures that the per-vCPU structure reflects the proper
state on the pCPU that vCPU is currently running on.

>> after all the kernel can do what
>> you want apps to do without having that information.
>
>In the current Linux 2.6.30 implementation of pvclock
>it can do it, but it can't do it fast.  In versions
>of the kernel prior to 2.6.28(?), it can't do it at
>all, correct?

I don't think it ever uses a pCPU number. If it does, just point me to
where this is happening.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-01 22:08                                                   ` Dan Magenheimer
  2009-09-01 22:21                                                     ` Jeremy Fitzhardinge
@ 2009-09-02  7:16                                                     ` Jan Beulich
  1 sibling, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2009-09-02  7:16 UTC (permalink / raw)
  To: Keir Fraser, Dan Magenheimer
  Cc: Jeremy Fitzhardinge, Xen-Devel (E-mail), Alan Cox

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 02.09.09 00:08 >>>
>> > Just like what is being used to allow apps to get the CPU
>> > number on native kernels (or the vCPU one on Xen-ified ones):
>> > Have a GDT entry the limit of which is the number you want,
>> > and have the app use the lsl instruction to get at it.
>> 
>> Yes, that's true. Xen could provide such a segment descriptor 
>> in its private
>> area of the GDT. The issue then would be that, in a compound pvclock
>> operation spanning multiple machine instructions, the pCPU 
>> number revealed
>> by the LSL instruction can be stale by the time it is used 
>> later in the
>> compound operation.
>
>The algorithm could check the pCPU number before and after
>reading the pvclock data and doing the rdtsc, and if they
>don't match, start again.  (Doesn't the pvclock algorithm
>already do that with some versioning number in the pvclock
>data itself to ensure that the rest of the data didn't
>change while it was being read?)

No, that won't do - the underlying pCPU may change multiple times
during that process.

>I'm clueless about GDTs and the LSL instruction so would
>need some help prototyping this.

As said in another reply, such a descriptor already exists
(PER_CPU_GDT_ENTRY).

But as also already said, I doubt you really need this.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-01 23:26                                                         ` Jeremy Fitzhardinge
@ 2009-09-02  7:20                                                           ` Keir Fraser
  2009-09-02 21:44                                                             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-09-02  7:20 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Dan Magenheimer
  Cc: Xen-Devel (E-mail), Jan Beulich, Alan Cox

On 02/09/2009 00:26, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> On 09/01/09 15:41, Dan Magenheimer wrote:
>> Won't pvclock+vsyscall have the same race?
> 
> Yes, it would need to be resolved either way.

The problem is a bit easier with vsyscall potentially. For example, give
each thread its own vsyscall clock data area (easy?), updated by kernel
whenever the thread is scheduled, and increment a version counter, checked
before and after by the vsyscall operation.
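
A rough single-threaded simulation of that idea in C, to show why a
counter bumped on every reschedule catches the A -> B -> A case that a
plain CPU-number comparison misses. Everything here is hypothetical: the
per-thread area, the names, the scripted "preemption points", and the
simplification that one TSC tick equals one nanosecond. Real code would
also need memory barriers around the snapshot:

```c
#include <stdint.h>

/* Hypothetical per-thread clock area, rewritten by the kernel. */
struct thread_clock {
    uint32_t version;   /* bumped on every (re)schedule of the thread */
    uint64_t tsc_base;  /* parameters valid for the CPU we run on now */
    uint64_t ns_base;
};

/* Simulated CPUs: unsynchronized TSCs, same wall-clock time (10500 ns). */
struct sim_cpu { uint64_t tsc, tsc_base, ns_base; };
static struct sim_cpu cpus[2] = {
    { 1500, 1000, 10000 },  /* CPU A */
    {  700,  200, 10000 },  /* CPU B */
};

static int cur_cpu;
static struct thread_clock area;

/* What the kernel would do when it schedules the thread onto `cpu`. */
static void kernel_schedule_on(int cpu)
{
    cur_cpu = cpu;
    area.tsc_base = cpus[cpu].tsc_base;
    area.ns_base  = cpus[cpu].ns_base;
    area.version++;
}

/* Scripted migrations: first to B, then back to A, then none. */
static const int script[] = { 1, 0 };
static unsigned script_pos;
static void preemption_point(void)
{
    if (script_pos < sizeof(script) / sizeof(script[0]))
        kernel_schedule_on(script[script_pos++]);
}

/* Stand-in for rdtsc: reads the TSC of whichever CPU we are on. */
static uint64_t read_tsc(void) { return cpus[cur_cpu].tsc; }

/* Usermode reader: retry whenever the version moved under us. */
static uint64_t thread_clock_read(unsigned *retries)
{
    uint32_t v;
    uint64_t tsc_base, ns_base, tsc;

    *retries = 0;
    for (;;) {
        v = area.version;
        tsc_base = area.tsc_base;
        ns_base  = area.ns_base;
        preemption_point();      /* may move us to another CPU */
        tsc = read_tsc();
        preemption_point();      /* may move us back again */
        if (v == area.version)
            break;               /* snapshot and TSC belong together */
        (*retries)++;
    }
    return ns_base + (tsc - tsc_base);
}
```

In the scripted run the first pass pairs CPU A's parameters with CPU B's
TSC and ends back on CPU A, so a CPU-number check would accept it; the
version has moved by two, so the reader retries and gets a consistent
pair on the second pass.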

Well, I don't know how easy or fast that could actually be implemented, but
I'm at least confident it could work. But it does need kernel assistance.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-02  7:20                                                           ` Keir Fraser
@ 2009-09-02 21:44                                                             ` Jeremy Fitzhardinge
  2009-09-02 21:50                                                               ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-02 21:44 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Dan Magenheimer, Xen-Devel (E-mail), Jan Beulich, Alan Cox

On 09/02/09 00:20, Keir Fraser wrote:
> The problem is a bit easier with vsyscall potentially. For example, give
> each thread its own vsyscall clock data area (easy?), updated by kernel
> whenever the thread is scheduled, and increment a version counter, checked
> before and after by the vsyscall operation.
>   

Yes.  Perhaps the very simplest way would be to make the kernel update
the pvclock version counter on context switch, the same way Xen does;
that would allow the usermode vsyscall code to use exactly the same
algorithm as the kernel code.  Would Xen cope with that?

> Well, I don't know how easy or fast that could actually be implemented, but
> I'm at least confident it could work. But it does need kernel assistance.
>   

Yes.  I'm very uneasy about letting usermode have direct access to bits
of Xen without the kernel's knowledge anyway.  It suddenly means we need
to not only maintain a Xen<->kernel ABI, but a Xen<->usermode ABI as well.

    J

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-02 21:44                                                             ` Jeremy Fitzhardinge
@ 2009-09-02 21:50                                                               ` Keir Fraser
  2009-09-02 22:05                                                                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2009-09-02 21:50 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Jan Beulich, Alan Cox

On 02/09/2009 22:44, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> On 09/02/09 00:20, Keir Fraser wrote:
>> The problem is a bit easier with vsyscall potentially. For example, give
>> each thread its own vsyscall clock data area (easy?), updated by kernel
>> whenever the thread is scheduled, and increment a version counter, checked
>> before and after by the vsyscall operation.
>>   
> 
> Yes.  Perhaps the very simplest way would be to make the kernel update
> the pvclock version counter on context switch, the same way Xen does;
> that would allow the usermode vsyscall code to use exactly the same
> algorithm as the kernel code.  Would Xen cope with that?

Yes, that's basically how I would envision it working. The main missing
detail afaics is how to manage and access the required per-thread data.

 -- Keir

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-02 21:50                                                               ` Keir Fraser
@ 2009-09-02 22:05                                                                 ` Jeremy Fitzhardinge
  2009-09-03  8:23                                                                   ` Jan Beulich
  2009-09-03 14:22                                                                   ` Dan Magenheimer
  0 siblings, 2 replies; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-02 22:05 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Dan Magenheimer, Xen-Devel (E-mail), Jan Beulich, Alan Cox

On 09/02/09 14:50, Keir Fraser wrote:
>> Yes.  Perhaps the very simplest way would be to make the kernel update
>> the pvclock version counter on context switch, the same way Xen does;
>> that would allow the usermode vsyscall code to use exactly the same
>> algorithm as the kernel code.  Would Xen cope with that?
>>     
> Yes, that's basically how I would envision it working. The main missing
> detail afaics is how to manage and access the required per-thread data.
>   

I was imagining:

   1. Add a hypercall to set the desired location of the clock
      correction info rather than putting it in the shared-info area
      (akin to vcpu placement).  KVM already has this; they write the
      address to a magic MSR.
   2. Pack all the clock structures into a single page, indexed by vcpu
      number
   3. Map that RO into userspace via fixmap, like the vsyscall page itself
   4. Use the lsl trick to get the current vcpu to index into the array,
      then compute a time value using tsc with corrections; iterate if
      version stamp changes under our feet.
   5. On context switch, the kernel would increment the version of the
      *old* vcpu clock structure, so that when the usermode code
      re-checks the version at the end of its time calculation, it can
      tell that it has a stale vcpu and it needs to iterate with a new
      vcpu+clock structure
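
Steps 2, 4 and 5 could be sketched roughly as follows. This is an
illustration under stated assumptions, not working kernel or vsyscall
code: struct clk_entry is only modelled on pvclock_vcpu_time_info, all
clk_* names are hypothetical, and the TSC value is passed in rather than
read with rdtsc so the sketch stays portable. The lsl lookup of the
current vcpu is not shown; the caller is assumed to have done it.

```c
#include <stdint.h>

/* Modelled on pvclock_vcpu_time_info; 32 bytes, so 128 fit in a 4K page. */
struct clk_entry {
    volatile uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  pad[3];
};

/* Scale a TSC delta to nanoseconds: shift, then multiply by mul / 2^32. */
static uint64_t clk_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Step 4: sample the version before touching the clock parameters. */
static uint32_t clk_read_begin(const struct clk_entry *e)
{
    uint32_t v = e->version;
    __sync_synchronize();
    return v;
}

/* Step 4: compute a time value using the TSC with corrections. */
static uint64_t clk_compute_ns(const struct clk_entry *e, uint64_t tsc)
{
    return e->system_time + clk_scale_delta(tsc - e->tsc_timestamp,
                                            e->tsc_to_system_mul,
                                            e->tsc_shift);
}

/* Step 4: nonzero means the version moved (or was odd), so iterate. */
static int clk_read_retry(const struct clk_entry *e, uint32_t v)
{
    __sync_synchronize();
    return (v & 1) || e->version != v;
}

/* Step 5: on context switch the kernel bumps the *old* vcpu's version. */
static void clk_ctxtsw(struct clk_entry *old_vcpu_entry)
{
    old_vcpu_entry->version += 2;
}
```

Usermode would loop: look up the current vcpu, call clk_read_begin() on
that vcpu's entry, compute the time, and start over with a fresh vcpu
lookup whenever clk_read_retry() returns nonzero.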

    J

* Re: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-02 22:05                                                                 ` Jeremy Fitzhardinge
@ 2009-09-03  8:23                                                                   ` Jan Beulich
  2009-09-03 17:29                                                                     ` Jeremy Fitzhardinge
  2009-09-03 14:22                                                                   ` Dan Magenheimer
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2009-09-03  8:23 UTC (permalink / raw)
  To: Keir Fraser, Jeremy Fitzhardinge
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Alan Cox

>>> Jeremy Fitzhardinge <jeremy@goop.org> 03.09.09 00:05 >>>
>   1. Add a hypercall to set the desired location of the clock
>      correction info rather than putting it in the shared-info area
>      (akin to vcpu placement).  KVM already has this; they write the
>      address to a magic MSR.

But this is already subject to placement, as it's part of the vcpu_info
structure. While of course you don't want to make the whole vcpu_info
visible to guests, it would seem awkward to further segregate the
shared_info pieces. I'd rather consider adding a second (optional) copy
of it, since the updating of this is rather little overhead in Xen, but
using this in the kernel time handling code would eliminate the
potential for accessing all the vcpu_info fields using percpu_read().

>   2. Pack all the clock structures into a single page, indexed by vcpu
>      number

That adds a scalability issue, albeit a relatively light one: You shouldn't
anymore assume there's a limit on the number of vCPU-s. 

>   3. Map that RO into userspace via fixmap, like the vsyscall page itself
>   4. Use the lsl trick to get the current vcpu to index into the array,
>      then compute a time value using tsc with corrections; iterate if
>      version stamp changes under our feet.
>   5. On context switch, the kernel would increment the version of the
>      *old* vcpu clock structure, so that when the usermode code
>      re-checks the version at the end of its time calculation, it can
>      tell that it has a stale vcpu and it needs to iterate with a new
>      vcpu+clock structure

I don't think you can re-use the hypervisor updated version field here,
unless you add a protocol on how the two updaters avoid collision.
struct vcpu_time_info has a padding field, which might be designated
as guest-kernel-version.

Jan

* RE: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-02 22:05                                                                 ` Jeremy Fitzhardinge
  2009-09-03  8:23                                                                   ` Jan Beulich
@ 2009-09-03 14:22                                                                   ` Dan Magenheimer
  1 sibling, 0 replies; 61+ messages in thread
From: Dan Magenheimer @ 2009-09-03 14:22 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Keir Fraser
  Cc: Xen-Devel (E-mail), Jan Beulich, Alan Cox

> I was imagining:
> 
>    1. Add a hypercall to set the desired location of the clock
>       correction info rather than putting it in the shared-info area
>       (akin to vcpu placement).  KVM already has this; they write the
>       address to a magic MSR.
>    2. Pack all the clock structures into a single page, indexed by vcpu
>       number
>    3. Map that RO into userspace via fixmap, like the vsyscall page itself
>    4. Use the lsl trick to get the current vcpu to index into the array,
>       then compute a time value using tsc with corrections; iterate if
>       version stamp changes under our feet.
>    5. On context switch, the kernel would increment the version of the
>       *old* vcpu clock structure, so that when the usermode code
>       re-checks the version at the end of its time calculation, it can
>       tell that it has a stale vcpu and it needs to iterate with a new
>       vcpu+clock structure

It would be nice to see a prototyped version of this so
it could be confirmed that it works, the kernel impact
can be evaluated, performance can be measured, and,
if all looks good, distros can start putting it into
their kernels.

Also, it would be nice if there is some way for apps to
determine if it is present and working, e.g.

if (clock_gettime_performance_doesnt_suck)
   t = clock_gettime();
else {
   t = rdtsc();
   apply_post_processing(t);
}

as apparently sysctl.vsyscall64==1 is not sufficient.

In fact, if there can be agreement as to how this
determination can be done (sysctl.fastpvclock==1??)
apps could start getting ready.

Dan

* Re: RE: rdtsc: correctness vs performance on Xen   (and KVM?)
  2009-09-03  8:23                                                                   ` Jan Beulich
@ 2009-09-03 17:29                                                                     ` Jeremy Fitzhardinge
  2009-09-04  7:19                                                                       ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-03 17:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser, Alan Cox

On 09/03/09 01:23, Jan Beulich wrote:
>>   1. Add a hypercall to set the desired location of the clock
>>      correction info rather than putting it in the shared-info area
>>      (akin to vcpu placement).  KVM already has this; they write the
>>      address to a magic MSR.
>>     
> But this is already subject to placement, as it's part of the vcpu_info
> structure. While of course you don't want to make the whole vcpu_info
> visible to guests, it would seem awkward to further segregate the
> shared_info pieces. I'd rather consider adding a second (optional) copy
> of it, since the updating of this is rather little overhead in Xen,

Hm, I guess that's possible.  Though once you've added a new "other time
struct" pointer, it would be easier to just make Xen update that pointer
rather than update two.  I don't think a guest is going to know/care
about having two versions of the info (except that it opens the
possibility of getting confused by looking at the wrong one).

I'd propose that there'd be just one, and the non-valid pvclock
structure have its version set to 0xffffffff, since a guest should never
see a version in that state.

>  but
> using this in the kernel time handling code would eliminate the
> potential for accessing all the vcpu_info fields using percpu_read().
>   

I don't think that's a big concern.  The kernel's pvclock handling is
common between Xen and KVM now, and it just gets a pointer to the
structure; it never accesses it as a percpu variable.

>>   2. Pack all the clock structures into a single page, indexed by vcpu
>>      number
>>     
> That adds a scalability issue, albeit a relatively light one: You shouldn't
> anymore assume there's a limit on the number of vCPU-s. 
>   

Well, that's up to the kernel rather than Xen.  If there are a lot of CPUs
it can span multiple pages.  There's no need to make them physically
contiguous, since the kernel never needs to treat them as an array and
we can map disjoint pages contiguously into userspace (it might take a
chunk of fixmap slots).
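
As a rough sketch of the indexing this implies, assuming 4K pages and a
32-byte per-vcpu clock structure (both assumptions, as are all the names
below): the kernel tracks which page and slot a vcpu's structure lives
in, while usermode only needs a byte offset into the contiguous mapping.

```c
#include <stddef.h>

#define CLK_PAGE_SIZE   4096u
#define CLK_ENTRY_SIZE  32u   /* assumed size of one per-vcpu structure */
#define CLK_PER_PAGE    (CLK_PAGE_SIZE / CLK_ENTRY_SIZE)

/* Hypothetical result type: where a vcpu's clock structure lives. */
struct clk_slot {
    unsigned page;       /* which (possibly discontiguous) kernel page */
    unsigned index;      /* slot within that page */
    size_t user_offset;  /* byte offset in the contiguous user mapping */
};

static struct clk_slot clk_locate(unsigned vcpu)
{
    struct clk_slot s;

    s.page = vcpu / CLK_PER_PAGE;
    s.index = vcpu % CLK_PER_PAGE;
    s.user_offset = (size_t)vcpu * CLK_ENTRY_SIZE;
    return s;
}
```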

I guess one concern is that it ends up exposing the scheduling info
about all the VCPUs to all usermode.  I doubt that's a problem in
itself, but who knows if it could be used as part of a larger attack.

>>   5. On context switch, the kernel would increment the version of the
>>      *old* vcpu clock structure, so that when the usermode code
>>      re-checks the version at the end of its time calculation, it can
>>      tell that it has a stale vcpu and it needs to iterate with a new
>>      vcpu+clock structure
>>     
> I don't think you can re-use the hypervisor updated version field here,
> unless you add a protocol on how the two updaters avoid collision.
> struct vcpu_time_info has a padding field, which might be designated
> as guest-kernel-version.
>   

There's no padding.  It would be an extension of the pvclock ABI, which
KVM also implements, so we'd need to make sure they can cope too.

We only need to worry about Xen preempting a kernel update rather than
the other way around.  I think it ends up being very simple:

void ctxtsw_update_pvclock(struct pvclock_vcpu_time_info *pvclock)
{
	BUG_ON(preemptible());	/* Switching VCPUs would be a disaster */

	/*
	 * We just need to update version; if Xen did it behind our back, then
	 * that's OK with us.  We should never see an update-in-progress because Xen
	 * will always completely update the pvclock structure before rescheduling the
	 * VCPU, so version should always be even.  We don't care if Xen updates the
	 * timing parameters here because we're not in the middle of a clock read.
	 * Usermode might be in the middle of a read, but all it needs to see is version
	 * changing to a new even number, even if this add gets preempted by Xen in
	 * the middle.  There are no cross-PCPU writes going on, so we don't need to
	 * worry about bus-level atomicity.
	 */
	pvclock->version += 2;
}

Looks like this would work for KVM too.

    J

* Re: RE: rdtsc: correctness vs performance on Xen (and KVM?)
  2009-09-03 17:29                                                                     ` Jeremy Fitzhardinge
@ 2009-09-04  7:19                                                                       ` Jan Beulich
  2009-09-04 15:44                                                                         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2009-09-04  7:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser, Alan Cox

>>> Jeremy Fitzhardinge <jeremy@goop.org> 03.09.09 19:29 >>>
>On 09/03/09 01:23, Jan Beulich wrote:
>> I don't think you can re-use the hypervisor updated version field here,
>> unless you add a protocol on how the two updaters avoid collision.
>> struct vcpu_time_info has a padding field, which might be designated
>> as guest-kernel-version.
>>   
>
>There's no padding.  It would be an extension of the pvclock ABI, which
>KVM also implements, so we'd need to make sure they can cope too.

struct pvclock_vcpu_time_info has a 'pad0' field afaics.

>We only need to worry about Xen preempting a kernel update rather than
>the other way around.  I think it ends up being very simple:
>
>void ctxtsw_update_pvclock(struct pvclock_vcpu_time_info *pvclock)
>{
>	BUG_ON(preemptible());	/* Switching VCPUs would be a disaster */
>
>	/*
>	 * We just need to update version; if Xen did it behind our back, then
>	 * that's OK with us.  We should never see an update-in-progress because Xen
>	 * will always completely update the pvclock structure before rescheduling the
>	 * VCPU, so version should always be even.  We don't care if Xen updates the
>	 * timing parameters here because we're not in the middle of a clock read.
>	 * Usermode might be in the middle of a read, but all it needs to see is version
>	 * changing to a new even number, even if this add gets preempted by Xen in
>	 * the middle.  There are no cross-PCPU writes going on, so we don't need to
>	 * worry about bus-level atomicity.
>	 */
>	pvclock->version += 2;
>}

No, that won't work as-is, because you can't guarantee the compiler to
translate this to an add-with-memory-operand. While avoiding a bus
lock here indeed seems possible (as long as it is clear that user mode will
never be interested in reading other than the instance of the CPU it's
currently running on), you won't get away without inline assembly.
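
A sketch of what such inline assembly might look like, assuming
GCC-style extended asm on x86; the helper name is hypothetical and the
non-x86 fallback exists only so the fragment compiles elsewhere (it does
not give the single-instruction guarantee).

```c
#include <stdint.h>

/*
 * Bump the version by 2 with one read-modify-write instruction and no
 * LOCK prefix.  A single instruction cannot be interrupted half-way on
 * the local CPU, which is the property wanted here; it is NOT atomic
 * with respect to writers on other CPUs.
 */
static inline void version_add2(volatile uint32_t *version)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("addl $2, %0" : "+m" (*version) : : "cc");
#else
    *version += 2;  /* fallback: compiler may emit load/add/store */
#endif
}
```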

Jan

* Re: RE: rdtsc: correctness vs performance on Xen  (and KVM?)
  2009-09-04  7:19                                                                       ` Jan Beulich
@ 2009-09-04 15:44                                                                         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 61+ messages in thread
From: Jeremy Fitzhardinge @ 2009-09-04 15:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dan Magenheimer, Xen-Devel (E-mail), Keir Fraser, Alan Cox

On 09/04/09 00:19, Jan Beulich wrote:
> struct pvclock_vcpu_time_info has a 'pad0' field afaics.
>   

Ah, yes, I was looking at wall_clock.  We could claim the padding for
"local version", but it would require a 64-bit unpreemptible read, which
is awkward on 32-bit.

>> We only need to worry about Xen preempting a kernel update rather than
>> the other way around.  I think it ends up being very simple:
>>
>> void ctxtsw_update_pvclock(struct pvclock_vcpu_time_info *pvclock)
>> {
>> 	BUG_ON(preemptible());	/* Switching VCPUs would be a disaster */
>>
>> 	/*
>> 	 * We just need to update version; if Xen did it behind our back, then
>> 	 * that's OK with us.  We should never see an update-in-progress because Xen
>> 	 * will always completely update the pvclock structure before rescheduling the
>> 	 * VCPU, so version should always be even.  We don't care if Xen updates the
>> 	 * timing parameters here because we're not in the middle of a clock read.
>> 	 * Usermode might be in the middle of a read, but all it needs to see is version
>> 	 * changing to a new even number, even if this add gets preempted by Xen in
>> 	 * the middle.  There are no cross-PCPU writes going on, so we don't need to
>> 	 * worry about bus-level atomicity.
>> 	 */
>> 	pvclock->version += 2;
>> }
>>     
> No, that won't work as-is, because you can't guarantee the compiler to
> translate this to an add-with-memory-operand. While avoiding a bus
> lock here indeed seems possible (as long as it is clear that user mode will
> never be interested in reading other than the instance of the CPU it's
> currently running on), you won't get away without inline assembly.
>   

I don't think that matters, even if the compiler generates a preemptable
sequence: the end result will always be a changed version number.  Even
if we end up rolling back the version to a smaller number (because Xen
did multiple pvclock updates while it preempted us) nothing will get
confused because nothing observed those intermediate versions.  Xen
itself doesn't care about the version number (it's effectively
write-only).  KVM is the same.
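
The rollback case can be illustrated with a small single-threaded
simulation. This is only a sketch of the argument, with a hypothetical
function name, and it models the preemption by hand rather than
exercising real concurrency.

```c
#include <stdint.h>

/*
 * Model a non-atomic "version += 2" that Xen preempts between the load
 * and the store, performing xen_updates complete pvclock updates (each
 * adds 2) in the gap.  Returns the version left in memory afterwards.
 */
static uint32_t simulate_preempted_bump(uint32_t *version,
                                        unsigned xen_updates)
{
    uint32_t loaded = *version;    /* kernel: load */

    *version += 2 * xen_updates;   /* Xen: updates while we are preempted */
    *version = loaded + 2;         /* kernel: store, rolling Xen's bumps back */
    return *version;
}
```

Whatever xen_updates is, the value left behind is the loaded value plus
2: still even, and different from anything a usermode reader sampled
before the context switch began.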

    J

Thread overview: 61+ messages
2009-08-25 21:54 write_tsc in a PV domain? Dan Magenheimer
2009-08-25 22:28 ` Jeremy Fitzhardinge
2009-08-25 23:09   ` Dan Magenheimer
2009-08-26  6:23     ` Keir Fraser
2009-08-26 15:42       ` Dan Magenheimer
2009-08-26 15:58         ` Keir Fraser
2009-08-26 19:45         ` Jeremy Fitzhardinge
2009-08-26 20:23           ` Dan Magenheimer
2009-08-26 22:30             ` Jeremy Fitzhardinge
2009-08-26 23:10               ` Dan Magenheimer
2009-08-27  8:39                 ` Chris Lalancette
2009-08-27 13:00                   ` Dan Magenheimer
2009-08-27 13:17                     ` Chris Lalancette
2009-08-27  8:48               ` Alan Cox
2009-08-27 19:10                 ` Jeremy Fitzhardinge
2009-08-28  3:29                   ` Dan Magenheimer
2009-08-28  9:49                     ` Alan Cox
2009-08-28 15:16                       ` Dan Magenheimer
2009-08-28 15:30                         ` Alan Cox
2009-08-28 17:49                           ` rdtsc: correctness vs performance on Xen (and KVM?) Dan Magenheimer
2009-08-31 23:52                             ` Dan Magenheimer
2009-09-01  0:22                               ` Jeremy Fitzhardinge
2009-09-01 13:54                                 ` Dan Magenheimer
2009-09-01 14:34                                   ` Keir Fraser
2009-09-01 14:53                                     ` Dan Magenheimer
2009-09-01 15:08                                       ` Keir Fraser
2009-09-01 15:26                                         ` Dan Magenheimer
2009-09-01 15:32                                           ` Jan Beulich
2009-09-01 15:56                                             ` Dan Magenheimer
2009-09-01 16:04                                               ` Jan Beulich
2009-09-01 16:41                                                 ` Dan Magenheimer
2009-09-02  7:05                                                   ` Jan Beulich
2009-09-01 21:25                                                 ` Keir Fraser
2009-09-01 22:08                                                   ` Dan Magenheimer
2009-09-01 22:21                                                     ` Jeremy Fitzhardinge
2009-09-01 22:41                                                       ` Dan Magenheimer
2009-09-01 23:26                                                         ` Jeremy Fitzhardinge
2009-09-02  7:20                                                           ` Keir Fraser
2009-09-02 21:44                                                             ` Jeremy Fitzhardinge
2009-09-02 21:50                                                               ` Keir Fraser
2009-09-02 22:05                                                                 ` Jeremy Fitzhardinge
2009-09-03  8:23                                                                   ` Jan Beulich
2009-09-03 17:29                                                                     ` Jeremy Fitzhardinge
2009-09-04  7:19                                                                       ` Jan Beulich
2009-09-04 15:44                                                                         ` Jeremy Fitzhardinge
2009-09-03 14:22                                                                   ` Dan Magenheimer
2009-09-02  7:16                                                     ` Jan Beulich
2009-09-02  7:01                                                   ` Jan Beulich
2009-09-01 16:06                                               ` Keir Fraser
2009-09-01 16:55                                                 ` Dan Magenheimer
2009-09-01 15:43                                           ` Keir Fraser
2009-08-28 17:49                           ` write_tsc in a PV domain? Dan Magenheimer
2009-08-28 17:02                     ` Jeremy Fitzhardinge
2009-08-28 17:49                       ` Dan Magenheimer
2009-08-28 23:01                         ` Jeremy Fitzhardinge
2009-08-29 17:51                           ` Dan Magenheimer
2009-08-31 18:11                             ` Dan Magenheimer
2009-08-31 19:06                               ` Keir Fraser
2009-08-31 21:06                                 ` Dan Magenheimer
2009-09-01  7:16                                   ` Keir Fraser
2009-08-31 19:18                               ` Jeremy Fitzhardinge
