From: Glauber Costa <glommer@redhat.com>
To: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, avi@redhat.com
Subject: [PATCH 5/5] add documentation about kvmclock
Date: Thu, 15 Apr 2010 14:37:28 -0400
Message-ID: <1271356648-5108-6-git-send-email-glommer@redhat.com>
In-Reply-To: <1271356648-5108-5-git-send-email-glommer@redhat.com>
This patch adds a new file, kvm/kvmclock.txt, describing
the mechanism we use in kvmclock.
Signed-off-by: Glauber Costa <glommer@redhat.com>
---
Documentation/kvm/kvmclock.txt | 138 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 138 insertions(+), 0 deletions(-)
create mode 100644 Documentation/kvm/kvmclock.txt
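
To illustrate sections 2.1 to 2.3 of the file added below, here is a minimal
sketch of how a guest might pick the MSR pair from the feature bits and build
the value it writes to the system time MSR. It is only a sketch under stated
assumptions: the struct and function names are hypothetical, and `features`
is assumed to be the feature word the guest obtained from cpuid leaf
0x40000001 (the sketch does not execute cpuid or wrmsr itself).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in cpuid leaf 0x40000001, as described in section 2.1.
 * The macro names are hypothetical and exist only for this sketch. */
#define KVMCLOCK_FEAT_OLD_MSRS	(1u << 0)
#define KVMCLOCK_FEAT_NEW_MSRS	(1u << 3)

/* Hypothetical container for the MSR pair the guest ends up using. */
struct kvmclock_msrs {
	uint32_t wall_clock;
	uint32_t system_time;
};

/*
 * Choose the MSR pair from the feature word. The new MSRs are preferred
 * when bit 3 is set; the deprecated pair is used only when bit 3 is clear
 * and bit 0 is set. Returns false if kvmclock is not available.
 */
static bool kvmclock_pick_msrs(uint32_t features, struct kvmclock_msrs *out)
{
	if (features & KVMCLOCK_FEAT_NEW_MSRS) {
		out->wall_clock = 0x4b564d00;	/* MSR_KVM_WALL_CLOCK */
		out->system_time = 0x4b564d01;	/* MSR_KVM_SYSTEM_TIME */
		return true;
	}
	if (features & KVMCLOCK_FEAT_OLD_MSRS) {
		out->wall_clock = 0x11;		/* MSR_KVM_WALL_CLOCK_OLD */
		out->system_time = 0x12;	/* MSR_KVM_SYSTEM_TIME_OLD */
		return true;
	}
	return false;
}

/*
 * Build the value written to the system time MSR: the 4-byte aligned
 * address of the time structure or'ed with the enable bit (bit 0).
 * Passing enable = false yields a value that shuts kvmclock down.
 */
static uint64_t kvmclock_system_time_value(uint64_t addr, bool enable)
{
	assert((addr & 3) == 0);	/* the address must be 4-byte aligned */
	return addr | (enable ? 1u : 0u);
}
```

With both bits 0 and 3 set, the sketch selects the new MSR pair, since the
old one is deprecated.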
diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt
new file mode 100644
index 0000000..21008bb
--- /dev/null
+++ b/Documentation/kvm/kvmclock.txt
@@ -0,0 +1,138 @@
+KVM Paravirtual Clocksource driver
+Glauber Costa, Red Hat Inc.
+==================================
+
+1. General Description
+======================
+
+Keeping time in a virtual machine is acknowledged as a hard problem. The most
+basic mode of operation, typical of older guests, assumes a fixed interval
+between timer interrupts. The guest counts the number of interrupts and
+calculates elapsed time from it. This method fails easily in virtual machines,
+since we can't guarantee that virtual interrupts will be delivered on time.
+
+Another possibility is to emulate a modern device such as the HPET, or any
+other we see fit. A modern guest that implements something like the
+clocksource infrastructure can then ask this virtual device for the current
+time whenever it needs to. The problem with this approach is that it very
+frequently bumps the guest out of guest mode and, in some cases, to userspace.
+
+In this context, the best approach is to provide the guest with a
+virtualization-aware (paravirtual) clock device. It then asks the hypervisor
+about the current time, guaranteeing both stable and accurate timekeeping.
+
+2. kvmclock basics
+==================
+
+When supported by the hypervisor, guests can register a memory page
+to contain kvmclock data. This page has to be present in the guest's address
+space throughout its whole life. The hypervisor continues to write to it until
+kvmclock is explicitly disabled or the guest is turned off.
+
+2.1 kvmclock availability
+-------------------------
+
+Guests that want to take advantage of kvmclock should first check its
+availability through cpuid.
+
+kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates
+the presence of kvmclock. Bit 0 indicates that kvmclock is present, but the
+old MSR set must be used. See section 2.3 for details.
+
+2.2 kvmclock functionality
+--------------------------
+
+Two MSRs are provided by the hypervisor, controlling kvmclock operation:
+
+ * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and
+ * MSR_KVM_SYSTEM_TIME, value 0x4b564d01.
+
+The first one is only used in rare situations, like boot time and a
+suspend-resume cycle. Its data is disposable: once the guest has read it, the
+memory may be reused for something else. This is hardly a hot path for
+anything. The hypervisor fills in the address provided through this MSR with
+the following structure:
+
+struct pvclock_wall_clock {
+ u32 version;
+ u32 sec;
+ u32 nsec;
+} __attribute__((__packed__));
+
+Guests should only trust the data to be valid when version hasn't changed
+between the moments before and after sec and nsec are read. Besides not
+changing, it has to be an even number. The hypervisor may write an odd number
+to the version field to indicate that an update is in progress.
+
+MSR_KVM_SYSTEM_TIME, on the other hand, has persistent data, and is
+constantly updated by the hypervisor with time information. The value
+written to this MSR contains two pieces of information: the 4-byte aligned
+address at which the guest expects time data to be present, or'ed with an
+enable bit (bit 0). To shut down kvmclock, the guest just needs to write
+any value that has 0 as its lowest bit.
+
+Time information presented by the hypervisor follows the structure:
+
+struct pvclock_vcpu_time_info {
+ u32 version;
+ u32 pad0;
+ u64 tsc_timestamp;
+ u64 system_time;
+ u32 tsc_to_system_mul;
+ s8 tsc_shift;
+ u8 pad[3];
+} __attribute__((__packed__));
+
+The version field plays the same role as the one in struct
+pvclock_wall_clock. The other fields are:
+
+ a. tsc_timestamp: the guest-visible tsc (result of rdtsc + tsc_offset) of
+ this cpu at the moment we recorded system_time. Note that some time is
+ inevitably spent between system_time and tsc_timestamp measurements.
+ Guests can subtract this quantity from the current value of tsc to obtain
+ a delta which, converted to nanoseconds, is to be added to system_time.
+
+ b. system_time: this is the most recent host time we could be provided with.
+ The host gets it through ktime_get_ts, using whichever clocksource is
+ registered at the moment.
+
+ c. tsc_to_system_mul: the number the tsc delta has to be multiplied by to
+ obtain time in nanoseconds. The hypervisor is free to change this value in
+ the face of events like cpu frequency change, pcpu migration, etc.
+
+ d. tsc_shift: guests must shift the tsc delta by this amount (left if
+ positive, right if negative) before multiplying it by tsc_to_system_mul.
+
+With this information available, the guest calculates the current time as:
+
+ T = system_time + to_nsec(tsc - tsc_timestamp)
+
+2.3 Compatibility MSRs
+----------------------
+
+Guests running on top of older hypervisors may have to use a different set of
+MSRs. This is because kvmclock MSRs were originally exported within a
+reserved range by accident. Guests should check cpuid leaf 0x40000001 for the
+presence of kvmclock. If bit 3 is clear but bit 0 is set, guests can access
+kvmclock functionality through
+
+ * MSR_KVM_WALL_CLOCK_OLD, value 0x11 and
+ * MSR_KVM_SYSTEM_TIME_OLD, value 0x12.
+
+Note, however, that this is deprecated.
+
+3. Migration
+============
+
+Two ioctls are provided to aid the task of migration:
+
+ * KVM_GET_CLOCK and
+ * KVM_SET_CLOCK
+
+Their aim is to control an offset that can be added to system_time, in order
+to guarantee that time stays monotonic across guest migration. The source host
+executes KVM_GET_CLOCK, obtaining the last valid timestamp on that host, while
+the destination sets it with KVM_SET_CLOCK. It is the destination's
+responsibility to never return a time earlier than that.
+
+
--
1.6.2.2
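
As an illustration of section 2.2, a guest-side read of struct
pvclock_vcpu_time_info might look roughly like the sketch below. It is a
userspace approximation under stated assumptions, not the kernel
implementation: the names scale_tsc_delta and kvmclock_read_ns are made up
for this example, the C11 fences stand in for whatever read barriers the
guest really uses, and the tsc value is passed in by the caller instead of
being sampled with rdtsc inside the retry loop. It also assumes that
tsc_to_system_mul is a 32-bit binary fraction, so the product is shifted
right by 32 bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Layout copied from section 2.2, spelled with <stdint.h> types. */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  pad[3];
} __attribute__((__packed__));

/*
 * Convert a tsc delta to nanoseconds: shift by tsc_shift (left if
 * positive, right if negative), multiply by tsc_to_system_mul and drop
 * the low 32 bits of the product (assumption: mul is a 32-bit fraction).
 */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;

	/* (delta * mul) >> 32, split so it fits in 64-bit arithmetic. */
	return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/*
 * Take a consistent snapshot of the structure and compute
 * T = system_time + to_nsec(tsc - tsc_timestamp).
 * The read is retried while version is odd (update in progress) or
 * changes between the first and the last read.
 */
static uint64_t kvmclock_read_ns(const volatile struct pvclock_vcpu_time_info *ti,
				 uint64_t tsc)
{
	uint32_t version, mul;
	uint64_t tsc_timestamp, system_time;
	int8_t shift;

	do {
		version = ti->version;
		atomic_thread_fence(memory_order_acquire);
		tsc_timestamp = ti->tsc_timestamp;
		system_time = ti->system_time;
		mul = ti->tsc_to_system_mul;
		shift = ti->tsc_shift;
		atomic_thread_fence(memory_order_acquire);
	} while ((version & 1) || version != ti->version);

	/* Sketch-only precondition: tsc was sampled after tsc_timestamp. */
	assert(tsc >= tsc_timestamp);
	return system_time + scale_tsc_delta(tsc - tsc_timestamp, mul, shift);
}
```

For example, with tsc_to_system_mul = 0x80000000 and tsc_shift = 1, a delta
of 300 cycles is first shifted to 600 and then halved by the multiplier, so
it scales to 300 ns.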