linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases
@ 2013-06-20 19:16 David Vrabel
  2013-06-20 19:16 ` [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call David Vrabel
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: David Vrabel @ 2013-06-20 19:16 UTC (permalink / raw)
  To: xen-devel
  Cc: David Vrabel, Konrad Rzeszutek Wilk, linux-kernel, John Stultz,
	Thomas Gleixner

Xen guests use the Xen wallclock as their persistent clock.  This is a
software only clock in the hypervisor that is used by guests instead
of a real hardware RTC.

The kernel has limited support for updating the persistent clock or
RTC when NTP is synced.  This has the following limitations:

* The persistent clock is not updated on step changes.  This leaves a
  window where it will be incorrect (while NTP resyncs).

* Xen guests use the Xen wallclock as their persistent clock.  dom0
  maintains this clock so it is persistent for domUs but not dom0
  itself.

These limitations mean that guests started before NTP is synchronized
will start with an incorrect wallclock time and the hardware RTC will
not be updated (as on bare metal).

These series fixes the above limitations and depends on "x86: increase
precision of x86_platform.get/set_wallclock()" which was previously
posted.

Changes since v4:

Dropped the change to disable non-boot CPUs during suspend on Xen as
migration downtime was too poor.  Instead, provide
hrtimers_late_resume() for use by Xen's resume code to replace the
call of clock_was_set().  Fix two unused variable warnings.

Changes since v3:

Add a new clock_was_set notifier chain. Use this instead of direct
calls to clock_was_set() from the timekeeping code.  Use this notifier
and a new timer to synchronize the Xen wallclock.

Changes since v2:

Don't peek at the timekeeper internals (use __current_kernel_time()
instead).  Use the native set_wallclock hook in dom0.

Changes since v1:

Reworked to use the pvclock_gtod notifier to sync the wallclock (this
looked similar to what a KVM host does).  update_persistent_clock()
will now only update the CMOS RTC.

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call
  2013-06-20 19:16 [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases David Vrabel
@ 2013-06-20 19:16 ` David Vrabel
  2013-06-21  7:53   ` Thomas Gleixner
  2013-06-20 19:16 ` [PATCH 2/4] time: add a notifier chain for when the system time is stepped David Vrabel
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-06-20 19:16 UTC (permalink / raw)
  To: xen-devel
  Cc: David Vrabel, Konrad Rzeszutek Wilk, linux-kernel, John Stultz,
	Thomas Gleixner

From: David Vrabel <david.vrabel@citrix.com>

Xen suspends (and resumes) without disabling non-boot CPUs as doing so
adds considerable delay to live migrations.  A 4 VCPU guest had more
than 200 ms of additional downtime if disable_nonboot_cpus() was
called prior to suspending.

As a consequence, only high resolution timers on the current CPU are
retriggered when resuming.  The Xen resume path worked around this
with a call to clock_was_set() to retrigger timers on all the CPUs.

A subsequent change will make clock_was_set() internal to hrtimers so
add an new call (hrtimers_late_resume()) to do the same job.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/xen/manage.c    |    8 ++++++--
 include/linux/hrtimer.h |    1 +
 kernel/hrtimer.c        |    9 +++++++++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 412b96c..75bc2d5 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -166,8 +166,12 @@ out_resume:
 
 	dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
 
-	/* Make sure timer events get retriggered on all CPUs */
-	clock_was_set();
+	/*
+	 * syscore_resume() ends up calling hrtimer_resume() but this
+	 * only retriggers timer events on the current CPU.  We need
+	 * to retrigger the events on all the other CPUS.
+	 */
+	hrtimers_late_resume();
 
 out_thaw:
 #ifdef CONFIG_PREEMPT
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index d19a5c2..13df0fa 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -323,6 +323,7 @@ extern void timerfd_clock_was_set(void);
 static inline void timerfd_clock_was_set(void) { }
 #endif
 extern void hrtimers_resume(void);
+extern void hrtimers_late_resume(void);
 
 extern ktime_t ktime_get(void);
 extern ktime_t ktime_get_real(void);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index fd4b13b..34384b4 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -784,6 +784,15 @@ void hrtimers_resume(void)
 	timerfd_clock_was_set();
 }
 
+/*
+ * If non-boot CPUs were online during resume, we need to retrigger
+ * the events for all the non-boot CPUs.
+ */
+void hrtimers_late_resume(void)
+{
+	clock_was_set();
+}
+
 static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
 {
 #ifdef CONFIG_TIMER_STATS
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-20 19:16 [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases David Vrabel
  2013-06-20 19:16 ` [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call David Vrabel
@ 2013-06-20 19:16 ` David Vrabel
  2013-06-21  7:57   ` Thomas Gleixner
  2013-06-20 19:16 ` [PATCH 3/4] x86/xen: sync the wallclock " David Vrabel
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-06-20 19:16 UTC (permalink / raw)
  To: xen-devel
  Cc: David Vrabel, Konrad Rzeszutek Wilk, linux-kernel, John Stultz,
	Thomas Gleixner

From: David Vrabel <david.vrabel@citrix.com>

The high resolution timer code gets notified of step changes to the
system time with clock_was_set() or clock_was_set_delayed() calls.  If
other parts of the kernel require similar notification there is no
clear place to hook into.

Add a clock_was_set atomic notifier chain
(clock_was_set_notifier_list) and call this in place of
clock_was_set().  If the timekeeping locks are held, the calls are
deferred to a new tasklet.

The hrtimer code adds a notifier block to this chain and uses it to
call (the now internal) clock_was_set().  Since the timekeeping code
does not call the chain from the timer irq clock_was_set_delayed() and
associated code can be removed.

For the delayed case, clock_was_set() used to be called at the
beginning of the hrtimer softirq and now it is called from a tasklet.
The tasklet softirq will be called before the hrtimer one so this
should give the same behaviour.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/hrtimer.h   |    7 -------
 include/linux/time.h      |    7 +++++++
 kernel/hrtimer.c          |   34 +++++++++++++---------------------
 kernel/time/timekeeping.c |   39 ++++++++++++++++++++++++++++++---------
 4 files changed, 50 insertions(+), 37 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 13df0fa..45e30f6 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -166,7 +166,6 @@ enum  hrtimer_base_type {
  * @lock:		lock protecting the base and associated clock bases
  *			and timers
  * @active_bases:	Bitfield to mark bases with active timers
- * @clock_was_set:	Indicates that clock was set from irq context.
  * @expires_next:	absolute time of the next event which was scheduled
  *			via clock_set_next_event()
  * @hres_active:	State of high resolution mode
@@ -180,7 +179,6 @@ enum  hrtimer_base_type {
 struct hrtimer_cpu_base {
 	raw_spinlock_t			lock;
 	unsigned int			active_bases;
-	unsigned int			clock_was_set;
 #ifdef CONFIG_HIGH_RES_TIMERS
 	ktime_t				expires_next;
 	int				hres_active;
@@ -289,8 +287,6 @@ extern void hrtimer_peek_ahead_timers(void);
 # define MONOTONIC_RES_NSEC	HIGH_RES_NSEC
 # define KTIME_MONOTONIC_RES	KTIME_HIGH_RES
 
-extern void clock_was_set_delayed(void);
-
 #else
 
 # define MONOTONIC_RES_NSEC	LOW_RES_NSEC
@@ -312,11 +308,8 @@ static inline int hrtimer_is_hres_active(struct hrtimer *timer)
 	return 0;
 }
 
-static inline void clock_was_set_delayed(void) { }
-
 #endif
 
-extern void clock_was_set(void);
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
 #else
diff --git a/include/linux/time.h b/include/linux/time.h
index d5d229b..a0c08a7 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -184,6 +184,13 @@ extern void timekeeping_clocktai(struct timespec *ts);
 struct tms;
 extern void do_sys_times(struct tms *);
 
+struct notifier_block;
+
+/*
+ * Notifier chain called when system time is stepped.
+ */
+extern int register_clock_was_set_notifier(struct notifier_block *nb);
+
 /*
  * Similar to the struct tm in userspace <time.h>, but it needs to be here so
  * that the kernel source is self contained.
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 34384b4..a853f9b 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -721,19 +721,6 @@ static int hrtimer_switch_to_hres(void)
 	return 1;
 }
 
-/*
- * Called from timekeeping code to reprogramm the hrtimer interrupt
- * device. If called from the timer interrupt context we defer it to
- * softirq context.
- */
-void clock_was_set_delayed(void)
-{
-	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
-
-	cpu_base->clock_was_set = 1;
-	__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
-}
-
 #else
 
 static inline int hrtimer_hres_active(void) { return 0; }
@@ -762,7 +749,7 @@ static inline void retrigger_next_event(void *arg) { }
  * resolution timer interrupts. On UP we just disable interrupts and
  * call the high resolution interrupt code.
  */
-void clock_was_set(void)
+static void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
 	/* Retrigger the CPU local events everywhere */
@@ -1441,13 +1428,6 @@ void hrtimer_peek_ahead_timers(void)
 
 static void run_hrtimer_softirq(struct softirq_action *h)
 {
-	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
-
-	if (cpu_base->clock_was_set) {
-		cpu_base->clock_was_set = 0;
-		clock_was_set();
-	}
-
 	hrtimer_peek_ahead_timers();
 }
 
@@ -1785,11 +1765,23 @@ static struct notifier_block __cpuinitdata hrtimers_nb = {
 	.notifier_call = hrtimer_cpu_notify,
 };
 
+static int hrtimer_clock_was_set_notify(struct notifier_block *self,
+					unsigned long action, void *data)
+{
+	clock_was_set();
+	return NOTIFY_OK;
+}
+
+static struct notifier_block hrtimers_clock_was_set_nb = {
+	.notifier_call = hrtimer_clock_was_set_notify,
+};
+
 void __init hrtimers_init(void)
 {
 	hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
 			  (void *)(long)smp_processor_id());
 	register_cpu_notifier(&hrtimers_nb);
+	register_clock_was_set_notifier(&hrtimers_clock_was_set_nb);
 #ifdef CONFIG_HIGH_RES_TIMERS
 	open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq);
 #endif
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index baeeb5c..96c5c8e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -198,6 +198,30 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)
 	return nsec + get_arch_timeoffset();
 }
 
+static ATOMIC_NOTIFIER_HEAD(clock_was_set_notifier_list);
+
+int register_clock_was_set_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&clock_was_set_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_clock_was_set_notifier);
+
+static void timekeeping_clock_was_set(void)
+{
+	atomic_notifier_call_chain(&clock_was_set_notifier_list, 0, NULL);
+}
+
+static void timekeeping_clock_was_set_task(unsigned long d)
+{
+	timekeeping_clock_was_set();
+}
+DECLARE_TASKLET(clock_was_set_tasklet, timekeeping_clock_was_set_task, 0);
+
+static void timekeeping_clock_was_set_delayed(void)
+{
+	tasklet_schedule(&clock_was_set_tasklet);
+}
+
 static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
 
 static void update_pvclock_gtod(struct timekeeper *tk)
@@ -513,8 +537,7 @@ int do_settimeofday(const struct timespec *tv)
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	timekeeping_clock_was_set();
 
 	return 0;
 }
@@ -557,8 +580,7 @@ error: /* even if we error out, we forwarded the time, so call update */
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	timekeeping_clock_was_set();
 
 	return ret;
 }
@@ -607,7 +629,7 @@ void timekeeping_set_tai_offset(s32 tai_offset)
 	__timekeeping_set_tai_offset(tk, tai_offset);
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
-	clock_was_set();
+	timekeeping_clock_was_set();
 }
 
 /**
@@ -877,8 +899,7 @@ void timekeeping_inject_sleeptime(struct timespec *delta)
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-	/* signal hrtimers about time change */
-	clock_was_set();
+	timekeeping_clock_was_set();
 }
 
 /**
@@ -1260,7 +1281,7 @@ static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
 
 			__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);
 
-			clock_was_set_delayed();
+			timekeeping_clock_was_set_delayed();
 		}
 	}
 }
@@ -1677,7 +1698,7 @@ int do_adjtimex(struct timex *txc)
 
 	if (tai != orig_tai) {
 		__timekeeping_set_tai_offset(tk, tai);
-		clock_was_set_delayed();
+		timekeeping_clock_was_set_delayed();
 	}
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/4] x86/xen: sync the wallclock when the system time is stepped
  2013-06-20 19:16 [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases David Vrabel
  2013-06-20 19:16 ` [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call David Vrabel
  2013-06-20 19:16 ` [PATCH 2/4] time: add a notifier chain for when the system time is stepped David Vrabel
@ 2013-06-20 19:16 ` David Vrabel
  2013-06-20 19:16 ` [PATCH 4/4] x86/xen: sync the CMOS RTC as well as the Xen wallclock David Vrabel
  2013-06-20 20:03 ` [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases John Stultz
  4 siblings, 0 replies; 23+ messages in thread
From: David Vrabel @ 2013-06-20 19:16 UTC (permalink / raw)
  To: xen-devel
  Cc: David Vrabel, Konrad Rzeszutek Wilk, linux-kernel, John Stultz,
	Thomas Gleixner

From: David Vrabel <david.vrabel@citrix.com>

The Xen wallclock is a software only clock within the Xen hypervisor
that is used by: a) PV guests as the equivalent of a hardware RTC; and
b) the hypervisor as the clock source for the emulated RTC provided to
HVM guests.

Currently the Xen wallclock is only updated every 11 minutes if NTP is
synchronized to its clock source.  If a guest is started before NTP is
synchronized it may see an incorrect wallclock time.

Use the clock_was_set notifier chain to receive a notification when
the system time is stepped and update the wallclock to match the
current system time.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/xen/time.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a1947ac..ad077ca 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -211,6 +211,25 @@ static int xen_set_wallclock(const struct timespec *now)
 
 	return HYPERVISOR_dom0_op(&op);
 }
+ 
+static int xen_clock_was_set_notify(struct notifier_block *nb, unsigned long unused,
+				    void *priv)
+{
+       struct timespec now;
+
+       /*
+        * Set the Xen wallclock from Linux system time.
+        */
+       now = current_kernel_time();
+       xen_set_wallclock(&now);
+
+       return NOTIFY_OK;
+}
+
+static struct notifier_block xen_clock_was_set_notifier = {
+	.notifier_call = xen_clock_was_set_notify,
+};
+
 
 static struct clocksource xen_clocksource __read_mostly = {
 	.name = "xen",
@@ -473,6 +492,9 @@ static void __init xen_time_init(void)
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
 	xen_setup_cpu_clockevents();
+
+	if (xen_initial_domain())
+		register_clock_was_set_notifier(&xen_clock_was_set_notifier);
 }
 
 void __init xen_init_time_ops(void)
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/4] x86/xen: sync the CMOS RTC as well as the Xen wallclock
  2013-06-20 19:16 [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases David Vrabel
                   ` (2 preceding siblings ...)
  2013-06-20 19:16 ` [PATCH 3/4] x86/xen: sync the wallclock " David Vrabel
@ 2013-06-20 19:16 ` David Vrabel
  2013-06-20 20:03 ` [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases John Stultz
  4 siblings, 0 replies; 23+ messages in thread
From: David Vrabel @ 2013-06-20 19:16 UTC (permalink / raw)
  To: xen-devel
  Cc: David Vrabel, Konrad Rzeszutek Wilk, linux-kernel, John Stultz,
	Thomas Gleixner

From: David Vrabel <david.vrabel@citrix.com>

Adjustments to Xen's persistent clock via update_persistent_clock()
don't actually persist, as the Xen wallclock is a software only clock
and modifications to it do not modify the underlying CMOS RTC.

The x86_platform.set_wallclock hook can be used to keep the hardware
RTC synchronized (as on bare metal).  Because the Xen wallclock is now
kept synchronized by the clock_was_set notifier and a new timer,
xen_set_wallclock() need not do this and dom0 can simply use the
native implementation.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/xen/time.c |   48 +++++++++++++++++++++++++++++++-----------------
 1 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index ad077ca..f3d09eb 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -198,38 +198,50 @@ static void xen_get_wallclock(struct timespec *now)
 
 static int xen_set_wallclock(const struct timespec *now)
 {
+	return -1;
+}
+
+static void xen_wallclock_timer_func(unsigned long d);
+static DEFINE_TIMER(xen_wallclock_timer, xen_wallclock_timer_func, 0, 0);
+
+static void xen_sync_wallclock(void)
+{
+	struct timespec now;
 	struct xen_platform_op op;
 
-	/* do nothing for domU */
-	if (!xen_initial_domain())
-		return -1;
+	now = current_kernel_time();
 
 	op.cmd = XENPF_settime;
-	op.u.settime.secs = now->tv_sec;
-	op.u.settime.nsecs = now->tv_nsec;
+	op.u.settime.secs = now.tv_sec;
+	op.u.settime.nsecs = now.tv_nsec;
 	op.u.settime.system_time = xen_clocksource_read();
 
-	return HYPERVISOR_dom0_op(&op);
+	(void)HYPERVISOR_dom0_op(&op);
+
+	/*
+	 * Use a timer to correct for any drift in the Xen
+	 * wallclock.
+	 *
+	 * 11 minutes is the same period as sync_cmos_clock().
+	 */
+	mod_timer(&xen_wallclock_timer, round_jiffies(jiffies + 11*60*HZ));
 }
- 
+
 static int xen_clock_was_set_notify(struct notifier_block *nb, unsigned long unused,
 				    void *priv)
 {
-       struct timespec now;
-
-       /*
-        * Set the Xen wallclock from Linux system time.
-        */
-       now = current_kernel_time();
-       xen_set_wallclock(&now);
-
-       return NOTIFY_OK;
+	xen_sync_wallclock();
+	return NOTIFY_OK;
 }
 
 static struct notifier_block xen_clock_was_set_notifier = {
 	.notifier_call = xen_clock_was_set_notify,
 };
 
+static void xen_wallclock_timer_func(unsigned long d)
+{
+	xen_sync_wallclock();
+}
 
 static struct clocksource xen_clocksource __read_mostly = {
 	.name = "xen",
@@ -507,7 +519,9 @@ void __init xen_init_time_ops(void)
 
 	x86_platform.calibrate_tsc = xen_tsc_khz;
 	x86_platform.get_wallclock = xen_get_wallclock;
-	x86_platform.set_wallclock = xen_set_wallclock;
+	/* Dom0 uses the native method to set the hardware RTC. */
+	if (!xen_initial_domain())
+		x86_platform.set_wallclock = xen_set_wallclock;
 }
 
 #ifdef CONFIG_XEN_PVHVM
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases
  2013-06-20 19:16 [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases David Vrabel
                   ` (3 preceding siblings ...)
  2013-06-20 19:16 ` [PATCH 4/4] x86/xen: sync the CMOS RTC as well as the Xen wallclock David Vrabel
@ 2013-06-20 20:03 ` John Stultz
  2013-06-21 18:31   ` Konrad Rzeszutek Wilk
  4 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2013-06-20 20:03 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel, Thomas Gleixner

On 06/20/2013 12:16 PM, David Vrabel wrote:
> Xen guests use the Xen wallclock as their persistent clock.  This is a
> software only clock in the hypervisor that is used by guests instead
> of a real hardware RTC.
>
> The kernel has limited support for updating the persistent clock or
> RTC when NTP is synced.  This has the following limitations:
>
> * The persistent clock is not updated on step changes.  This leaves a
>    window where it will be incorrect (while NTP resyncs).
>
> * Xen guests use the Xen wallclock as their persistent clock.  dom0
>    maintains this clock so it is persistent for domUs but not dom0
>    itself.
>
> These limitations mean that guests started before NTP is synchronized
> will start with an incorrect wallclock time and the hardware RTC will
> not be updated (as on bare metal).
>
> These series fixes the above limitations and depends on "x86: increase
> precision of x86_platform.get/set_wallclock()" which was previously
> posted.
>
> Changes since v4:
>
> Dropped the change to disable non-boot CPUs during suspend on Xen as
> migration downtime was too poor.  Instead, provide
> hrtimers_late_resume() for use by Xen's resume code to replace the
> call of clock_was_set().  Fix two unused variable warnings.

Ok, I've got these 4 in my pending stack. As long as Thomas doesn't 
object to the first two, and it doesn't run into any trouble in testing, 
I'll send them along for 3.12. (Acks from Xen maintainers would be nice 
for the last two as well).

Thanks for all the effort through all the revisions here!

thanks
-john


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call
  2013-06-20 19:16 ` [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call David Vrabel
@ 2013-06-21  7:53   ` Thomas Gleixner
  2013-06-21 12:32     ` David Vrabel
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2013-06-21  7:53 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On Thu, 20 Jun 2013, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
> 
> Xen suspends (and resumes) without disabling non-boot CPUs as doing so
> adds considerable delay to live migrations.  A 4 VCPU guest had more
> than 200 ms of additional downtime if disable_nonboot_cpus() was
> called prior to suspending.
> 
> As a consequence, only high resolution timers on the current CPU are
> retriggered when resuming.  The Xen resume path worked around this
> with a call to clock_was_set() to retrigger timers on all the CPUs.
> 
> A subsequent change will make clock_was_set() internal to hrtimers so
> add an new call (hrtimers_late_resume()) to do the same job.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  drivers/xen/manage.c    |    8 ++++++--
>  include/linux/hrtimer.h |    1 +
>  kernel/hrtimer.c        |    9 +++++++++
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
> index 412b96c..75bc2d5 100644
> --- a/drivers/xen/manage.c
> +++ b/drivers/xen/manage.c
> @@ -166,8 +166,12 @@ out_resume:
>  
>  	dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
>  
> -	/* Make sure timer events get retriggered on all CPUs */
> -	clock_was_set();
> +	/*
> +	 * syscore_resume() ends up calling hrtimer_resume() but this
> +	 * only retriggers timer events on the current CPU.  We need
> +	 * to retrigger the events on all the other CPUS.
> +	 */
> +	hrtimers_late_resume();

This is the completely wrong approach. If an architecture does not
shut down the non boot cpus on suspend, then this wants to be handled
in the core code and not in some random arch specific driver.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-20 19:16 ` [PATCH 2/4] time: add a notifier chain for when the system time is stepped David Vrabel
@ 2013-06-21  7:57   ` Thomas Gleixner
  2013-06-21 12:41     ` David Vrabel
  2013-06-21 16:22     ` John Stultz
  0 siblings, 2 replies; 23+ messages in thread
From: Thomas Gleixner @ 2013-06-21  7:57 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On Thu, 20 Jun 2013, David Vrabel wrote:

> From: David Vrabel <david.vrabel@citrix.com>
> 
> The high resolution timer code gets notified of step changes to the
> system time with clock_was_set() or clock_was_set_delayed() calls.  If
> other parts of the kernel require similar notification there is no
> clear place to hook into.

You fail to explain why any other part of the kernel requires a
notification.
 
We went great length to confine timekeeping inside the core code and
now you add random notifiers along with totally ugly tasklet
constructs. What for?

Without a reasonable explanation of the problem you are trying to
solve this is going nowhere near the tree.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call
  2013-06-21  7:53   ` Thomas Gleixner
@ 2013-06-21 12:32     ` David Vrabel
  2013-06-21 14:32       ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-06-21 12:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On 21/06/13 08:53, Thomas Gleixner wrote:
> On Thu, 20 Jun 2013, David Vrabel wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> Xen suspends (and resumes) without disabling non-boot CPUs as doing so
>> adds considerable delay to live migrations.  A 4 VCPU guest had more
>> than 200 ms of additional downtime if disable_nonboot_cpus() was
>> called prior to suspending.
>>
>> As a consequence, only high resolution timers on the current CPU are
>> retriggered when resuming.  The Xen resume path worked around this
>> with a call to clock_was_set() to retrigger timers on all the CPUs.
>>
>> A subsequent change will make clock_was_set() internal to hrtimers so
>> add an new call (hrtimers_late_resume()) to do the same job.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> ---
>>  drivers/xen/manage.c    |    8 ++++++--
>>  include/linux/hrtimer.h |    1 +
>>  kernel/hrtimer.c        |    9 +++++++++
>>  3 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>> index 412b96c..75bc2d5 100644
>> --- a/drivers/xen/manage.c
>> +++ b/drivers/xen/manage.c
>> @@ -166,8 +166,12 @@ out_resume:
>>  
>>  	dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
>>  
>> -	/* Make sure timer events get retriggered on all CPUs */
>> -	clock_was_set();
>> +	/*
>> +	 * syscore_resume() ends up calling hrtimer_resume() but this
>> +	 * only retriggers timer events on the current CPU.  We need
>> +	 * to retrigger the events on all the other CPUS.
>> +	 */
>> +	hrtimers_late_resume();
> 
> This is the completely wrong approach. If an architecture does not
> shut down the non boot cpus on suspend, then this wants to be handled
> in the core code and not in some random arch specific driver.

Agreed.  Does the following meet your requirements?

David

8<-------------------------------------------------------
hrtimers: support resuming with two or more CPUs online (but stopped)

hrtimers_resume() only reprograms the timers for the current CPU as it
assumes that all other CPUs are offline at this point in the resume
process.  If other CPUs are online then their timers will not be
corrected and they may fire at the wrong time.

When running as a Xen guest, this assumption is not true.  Non-boot
CPUs are only stopped with IRQs disabled instead of offlining them.
This is a performance optimization as disabling the CPUs would add an
unacceptable amount of additional downtime during a live migration (>
200 ms for a 4 VCPU guest).

hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
as the other CPUs will be stopped with IRQs disabled.  Instead, defer
the call to the next softirq.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/xen/manage.c |    3 ---
 kernel/hrtimer.c     |   10 +++++++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 412b96c..421da85 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -166,9 +166,6 @@ out_resume:
 
 	dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
 
-	/* Make sure timer events get retriggered on all CPUs */
-	clock_was_set();
-
 out_thaw:
 #ifdef CONFIG_PREEMPT
 	thaw_processes();
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index fd4b13b..74aa7c5 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -773,15 +773,19 @@ void clock_was_set(void)
 
 /*
  * During resume we might have to reprogram the high resolution timer
- * interrupt (on the local CPU):
+ * interrupt on all online CPUs.  However, all other CPUs will be
+ * stopped with IRQs interrupts disabled at this point so defer this
+ * to the softirq.
  */
 void hrtimers_resume(void)
 {
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
 	WARN_ONCE(!irqs_disabled(),
 		  KERN_INFO "hrtimers_resume() called with IRQs enabled!");
 
-	retrigger_next_event(NULL);
-	timerfd_clock_was_set();
+	cpu_base->clock_was_set = 1;
+	__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
 }
 
 static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-21  7:57   ` Thomas Gleixner
@ 2013-06-21 12:41     ` David Vrabel
  2013-06-21 23:06       ` Thomas Gleixner
  2013-06-21 16:22     ` John Stultz
  1 sibling, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-06-21 12:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On 21/06/13 08:57, Thomas Gleixner wrote:
> On Thu, 20 Jun 2013, David Vrabel wrote:
> 
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> The high resolution timer code gets notified of step changes to the
>> system time with clock_was_set() or clock_was_set_delayed() calls.  If
>> other parts of the kernel require similar notification there is no
>> clear place to hook into.
> 
> You fail to explain why any other part of the kernel requires a
> notification.

This is needed by patch 3 in this series.

"The Xen wallclock is a software only clock within the Xen hypervisor
that is used by: a) PV guests as the equivalent of a hardware RTC; and
b) the hypervisor as the clock source for the emulated RTC provided to
HVM guests.

Currently the Xen wallclock is only updated every 11 minutes if NTP is
synchronized to its clock source.  If a guest is started before NTP is
synchronized it may see an incorrect wallclock time.

Use the clock_was_set notifier chain to receive a notification when
the system time is stepped and update the wallclock to match the
current system time."

> We went great length to confine timekeeping inside the core code and
> now you add random notifiers along with totally ugly tasklet
> constructs.

I'm not sure I understand your objection to the use of a tasklet.  Using
the hrtimer softirq for something that is no longer hrtimer-specific did
not seem like the correct thing to do.

David

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call
  2013-06-21 12:32     ` David Vrabel
@ 2013-06-21 14:32       ` Thomas Gleixner
  2013-06-21 17:30         ` David Vrabel
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2013-06-21 14:32 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On Fri, 21 Jun 2013, David Vrabel wrote:
> On 21/06/13 08:53, Thomas Gleixner wrote:
> > This is the completely wrong approach. If an architecture does not
> > shut down the non boot cpus on suspend, then this wants to be handled
> > in the core code and not in some random arch specific driver.
> 
> Agreed.  Does the following meet your requirements?

Indeed. That's looks way more reasonable. Though...

> hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
> as the other CPUs will be stopped with IRQs disabled.  Instead, defer
> the call to the next softirq.

that's just working by chance and not by design as there is no
guarantee that the next interrupt, which invokes the softirq, will
arrive in time. So you want to make sure that an interrupt arrives.

Invoking retrigger_next_event(NULL) from hrtimer_resume() should do
the trick.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-21  7:57   ` Thomas Gleixner
  2013-06-21 12:41     ` David Vrabel
@ 2013-06-21 16:22     ` John Stultz
  1 sibling, 0 replies; 23+ messages in thread
From: John Stultz @ 2013-06-21 16:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Vrabel, xen-devel, Konrad Rzeszutek Wilk, LKML,
	Ingo Molnar, Peter Zijlstra

On 06/21/2013 12:57 AM, Thomas Gleixner wrote:
> On Thu, 20 Jun 2013, David Vrabel wrote:
>
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> The high resolution timer code gets notified of step changes to the
>> system time with clock_was_set() or clock_was_set_delayed() calls.  If
>> other parts of the kernel require similar notification there is no
>> clear place to hook into.
> You fail to explain why any other part of the kernel requires a
> notification.
>   
> We went great length to confine timekeeping inside the core code and
> now you add random notifiers along with totally ugly tasklet
> constructs. What for?
>
> Without a reasonable explanation of the problem you are trying to
> solve this is going nowhere near the tree.

It took awhile for me to get why it was needed as well.

My understanding is that on Xen, they want the hypervisor's virtual RTC 
needs to be aligned with Dom0s system time (so that DomN guests boot 
with the correct time). Thus changes to the system time need to cause 
Dom0 to update the virtual RTC.

Its not terribly unlike the timerfd notification we do to userspace, but 
instead is done for the Dom0 Xen management code.

I do agree we need to keep users of the notification on a short leash 
(hopefully keeping the interface behind some sort of internal.h header 
file) so its easy to see when folks start trying to use it.

thanks
-john



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call
  2013-06-21 14:32       ` Thomas Gleixner
@ 2013-06-21 17:30         ` David Vrabel
  2013-06-21 21:24           ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-06-21 17:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On 21/06/13 15:32, Thomas Gleixner wrote:
> On Fri, 21 Jun 2013, David Vrabel wrote:
>> On 21/06/13 08:53, Thomas Gleixner wrote:
>>> This is the completely wrong approach. If an architecture does not
>>> shut down the non boot cpus on suspend, then this wants to be handled
>>> in the core code and not in some random arch specific driver.
>>
>> Agreed.  Does the following meet your requirements?
> 
> Indeed. That's looks way more reasonable. Though...
> 
>> hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
>> as the other CPUs will be stopped with IRQs disabled.  Instead, defer
>> the call to the next softirq.
> 
> that's just working by chance and not by design as there is no
> guarantee that the next interrupt, which invokes the softirq, will
> arrive in time. So you want to make sure that an interrupt arrives.

That is a good point.

> Invoking retrigger_next_event(NULL) from hrtimer_resume() should do
> the trick.

But I'm not sure that would be sufficient, although I may not be
understanding how the hrtimers work or are used.

There may be timers on other CPUs that are supposed to fire earlier than
those on the current CPU.  There may even be no timers scheduled on the
current CPU.

I think there needs to be something like:

hrtimers_resume()
{
    retrigger_next_event(NULL);

    /* Timers on other CPUs might expire earlier.
       Program an earlier event and use this to kick the softirq to
       correctly reprogram the events on the other CPUs. */
    expires_next = cpu_base->expires_next;
    for_each_online_cpu(cpu) {
        expires = hrtimers_next_event_on_cpu(cpu)
        if (expires < expires_next)
            expires_next = expires;
   }
   tick_program_event(expires_next, 1);
   cpu_base->clock_was_set = 1;
   __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
}

However, since hrtimers require the use of a one-shot ticker and when
one-shot timers are resumed they are armed to fire immediately (see
tick_resume_oneshot()) this interrupt is sufficient to kick the require
softirq.

So, as proposed before:

hrtimers_resume()
{
   /* This CPU's tick is armed to fire immediately by
      tick_oneshot_resume(). Just need raise a softirq to program
      the timers on all CPUs. */
   cpu_base->clock_was_set = 1;
   __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
}

Do you agree or disagree?

David

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases
  2013-06-20 20:03 ` [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases John Stultz
@ 2013-06-21 18:31   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-06-21 18:31 UTC (permalink / raw)
  To: John Stultz; +Cc: David Vrabel, xen-devel, linux-kernel, Thomas Gleixner

On Thu, Jun 20, 2013 at 01:03:34PM -0700, John Stultz wrote:
> On 06/20/2013 12:16 PM, David Vrabel wrote:
> >Xen guests use the Xen wallclock as their persistent clock.  This is a
> >software only clock in the hypervisor that is used by guests instead
> >of a real hardware RTC.
> >
> >The kernel has limited support for updating the persistent clock or
> >RTC when NTP is synced.  This has the following limitations:
> >
> >* The persistent clock is not updated on step changes.  This leaves a
> >   window where it will be incorrect (while NTP resyncs).
> >
> >* Xen guests use the Xen wallclock as their persistent clock.  dom0
> >   maintains this clock so it is persistent for domUs but not dom0
> >   itself.
> >
> >These limitations mean that guests started before NTP is synchronized
> >will start with an incorrect wallclock time and the hardware RTC will
> >not be updated (as on bare metal).
> >
> >These series fixes the above limitations and depends on "x86: increase
> >precision of x86_platform.get/set_wallclock()" which was previously
> >posted.
> >
> >Changes since v4:
> >
> >Dropped the change to disable non-boot CPUs during suspend on Xen as
> >migration downtime was too poor.  Instead, provide
> >hrtimers_late_resume() for use by Xen's resume code to replace the
> >call of clock_was_set().  Fix two unused variable warnings.
> 
> Ok, I've got these 4 in my pending stack. As long as Thomas doesn't
> object to the first two, and it doesn't run into any trouble in
> testing, I'll send them along for 3.12. (Acks from Xen maintainers
> would be nice for the last two as well).

Please consider them Acked-by.

Thanks!
> 
> Thanks for all the effort through all the revisions here!
> 
> thanks
> -john
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call
  2013-06-21 17:30         ` David Vrabel
@ 2013-06-21 21:24           ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2013-06-21 21:24 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On Fri, 21 Jun 2013, David Vrabel wrote:
> On 21/06/13 15:32, Thomas Gleixner wrote:
> However, since hrtimers require the use of a one-shot ticker and when
> one-shot timers are resumed they are armed to fire immediately (see
> tick_resume_oneshot()) this interrupt is sufficient to kick the require
> softirq.
> 
> So, as proposed before:
> 
> hrtimers_resume()
> {
>    /* This CPU's tick is armed to fire immediately by
>       tick_oneshot_resume(). Just need raise a softirq to program
>       the timers on all CPUs. */
>    cpu_base->clock_was_set = 1;
>    __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
> }
> 
> Do you agree or disagree?

Fair enough. I did not think about that. With the comment in place it
is clear. It might be a bit more elaborate for the casual reader.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-21 12:41     ` David Vrabel
@ 2013-06-21 23:06       ` Thomas Gleixner
  2013-06-24 10:51         ` David Vrabel
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2013-06-21 23:06 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On Fri, 21 Jun 2013, David Vrabel wrote:
> On 21/06/13 08:57, Thomas Gleixner wrote:
> > On Thu, 20 Jun 2013, David Vrabel wrote:
> > 
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> The high resolution timer code gets notified of step changes to the
> >> system time with clock_was_set() or clock_was_set_delayed() calls.  If
> >> other parts of the kernel require similar notification there is no
> >> clear place to hook into.
> > 
> > You fail to explain why any other part of the kernel requires a
> > notification.
> 
> This is needed by patch 3 in this series.
> 
> "The Xen wallclock is a software only clock within the Xen hypervisor
> that is used by: a) PV guests as the equivalent of a hardware RTC; and
> b) the hypervisor as the clock source for the emulated RTC provided to
> HVM guests.
> 
> Currently the Xen wallclock is only updated every 11 minutes if NTP is
> synchronized to its clock source.  If a guest is started before NTP is
> synchronized it may see an incorrect wallclock time.

What you are saying is, that you are fixing Xens failure to implement
a proper RTC emulation by hacking a notifier into the core code. You
can't be serious about that.

According to your changelog:

  Currently the Xen wallclock is only updated every 11 minutes if NTP is
  synchronized to its clocksource.

How is that related to clock_was_set() ?

clock_was_set*() is invoked from:

	do_settimeofday()
	timekeeping_inject_offset()
	timekeeping_set_tai_offset()
	timekeeping_inject_sleeptime()
	update_wall_time()
	do_adjtimex()

The only function which calls clock_was_set() and can affect RTC is
do_adjtimex(). Though you claim that the natural place to add a
notifier is clock_was_set().

So you went the other way round this time. In the hrtimers case you
tried to fix shortcomings of the core code in some random Xen
code. With this patch you try to fix Xen nonsense in the core code.

Can you please provide a proper explanation of the problem you are
trying to solve? This means that you should explain the semantics of
the desired XEN RTC emulation and not the desired workarounds to fix
the shortcommings current implementation.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-21 23:06       ` Thomas Gleixner
@ 2013-06-24 10:51         ` David Vrabel
  2013-06-24 16:30           ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-06-24 10:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On 22/06/13 00:06, Thomas Gleixner wrote:
> On Fri, 21 Jun 2013, David Vrabel wrote:
>> On 21/06/13 08:57, Thomas Gleixner wrote:
>>> On Thu, 20 Jun 2013, David Vrabel wrote:
>>>
>>>> From: David Vrabel <david.vrabel@citrix.com>
>>>>
>>>> The high resolution timer code gets notified of step changes to the
>>>> system time with clock_was_set() or clock_was_set_delayed() calls.  If
>>>> other parts of the kernel require similar notification there is no
>>>> clear place to hook into.
>>>
>>> You fail to explain why any other part of the kernel requires a
>>> notification.
>>
>> This is needed by patch 3 in this series.
>>
>> "The Xen wallclock is a software only clock within the Xen hypervisor
>> that is used by: a) PV guests as the equivalent of a hardware RTC; and
>> b) the hypervisor as the clock source for the emulated RTC provided to
>> HVM guests.
>>
>> Currently the Xen wallclock is only updated every 11 minutes if NTP is
>> synchronized to its clock source.  If a guest is started before NTP is
>> synchronized it may see an incorrect wallclock time.
> 
> What you are saying is, that you are fixing Xens failure to implement
> a proper RTC emulation by hacking a notifier into the core code. You
> can't be serious about that.

Xen does provide proper emulation of an RTC for guests.  Both full
hardware emulation for fully-virtualized guest (HVM) and a lighter
weight interface for paravirtualized guests (PV).

As with any emulated RTC, there needs to be underlying clocksource
providing the time.  Under Xen, this is the Xen wallclock and it is
implemented as a record of the date/time timestamped with the
corresponding Xen clocksource value[1].

KVM provides an identical wallclock to its guests -- see the common
pvclock_read_wallclock() function.

Xen has hardware drivers for only the minimal amount of hardware
necessary for the scheduling and isolation of guests.  It does not have
drivers for any hardware RTC nor does it have a network stack or an
implementation of NTP.  Therefore it has no way to maintain the
correctness of the Xen wallclock and relies on the control domain (dom0)
to do this.

Dom0 updates the Xen wallclock with the XENPF_settime platform_op hypercall.

To ensure the correctness of the Xen wallclock it is kept in sync with
dom0's system time (which is assumed to be correct and would typically
be corrected by NTP).

This requires that the Xen wallclock is both:

a) updated on step changes to system time.
b) updated periodically to correct for any drift.

This behaviour (keeping the wallclock in sync with dom0 system time) is
part of the ABI provided by the kernel and changing it would break
existing user space.

This patch set is fixing the rare case where a guest is started before
NTP has synced and thus sees an incorrect wallclock time which may cause
the guest to fail to boot.

> According to your changelog:
> 
>   Currently the Xen wallclock is only updated every 11 minutes if NTP is
>   synchronized to its clocksource.
> 
> How is that related to clock_was_set() ?

It's not. This is the update_persistent_clock() call from the periodic
sync_cmos_clock() work.

> clock_was_set*() is invoked from:
> 
> 	do_settimeofday()
> 	timekeeping_inject_offset()
> 	timekeeping_set_tai_offset()
> 	timekeeping_inject_sleeptime()
> 	update_wall_time()
> 	do_adjtimex()
> 
> The only function which calls clock_was_set() and can affect RTC is
> do_adjtimex(). Though you claim that the natural place to add a
> notifier is clock_was_set().
> 
> So you went the other way round this time. In the hrtimers case you
> tried to fix shortcomings of the core code in some random Xen
> code. With this patch you try to fix Xen nonsense in the core code.

KVM uses a very similar mechanism to maintain system time for a guest so
guest system time is synchronized with host system time.  See the
pvclock_gtod notifier chain and its usage in arch/x86/kvm/x86.c.

v3 of this series did use this existing notifier but it is called on
every timer tick so this is more expensive than necessary to meet the
requirements (see (a) and (b) above) for maintaining the Xen wallclock.
 John suggested hooking into clock_was_set().

> Can you please provide a proper explanation of the problem you are
> trying to solve? This means that you should explain the semantics of
> the desired XEN RTC emulation and not the desired workarounds to fix
> the shortcommings current implementation.

In summary, both Xen and KVM need to solve similar problems with keeping
time synchronized between a host and guests.

The key difference between the two hypervisors is that Xen synchronizes
the wallclock and KVM synchronizes system time.

David

[1] The Xen clocksource is monotonic, nanosecond resolution clocksource
provided by Xen for use internally and by guests.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-24 10:51         ` David Vrabel
@ 2013-06-24 16:30           ` Thomas Gleixner
  2013-06-24 17:00             ` David Vrabel
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2013-06-24 16:30 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On Mon, 24 Jun 2013, David Vrabel wrote:
> On 22/06/13 00:06, Thomas Gleixner wrote:
> This patch set is fixing the rare case where a guest is started before
> NTP has synced and thus sees an incorrect wallclock time which may cause
> the guest to fail to boot.

You're not fixing it, you are just making the window smaller.

clock_was_set() is called outside of the timekeeper_lock protected
regions, so what prevents the guest to start before the notifier is
invoked?

We already have a synchronous notifier in place and the notifier call
itself is not expensive. What's expensive is the hypercall and there
is no way at the moment to figure out whether the update is relevant
for you or just a tick. Though that's trivial information to provide
without imposing another notifier including the surrounding mess on
the core code.

Completely untested patch below.

Thanks,

	tglx
---
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index baeeb5c..6e9f838 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -200,9 +200,9 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)
 
 static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
 
-static void update_pvclock_gtod(struct timekeeper *tk)
+static void update_pvclock_gtod(struct timekeeper *tk, bool cws)
 {
-	raw_notifier_call_chain(&pvclock_gtod_chain, 0, tk);
+	raw_notifier_call_chain(&pvclock_gtod_chain, cws, tk);
 }
 
 /**
@@ -216,7 +216,7 @@ int pvclock_gtod_register_notifier(struct notifier_block *nb)
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
-	update_pvclock_gtod(tk);
+	update_pvclock_gtod(tk, true);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
 	return ret;
@@ -241,14 +241,15 @@ int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
 EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
 
 /* must hold timekeeper_lock */
-static void timekeeping_update(struct timekeeper *tk, bool clearntp, bool mirror)
+static void timekeeping_update(struct timekeeper *tk, bool clearntp,
+			       bool mirror, bool cws)
 {
 	if (clearntp) {
 		tk->ntp_error = 0;
 		ntp_clear();
 	}
 	update_vsyscall(tk);
-	update_pvclock_gtod(tk);
+	update_pvclock_gtod(tk, cws);
 
 	if (mirror)
 		memcpy(&shadow_timekeeper, &timekeeper, sizeof(timekeeper));
@@ -508,7 +509,7 @@ int do_settimeofday(const struct timespec *tv)
 
 	tk_set_xtime(tk, tv);
 
-	timekeeping_update(tk, true, true);
+	timekeeping_update(tk, true, true, true);
 
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -552,7 +553,7 @@ int timekeeping_inject_offset(struct timespec *ts)
 	tk_set_wall_to_mono(tk, timespec_sub(tk->wall_to_monotonic, *ts));
 
 error: /* even if we error out, we forwarded the time, so call update */
-	timekeeping_update(tk, true, true);
+	timekeeping_update(tk, true, true, true);
 
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -633,7 +634,7 @@ static int change_clocksource(void *data)
 		if (old->disable)
 			old->disable(old);
 	}
-	timekeeping_update(tk, true, true);
+	timekeeping_update(tk, true, true, true);
 
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -872,7 +873,7 @@ void timekeeping_inject_sleeptime(struct timespec *delta)
 
 	__timekeeping_inject_sleeptime(tk, delta);
 
-	timekeeping_update(tk, true, true);
+	timekeeping_update(tk, true, true, true);
 
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -954,7 +955,7 @@ static void timekeeping_resume(void)
 	tk->cycle_last = clock->cycle_last = cycle_now;
 	tk->ntp_error = 0;
 	timekeeping_suspended = 0;
-	timekeeping_update(tk, false, true);
+	timekeeping_update(tk, false, true, true);
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
@@ -1236,9 +1237,10 @@ out_adjust:
  * It also calls into the NTP code to handle leapsecond processing.
  *
  */
-static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
+static inline bool accumulate_nsecs_to_secs(struct timekeeper *tk)
 {
 	u64 nsecps = (u64)NSEC_PER_SEC << tk->shift;
+	bool ret = false;
 
 	while (tk->xtime_nsec >= nsecps) {
 		int leap;
@@ -1261,8 +1263,10 @@ static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
 			__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);
 
 			clock_was_set_delayed();
+			ret = true;
 		}
 	}
+	return ret;
 }
 
 /**
@@ -1348,6 +1352,7 @@ static void update_wall_time(void)
 	cycle_t offset;
 	int shift = 0, maxshift;
 	unsigned long flags;
+	bool cws;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 
@@ -1399,7 +1404,7 @@ static void update_wall_time(void)
 	 * Finally, make sure that after the rounding
 	 * xtime_nsec isn't larger than NSEC_PER_SEC
 	 */
-	accumulate_nsecs_to_secs(tk);
+	cws = accumulate_nsecs_to_secs(tk);
 
 	write_seqcount_begin(&timekeeper_seq);
 	/* Update clock->cycle_last with the new value */
@@ -1415,7 +1420,7 @@ static void update_wall_time(void)
 	 * updating.
 	 */
 	memcpy(real_tk, tk, sizeof(*tk));
-	timekeeping_update(real_tk, false, false);
+	timekeeping_update(real_tk, false, false, cws);
 	write_seqcount_end(&timekeeper_seq);
 out:
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -1677,6 +1682,7 @@ int do_adjtimex(struct timex *txc)
 
 	if (tai != orig_tai) {
 		__timekeeping_set_tai_offset(tk, tai);
+		update_pvclock_gtod(tk, true);
 		clock_was_set_delayed();
 	}
 	write_seqcount_end(&timekeeper_seq);

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-24 16:30           ` Thomas Gleixner
@ 2013-06-24 17:00             ` David Vrabel
  2013-06-24 17:50               ` John Stultz
  2013-06-24 19:55               ` Thomas Gleixner
  0 siblings, 2 replies; 23+ messages in thread
From: David Vrabel @ 2013-06-24 17:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On 24/06/13 17:30, Thomas Gleixner wrote:
> 
> We already have a synchronous notifier in place and the notifier call
> itself is not expensive. What's expensive is the hypercall and there
> is no way at the moment to figure out whether the update is relevant
> for you or just a tick. Though that's trivial information to provide
> without imposing another notifier including the surrounding mess on
> the core code.

This looks good, thanks.

> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
[...]
> @@ -508,7 +509,7 @@ int do_settimeofday(const struct timespec *tv)
>  
>  	tk_set_xtime(tk, tv);
>  
> -	timekeeping_update(tk, true, true);
> +	timekeeping_update(tk, true, true, true);

These three booleans in a row is getting a bit opaque. How about I also
change it to a set of flags?  e.g.,

timekeeping_updated(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

David

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-24 17:00             ` David Vrabel
@ 2013-06-24 17:50               ` John Stultz
  2013-06-24 19:55               ` Thomas Gleixner
  1 sibling, 0 replies; 23+ messages in thread
From: John Stultz @ 2013-06-24 17:50 UTC (permalink / raw)
  To: David Vrabel
  Cc: Thomas Gleixner, xen-devel, Konrad Rzeszutek Wilk, LKML,
	Ingo Molnar, Peter Zijlstra

On 06/24/2013 10:00 AM, David Vrabel wrote:
> On 24/06/13 17:30, Thomas Gleixner wrote:
>> @@ -508,7 +509,7 @@ int do_settimeofday(const struct timespec *tv)
>>   
>>   	tk_set_xtime(tk, tv);
>>   
>> -	timekeeping_update(tk, true, true);
>> +	timekeeping_update(tk, true, true, true);
> These three booleans in a row is getting a bit opaque. How about I also
> change it to a set of flags?  e.g.,
>
> timekeeping_updated(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

Yea. I'm not a fan of the bool arguments to functions (which I have to 
look up every time as which bool is which).

The bitflag approach is nicer in my mind, since its a bit more explicit 
when reading the code.

The other approach would be to have different function calls 
(timekeeping_clear_ntp, timekeeping_mirror, timekeeping_clock_was_set), 
which call into the same back end logic.

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] time: add a notifier chain for when the system time is stepped
  2013-06-24 17:00             ` David Vrabel
  2013-06-24 17:50               ` John Stultz
@ 2013-06-24 19:55               ` Thomas Gleixner
  1 sibling, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2013-06-24 19:55 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, LKML, John Stultz, Ingo Molnar,
	Peter Zijlstra

On Mon, 24 Jun 2013, David Vrabel wrote:

> On 24/06/13 17:30, Thomas Gleixner wrote:
> > 
> > We already have a synchronous notifier in place and the notifier call
> > itself is not expensive. What's expensive is the hypercall and there
> > is no way at the moment to figure out whether the update is relevant
> > for you or just a tick. Though that's trivial information to provide
> > without imposing another notifier including the surrounding mess on
> > the core code.
> 
> This looks good, thanks.
> 
> > --- a/kernel/time/timekeeping.c
> > +++ b/kernel/time/timekeeping.c
> [...]
> > @@ -508,7 +509,7 @@ int do_settimeofday(const struct timespec *tv)
> >  
> >  	tk_set_xtime(tk, tv);
> >  
> > -	timekeeping_update(tk, true, true);
> > +	timekeeping_update(tk, true, true, true);
> 
> These three booleans in a row is getting a bit opaque. How about I also
> change it to a set of flags?  e.g.,
> 
> timekeeping_updated(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

Fair enough. Can you convert the existing booleans first and then put
the CLOCK_WAS_SET patch on top of that?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases
  2013-06-20 19:13 David Vrabel
@ 2013-06-20 19:18 ` David Vrabel
  0 siblings, 0 replies; 23+ messages in thread
From: David Vrabel @ 2013-06-20 19:18 UTC (permalink / raw)
  To: David Vrabel
  Cc: xen-devel, Konrad Rzeszutek Wilk, linux-kernel, John Stultz,
	Thomas Gleixner

On 20/06/13 20:13, David Vrabel wrote:
> Xen guests use the Xen wallclock as their persistent clock.  This is a
> software only clock in the hypervisor that is used by guests instead
> of a real hardware RTC.

Sorry, I accidentally specified the wrong revision range and only
include 2 out 4 of the patches in the series.

I have resent the complete series.

David

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases
@ 2013-06-20 19:13 David Vrabel
  2013-06-20 19:18 ` David Vrabel
  0 siblings, 1 reply; 23+ messages in thread
From: David Vrabel @ 2013-06-20 19:13 UTC (permalink / raw)
  To: xen-devel
  Cc: David Vrabel, Konrad Rzeszutek Wilk, linux-kernel, John Stultz,
	Thomas Gleixner

Xen guests use the Xen wallclock as their persistent clock.  This is a
software only clock in the hypervisor that is used by guests instead
of a real hardware RTC.

The kernel has limited support for updating the persistent clock or
RTC when NTP is synced.  This has the following limitations:

* The persistent clock is not updated on step changes.  This leaves a
  window where it will be incorrect (while NTP resyncs).

* Xen guests use the Xen wallclock as their persistent clock.  dom0
  maintains this clock so it is persistent for domUs but not dom0
  itself.

These limitations mean that guests started before NTP is synchronized
will start with an incorrect wallclock time and the hardware RTC will
not be updated (as on bare metal).

These series fixes the above limitations and depends on "x86: increase
precision of x86_platform.get/set_wallclock()" which was previously
posted.

Changes since v4:

Dropped the change to disable non-boot CPUs during suspend on Xen as
migration downtime was too poor.  Instead, provide
hrtimers_late_resume() for use by Xen's resume code to replace the
call of clock_was_set().  Fix two unused variable warnings.

Changes since v3:

Add a new clock_was_set notifier chain. Use this instead of direct
calls to clock_was_set() from the timekeeping code.  Use this notifier
and a new timer to synchronize the Xen wallclock.

Changes since v2:

Don't peek at the timekeeper internals (use __current_kernel_time()
instead).  Use the native set_wallclock hook in dom0.

Changes since v1:

Reworked to use the pvclock_gtod notifier to sync the wallclock (this
looked similar to what a KVM host does).  update_persistent_clock()
will now only update the CMOS RTC.

David


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2013-06-24 19:55 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-20 19:16 [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases David Vrabel
2013-06-20 19:16 ` [PATCH 1/4] hrtimers: provide a hrtimers_late_resume() call David Vrabel
2013-06-21  7:53   ` Thomas Gleixner
2013-06-21 12:32     ` David Vrabel
2013-06-21 14:32       ` Thomas Gleixner
2013-06-21 17:30         ` David Vrabel
2013-06-21 21:24           ` Thomas Gleixner
2013-06-20 19:16 ` [PATCH 2/4] time: add a notifier chain for when the system time is stepped David Vrabel
2013-06-21  7:57   ` Thomas Gleixner
2013-06-21 12:41     ` David Vrabel
2013-06-21 23:06       ` Thomas Gleixner
2013-06-24 10:51         ` David Vrabel
2013-06-24 16:30           ` Thomas Gleixner
2013-06-24 17:00             ` David Vrabel
2013-06-24 17:50               ` John Stultz
2013-06-24 19:55               ` Thomas Gleixner
2013-06-21 16:22     ` John Stultz
2013-06-20 19:16 ` [PATCH 3/4] x86/xen: sync the wallclock " David Vrabel
2013-06-20 19:16 ` [PATCH 4/4] x86/xen: sync the CMOS RTC as well as the Xen wallclock David Vrabel
2013-06-20 20:03 ` [PATCHv5 0/4] xen: maintain an accurate persistent clock in more cases John Stultz
2013-06-21 18:31   ` Konrad Rzeszutek Wilk
  -- strict thread matches above, loose matches on Subject: below --
2013-06-20 19:13 David Vrabel
2013-06-20 19:18 ` David Vrabel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).