* Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
@ 2013-04-22 22:50 Tom W
  2013-04-23  9:12 ` Ian Campbell
  2013-04-23  9:29 ` Jan Beulich
  0 siblings, 2 replies; 7+ messages in thread
From: Tom W @ 2013-04-22 22:50 UTC (permalink / raw)
  To: xen-devel



Hello Xen Developers! After researching this thoroughly ourselves and talking
to many Xen consultants, we have been advised to ask here about a rare Xen
bug we may be experiencing. Any help or advice would be much appreciated,
thanks in advance! We're also open to offering some financial support to
solve this problem.

*Here is a summary of the problem:*
-very infrequently, the domU clock instantly jumps ahead by a massive
amount of time and then appears to lock at the new time (i.e. time stops)
-this has happened only 3 times since Jan. 2013, on two different
physical rack-mounted machines with very similar parts and configuration,
both still running today
-the clock jumped ahead to the year 2264 in the first two occurrences and
only 3 days ahead in the third

*Here are more specific details:*
-when the incident occurred there was no heavy load, no high temperatures,
no hardware/memory/EDAC errors, no swapping, no errors reported anywhere
-the dom0 and other concurrently running domUs had no clock issues, the
hardware BIOS clock remained OK as well
-the clock did not slowly skew/drift ahead nor have we ever had any
skewing/drifting clock problems, it appears to have simply jumped to the
new date and stopped
-the hardware has only had CentOS dom0s and domUs (PV) running for multiple
years without incident, domUs have slowly been added with time
-we have many additional nearly identical production servers with multiple
domUs on each with very similar setups (same motherboard, CPU, RAM, OS,
etc.) that have had no clock issues yet
-the jumped domU clock can be corrected by running a "date -s" command with
any value which then syncs the domU clock back up with the dom0
-we don't use live migration, no saving/restoring and no maintenance was
taking place anywhere near or during time of jumps
-for all dom0s & domUs: independent_wallclock=0, ntpd is running,
clocksource=jiffies, Xen version 3.1.2
-incident #1: ~Sun Jan 13 13:31:01 CST 2013 to Sun Mar  6 04:39:20 CST 2264
| dom0=CentOS 5.8, Linux 2.6.18-308.20.1.el5xen | domU=CentOS 5.5, Linux
2.6.18-194.8.1.el5xen
-incident #2: ~Thu Mar 28 11:54:22 CDT 2013 to Thu May 19 07:32:28 CST 2264
| dom0=CentOS 5.5, Linux 2.6.18-194.11.3.el5xen | domU=CentOS 5.8, Linux
2.6.18-308.24.1.el5xen
-incident #3: ~Sun Mar 31 10:42:14 CDT 2013 to Wed Apr  3 14:28:31 CDT 2013
| dom versions same as #2
-dom0 specs: TYAN S5397 w/ latest BIOS v1.07, guest count=6/3, DDR ECC
RAM=48/64GB, 2 x Xeon E5420, LSI/Adaptec RAID, ~4 years old
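
For reference, the size of each jump can be worked out from the incident
timestamps listed above. The following is a minimal Python sketch, not part
of any Xen tooling. It assumes CST = UTC-6 and CDT = UTC-5, takes the zone
labels exactly as written in the list, and treats the approximate ("~")
start times as exact. The names `incidents` and `jumps` are made up for
illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets for the US Central zone labels used in the list.
CST = timezone(timedelta(hours=-6), "CST")
CDT = timezone(timedelta(hours=-5), "CDT")

# (time just before the jump, time the domU clock jumped to), per incident.
incidents = {
    1: (datetime(2013, 1, 13, 13, 31, 1, tzinfo=CST),
        datetime(2264, 3, 6, 4, 39, 20, tzinfo=CST)),
    2: (datetime(2013, 3, 28, 11, 54, 22, tzinfo=CDT),
        datetime(2264, 5, 19, 7, 32, 28, tzinfo=CST)),  # "CST" as reported
    3: (datetime(2013, 3, 31, 10, 42, 14, tzinfo=CDT),
        datetime(2013, 4, 3, 14, 28, 31, tzinfo=CDT)),
}

# Subtracting timezone-aware datetimes gives the true elapsed interval.
jumps = {n: after - before for n, (before, after) in incidents.items()}

for n, delta in sorted(jumps.items()):
    print("incident #%d: jumped ahead by %s (%.0f seconds)"
          % (n, delta, delta.total_seconds()))
```

Under these assumptions the first two jumps each come out at a little over
91,727 days (roughly 251 years) and differ from each other by only about
five and a half hours, while the third is 3 days, 3:46:17.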

*We already do or have now done the following:*
-full monitoring/logging for memory, disk, RAID, CPU, temperature, clock,
log watch etc. (nothing bad to report)
-enabled XEND and XENSTORED debugging (since the last failure, to provide
more info on any future jumps)
-ran MemTest for hours under increased heat conditions plus a minor
"stresstest" run; no errors reported, and fsck passed as well
-visual inspection of the hardware (no corrosion, matched CPUs, identical
properly slotted RAM, etc.)
-full dom0 & domU updates to CentOS 5.9, disabled ntpd on domU, kept domU
independent_wallclock=0

We have found no references to the same jump & stop clock issue on a domU
given our circumstances. From other clock issue discussions, it appears
that our root issue is probably with the jump itself and the clock stopping
behavior is probably just the domU waiting for the dom0 time to catch up.

We initially thought, and were advised, that bad hardware could be to blame,
but that may not be true given that the exact same issue surfaced on very
similar but separate hardware and that the dom0 and other resident domUs
were totally unaffected clock-wise.

With independent_wallclock=0 everywhere (i.e. dependent), we know NTP does
not need to be running in the domU because the domU gets its clock from the
dom0. We run NTP in the domU anyway to aid in our monitoring of the domU
clock; it should not matter, because nothing on the domU can set the clock
when independent_wallclock=0.
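
One way the domU clock monitoring mentioned above could flag both the jump
and the subsequent freeze is to compare how far the wall clock moved between
two samples with how far a monotonic clock moved. The following is only a
rough sketch under stated assumptions: `sample` and `classify` are
hypothetical names, the 5-second tolerance is arbitrary, `time.monotonic()`
needs Python 3.3 or later, and it assumes the guest's monotonic clock is not
itself disturbed by whatever causes the jump, which is not established here.

```python
import time

def sample():
    """Return one (wall_clock_seconds, monotonic_seconds) reading."""
    return (time.time(), time.monotonic())

def classify(prev, curr, tolerance=5.0):
    """Compare two samples and report how the wall clock behaved.

    prev and curr are (wall, monotonic) tuples as returned by sample().
    tolerance is the allowed disagreement in seconds (arbitrary default).
    """
    wall_delta = curr[0] - prev[0]
    mono_delta = curr[1] - prev[1]
    skew = wall_delta - mono_delta
    if skew > tolerance:
        return "jump-forward"          # wall clock ran ahead of elapsed time
    if skew < -tolerance:
        return "stalled-or-backward"   # wall clock frozen or stepped back
    return "ok"

if __name__ == "__main__":
    prev = sample()
    time.sleep(1)
    print(classify(prev, sample()))
```

Run from cron or a small loop, a check like this would report
"jump-forward" at the moment of a jump and "stalled-or-backward" for as long
as the clock stays frozen afterwards.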


Thread overview: 7+ messages
2013-04-22 22:50 Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring) Tom W
2013-04-23  9:12 ` Ian Campbell
2013-04-24  1:26   ` Tom W
2013-04-24  9:00     ` Ian Campbell
2013-04-23  9:29 ` Jan Beulich
2013-04-23  9:33   ` Ian Campbell
2013-04-23 17:50     ` Tom W
