* Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
@ 2013-04-22 22:50 Tom W
  2013-04-23  9:12 ` Ian Campbell
  2013-04-23  9:29 ` Jan Beulich
  0 siblings, 2 replies; 7+ messages in thread
From: Tom W @ 2013-04-22 22:50 UTC (permalink / raw)
  To: xen-devel


Hello Xen Developers! After fully researching ourselves and talking to many
Xen consultants, we have been advised to inquire here about a rare Xen bug
we are possibly experiencing. Any help or advice would be much appreciated,
thanks in advance! We're also open to offering some financial support to
solve this problem.

*Here is a summary of the problem:*
-very infrequently the domU clock is instantly jumping ahead a massive
amount of time and then appearing to lock on the new time (i.e. time stops)
-this has only happened 3 times since Jan. 2013 for us on two different
physical rack mounted machines that are still running today with very
similar parts and configuration
-the clock jumped ahead to the year 2264 in the first two occurrences and
only 3 days ahead in the third

*Here are more specific details:*
-when the incident occurred there was no heavy load, no high temperatures,
no hardware/memory/EDAC errors, no swapping, no errors reported anywhere
-the dom0 and other concurrently running domUs had no clock issues, the
hardware BIOS clock remained OK as well
-the clock did not slowly skew/drift ahead nor have we ever had any
skewing/drifting clock problems, it appears to have simply jumped to the
new date and stopped
-the hardware has only had CentOS dom0s and domUs (PV) running for multiple
years without incident, domUs have slowly been added with time
-we have many additional nearly identical production servers with multiple
domUs on each with very similar setups (same motherboard, CPU, RAM, OS,
etc.) that have had no clock issues yet
-the jumped domU clock can be corrected by running a "date -s" command with
any value which then syncs the domU clock back up with the dom0
-we don't use live migration, no saving/restoring and no maintenance was
taking place anywhere near or during time of jumps
-for all dom0s & domUs: independent_wallclock=0, ntpd is running,
clocksource=jiffies, Xen version 3.1.2
-incident #1: ~Sun Jan 13 13:31:01 CST 2013 to Sun Mar  6 04:39:20 CST 2264
| dom0=CentOS 5.8, Linux 2.6.18-308.20.1.el5xen | domU=CentOS 5.5, Linux
2.6.18-194.8.1.el5xen
-incident #2: ~Thu Mar 28 11:54:22 CDT 2013 to Thu May 19 07:32:28 CST 2264
| dom0=CentOS 5.5, Linux 2.6.18-194.11.3.el5xen | domU=CentOS 5.8, Linux
2.6.18-308.24.1.el5xen
-incident #3: ~Sun Mar 31 10:42:14 CDT 2013 to Wed Apr  3 14:28:31 CDT 2013
| dom versions same as #2
-dom0 specs: TYAN S5397 w/ latest BIOS v1.07, guest count=6/3, DDR ECC
RAM=48/64GB, 2 x Xeon E5420, LSI/Adaptec RAID, ~4 years old
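
For anyone comparing the three incidents, the size of each jump can be worked out directly from the timestamps listed above. The following is only a rough sketch: it assumes CST means UTC-6 and CDT means UTC-5 (US Central time), it takes the zone labels exactly as written in the report, and the "before" times are approximate in the original, so the results are approximate too. The names `INCIDENTS` and `jump_seconds` are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the abbreviations in the report are US Central time.
CST = timezone(timedelta(hours=-6))
CDT = timezone(timedelta(hours=-5))

# (before, after) pairs copied from the incident list above.
# The "before" values are marked approximate ("~") in the report.
INCIDENTS = {
    1: (datetime(2013, 1, 13, 13, 31, 1, tzinfo=CST),
        datetime(2264, 3, 6, 4, 39, 20, tzinfo=CST)),
    2: (datetime(2013, 3, 28, 11, 54, 22, tzinfo=CDT),
        datetime(2264, 5, 19, 7, 32, 28, tzinfo=CST)),
    3: (datetime(2013, 3, 31, 10, 42, 14, tzinfo=CDT),
        datetime(2013, 4, 3, 14, 28, 31, tzinfo=CDT)),
}


def jump_seconds(before, after):
    """Return the size of a clock jump in whole seconds."""
    return int((after - before).total_seconds())


if __name__ == "__main__":
    for number, (before, after) in INCIDENTS.items():
        seconds = jump_seconds(before, after)
        print(f"incident #{number}: {seconds} s (~{seconds / 86400:.1f} days)")
```

Under these assumptions the first two jumps come out at roughly 251 years each and differ from one another by only a few hours, while the third is a little over three days. That may or may not be significant, but it is the kind of figure worth having at hand when reading the time code.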

*We already do or have now done the following:*
-full monitoring/logging for memory, disk, RAID, CPU, temperature, clock,
log watch etc. (nothing bad to report)
-enabled XEND and XENSTORED debugging (since last failure to provide more
info for potential future jumps)
-ran MemTest for hours under increased heat conditions and minor
"stresstest" run, no errors reported, fsck passed as well
-visual inspection of the hardware (no corrosion, matched CPUs, identical
properly slotted RAM, etc.)
-full dom0 & domU updates to CentOS 5.9, disabled ntpd on domU, kept domU
independent_wallclock=0

We have found no references to the same jump & stop clock issue on a domU
given our circumstances. From other clock issue discussions, it appears
that our root issue is probably with the jump itself and the clock stopping
behavior is probably just the domU waiting for the dom0 time to catch up.

We initially thought, and were advised, that bad hardware could be to blame,
but that may not be true given that the exact same issue surfaced on very
similar but separate hardware and that the dom0 and other resident domUs
were totally unaffected clock-wise.

With all independent_wallclock=0 (i.e. dependent), we know NTP does not
need to be running in the domU because it's getting its clock from the
dom0, but we run NTP anyway in the domU to aid in our monitoring of the
domU clock and it should not matter because nothing on the domU can set the
clock when independent_wallclock=0.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

* Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
  2013-04-22 22:50 Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring) Tom W
@ 2013-04-23  9:12 ` Ian Campbell
  2013-04-24  1:26   ` Tom W
  2013-04-23  9:29 ` Jan Beulich
  1 sibling, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2013-04-23  9:12 UTC (permalink / raw)
  To: Tom W; +Cc: xen-devel

On Mon, 2013-04-22 at 23:50 +0100, Tom W wrote:
> Hello Xen Developers! After fully researching ourselves and talking to
> many Xen consultants, we have been advised to inquire here about a
> rare Xen bug we are possibly experiencing. Any help or advice would be
> much appreciated, thanks in advance! We're also open to offering some
> financial support to solve this problem.

Does your hypervisor tree have this commit in it:

        commit 84628ee52a427b0f0fe50502eb8ffd0eedad0f03
        Author: Jan Beulich <jbeulich@suse.com>
        Date:   Mon Nov 26 17:20:39 2012 +0100
        
            x86/time: fix scale_delta() inline assembly

That was responsible for a rash of strange time jumps, although IIRC it
affected the whole system and not individual VMs.

It might be worth looking at the scale_delta function in your kernel,
which I think you will find in arch/i386/kernel/time-xen.c. There was a
fix made to this code in the upstream kernel which may be missing there:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=de2d1a524e94a79078d9fe22c57c0c6009237547

I have no idea if this fix is relevant to the kernel (or compiler etc)
you are using, but it looks interesting...
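
As background on what scale_delta is meant to compute, here is a minimal sketch of the arithmetic only. It is not the RHEL5 time-xen.c code (the real function uses inline assembly, as the commit title above indicates); it assumes the usual scheme in which a tick delta is pre-shifted and then multiplied by a 32.32 fixed-point fraction. The 2.5 GHz frequency and the multiplier derived from it are hypothetical values chosen for illustration.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit unsigned wrap-around


def scale_delta(delta, mul_frac, shift):
    """Convert a tick delta to nanoseconds.

    delta    -- elapsed ticks, treated as an unsigned 64-bit value
    mul_frac -- 32-bit multiplier, interpreted as a fraction of 2**32
    shift    -- signed pre-shift applied to delta before multiplying
    """
    if shift < 0:
        delta >>= -shift
    else:
        delta = (delta << shift) & MASK64
    # The product is up to 96 bits wide; dropping the low 32 bits
    # leaves the scaled 64-bit result.
    return (delta * mul_frac) >> 32


if __name__ == "__main__":
    # Hypothetical 2.5 GHz tick source: 1 tick = 0.4 ns, so the
    # multiplier is 0.4 * 2**32 (rounded down) with no pre-shift.
    mul_frac = (1_000_000_000 << 32) // 2_500_000_000
    one_second_of_ticks = 2_500_000_000
    print(scale_delta(one_second_of_ticks, mul_frac, 0), "ns")
```

The point of the sketch is that the result depends on the full-width intermediate product being handled correctly; whether the function in a given kernel does so is exactly what would need checking in the source.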

> Here is a summary of the problem:
> -very infrequently the domU clock is instantly jumping ahead a massive
> amount of time and then appearing to lock on the new time (i.e. time
> stops)

Some kernels (I expect including yours) contain a "latch" so that time
always appears monotonic, which means that if time glitches forwards and
then back again it will appear to lock at the later time. Look for
monotonic in arch/i386/kernel/time-xen.c for the code.

If you were able to add some debugging to the kernel you should be able
to observe this latching, in fact a single shot debug print when the
latched time is way ahead of the current time would be a useful
diagnostic tool IMHO.
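
To illustrate the latching behaviour and the single shot debug print described above, here is a small model. It is a sketch of the idea only, not the code in time-xen.c; the class name `MonotonicLatch` and the 10 second warning threshold are invented for the example.

```python
class MonotonicLatch:
    """Never report a time earlier than one already reported."""

    def __init__(self, warn_threshold_ns=10 * 1_000_000_000):
        self.latched_ns = 0
        self.warn_threshold_ns = warn_threshold_ns
        self.warned = False  # makes the diagnostic single shot

    def read(self, raw_ns):
        """Return raw_ns, or the latched value if raw_ns is behind it."""
        if raw_ns > self.latched_ns:
            self.latched_ns = raw_ns
        elif (not self.warned
              and self.latched_ns - raw_ns > self.warn_threshold_ns):
            self.warned = True
            print("latched time is", self.latched_ns - raw_ns,
                  "ns ahead of the raw time")
        return self.latched_ns


if __name__ == "__main__":
    latch = MonotonicLatch()
    # One bogus reading far in the future, then normal readings again.
    for raw in (100, 200, 10**18, 300, 400):
        print(raw, "->", latch.read(raw))
```

After the one bogus reading, every later call returns the same latched value until real time catches up, which from inside the guest looks like a clock that jumped and then stopped.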

> -for all dom0s & domUs: independent_wallclock=0, ntpd is running,
> clocksource=jiffies, Xen version 3.1.2

I know there are a lot of suggestions to set clocksource=jiffies floating
around on the Internet but I am far from convinced that it is a good
idea.

I won't rule out it being a useful workaround for kernel+hypervisors of
the vintage you are using, but I think it would be interesting to try
without it.

You've already noticed that independent_wallclock=0 and ntpd are
inconsistent, so that's good.

Ian.

* Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
  2013-04-22 22:50 Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring) Tom W
  2013-04-23  9:12 ` Ian Campbell
@ 2013-04-23  9:29 ` Jan Beulich
  2013-04-23  9:33   ` Ian Campbell
  1 sibling, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2013-04-23  9:29 UTC (permalink / raw)
  To: Tom W; +Cc: xen-devel

>>> On 23.04.13 at 00:50, Tom W <tcte.tech@gmail.com> wrote:
> -for all dom0s & domUs: independent_wallclock=0, ntpd is running,
> clocksource=jiffies, Xen version 3.1.2

I can only second Ian's recommendation to drop this clocksource=
option.

And unless this was a typo, you surely want to get off that really
old hypervisor. Nobody's going to help you with issues there, if
they're not reproducible on recent Xen. Even if it was meant to
read 4.1.2, you should update to (or at least check against)
4.1.4 or 4.1.5-rc before claiming to have an unsolved problem.

Jan

* Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
  2013-04-23  9:29 ` Jan Beulich
@ 2013-04-23  9:33   ` Ian Campbell
  2013-04-23 17:50     ` Tom W
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2013-04-23  9:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Tom W, xen-devel

On Tue, 2013-04-23 at 10:29 +0100, Jan Beulich wrote:
> >>> On 23.04.13 at 00:50, Tom W <tcte.tech@gmail.com> wrote:
> > -for all dom0s & domUs: independent_wallclock=0, ntpd is running,
> > clocksource=jiffies, Xen version 3.1.2
> 
> I can only second Ian's recommendation to drop this clocksource=
> option.
> 
> And unless this was a typo, you surely want to get off that really
> old hypervisor.

FWIW I had assumed this was the RHEL5/CentOS5 supplied hypervisor (it's
the right era at least) and not a typo.

> Nobody's going to help you with issues there,

If I'm right then it would be better to start by reporting a RHEL bug
IMHO.

Ian.

* Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
  2013-04-23  9:33   ` Ian Campbell
@ 2013-04-23 17:50     ` Tom W
  0 siblings, 0 replies; 7+ messages in thread
From: Tom W @ 2013-04-23 17:50 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Jan Beulich, xen-devel


Thanks for the feedback Ian and Jan!

It was not a typo, we are using RHEL5/CentOS5 which started in 2007 and is
not fully EOL until 2020 but the production phase ends in 2017. We do
understand your point though about getting on a much newer version but
unfortunately we are a small operation and that type of change in the short
term for our existing systems is very cost prohibitive.

"jiffies" is the only clock source option in all our dom0s and domUs as per
the following output, so switching sources does not appear to be an option:

    $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    jiffies

Are you thinking that's a potential sign of something off if the dom0 only
has the one jiffies option? We're using the default CentOS install and have
no special boot settings related to timing or the clock.
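
For checking many dom0s and domUs at once, the same sysfs query can be scripted. This is a minimal sketch that assumes a Python 3 interpreter is available (stock CentOS 5 ships a much older Python, so it may need to be run from a newer host or adapted). The helper names are made up; the current_clocksource file is assumed to sit next to available_clocksource, as it does on kernels that expose this sysfs directory.

```python
from pathlib import Path

# Directory taken from the command above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def parse_clocksources(text):
    """Split the space-separated sysfs listing into a list of names."""
    return text.split()


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (available, current) clock sources read from sysfs."""
    available = parse_clocksources(
        (base / "available_clocksource").read_text())
    current = (base / "current_clocksource").read_text().strip()
    return available, current


if __name__ == "__main__":
    available, current = read_clocksources()
    print("available:", ", ".join(available))
    print("current:  ", current)
    if available == ["jiffies"]:
        print("note: jiffies is the only clock source offered")
```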

We had previously read Jan's "fix scale_delta() inline assembly" thread, but
based on that discussion and all related threads we didn't think it really
applied to our situation; perhaps it does. As well, our jump appears to be
much larger and far less frequent than the ones others reported. We
will figure out how to check if that change is included in our tree and get
back to you on what we find.

The latching clock behavior seems appropriate given the situation but it
seems potentially odd that the clock can then be fixed by simply issuing a
"date -s" command on the domU when independent_wallclock=0. Should it not
stay latched on the future date?

We shall also try the suggested RHEL bug submission path and see where that
leads, thanks.

If we're stuck for the short/medium term on the latest CentOS5 release with
clocksource=jiffies, would switching our domU systems to
independent_wallclock=1 and continuing to run ntpd have any better chance
of bypassing the potential issue causing the jump or is it possible it
could make things worse?


On Tue, Apr 23, 2013 at 5:33 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:

> On Tue, 2013-04-23 at 10:29 +0100, Jan Beulich wrote:
> > >>> On 23.04.13 at 00:50, Tom W <tcte.tech@gmail.com> wrote:
> > > -for all dom0s & domUs: independent_wallclock=0, ntpd is running,
> > > clocksource=jiffies, Xen version 3.1.2
> >
> > I can only second Ian's recommendation to drop this clocksource=
> > option.
> >
> > And unless this was a typo, you surely want to get off that really
> > old hypervisor.
>
> FWIW I had assumed this was the RHEL5/CentOS5 supplied hypervisor (it's
> the right era at least) and not a typo.
>
> > Nobody's going to help you with issues there,
>
> If I'm right then it would be better to start by reporting a RHEL bug
> IMHO.
>
> Ian.
>
>

* Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
  2013-04-23  9:12 ` Ian Campbell
@ 2013-04-24  1:26   ` Tom W
  2013-04-24  9:00     ` Ian Campbell
  0 siblings, 1 reply; 7+ messages in thread
From: Tom W @ 2013-04-24  1:26 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel


> Does your hypervisor tree have this commit in it:
>
>         commit 84628ee52a427b0f0fe50502eb8ffd0eedad0f03
>         Author: Jan Beulich <jbeulich@suse.com>
>         Date:   Mon Nov 26 17:20:39 2012 +0100
>
>             x86/time: fix scale_delta() inline assembly
>
> That was responsible for a rash of strange time jumps, although IIRC it
> affected the whole system and not individual VMs.
>
> It might be worth looking at the scale_delta function in your kernel,
> which I think you will find in arch/i386/kernel/time-xen.c. There was a
> fix made to this code in the upstream kernel which may be missing there:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=de2d1a524e94a79078d9fe22c57c0c6009237547
>
>
After looking at the latest source for RHEL5, the scale_delta method does
not have this change, nor does it have the fix Jan described here:
http://markmail.org/message/cngzubj6b6vdo55a

The latest RHEL6 does however have the change you described above.

Would changing independent_wallclock=1 bypass the need for the domU system
to call this potentially bad scale_delta method in RHEL5?

Thanks Ian.

* Re: Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)
  2013-04-24  1:26   ` Tom W
@ 2013-04-24  9:00     ` Ian Campbell
  0 siblings, 0 replies; 7+ messages in thread
From: Ian Campbell @ 2013-04-24  9:00 UTC (permalink / raw)
  To: Tom W; +Cc: xen-devel

On Wed, 2013-04-24 at 02:26 +0100, Tom W wrote:


> Would changing independent_wallclock=1 bypass the need for the domU
> system to call this potentially bad scale_delta method in RHEL5?

In principle the system time (which uses scale_delta) and the wallclock
time (which independent_wallclock controls) are separate things; however,
your use of clocksource=jiffies and such an old kernel makes me unsure
whether they are intertwined in your environment.

The best way to know for sure would be to rebuild your kernel with some
debugging added and test that.

Ian.

Thread overview: 7+ messages
2013-04-22 22:50 Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring) Tom W
2013-04-23  9:12 ` Ian Campbell
2013-04-24  1:26   ` Tom W
2013-04-24  9:00     ` Ian Campbell
2013-04-23  9:29 ` Jan Beulich
2013-04-23  9:33   ` Ian Campbell
2013-04-23 17:50     ` Tom W