linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* clock jumps on dualcore PentiumD with cpufreq
@ 2007-12-09 12:14 Michael Tokarev
  2007-12-09 14:51 ` Rafael J. Wysocki
  0 siblings, 1 reply; 2+ messages in thread
From: Michael Tokarev @ 2007-12-09 12:14 UTC (permalink / raw)
  To: Linux-kernel

This isn't a new issue, but so far no solution(s)
has been found, it seems.  And with kernel development
going on, the issue becomes worse.

Up to 2.6.20 or so (I don't remember exactly, but if
I recall correctly the issue first appeared when new
timer code has been merged), there was no issues at
all.

Starting with 2.6.21 (at least), I see frequent messages
from chronyd (http://chrony.sunsite.dk/), which was running
on this machine since the day one, messages telling me
that system clock has been reset - up to +/- several
seconds.

On 2.6.22 there were only a few such messages per day,
and the time delta was less than a second - in between
0.2-0.8 second usually.  On 2.6.23, it become much,
much worse.  Here's a typical log from today:

...
Dec  9 08:10:50 paltus chronyd[1761]: System clock wrong by 0.664195 seconds, adjustment started
Dec  9 08:31:06 paltus chronyd[1761]: System clock wrong by 0.666263 seconds, adjustment started
Dec  9 09:10:35 paltus chronyd[1761]: System clock wrong by 0.830534 seconds, adjustment started
Dec  9 09:47:56 paltus chronyd[1761]: System clock wrong by -0.632430 seconds, adjustment started
Dec  9 10:11:24 paltus chronyd[1761]: System clock wrong by 0.557093 seconds, adjustment started
Dec  9 10:30:36 paltus chronyd[1761]: System clock wrong by 0.931442 seconds, adjustment started
Dec  9 10:31:40 paltus chronyd[1761]: System clock wrong by 0.918518 seconds, adjustment started
Dec  9 10:32:44 paltus chronyd[1761]: System clock wrong by 0.919524 seconds, adjustment started
Dec  9 10:33:48 paltus chronyd[1761]: System clock wrong by -1.783922 seconds, adjustment started
Dec  9 10:50:53 paltus chronyd[1761]: System clock wrong by 0.815781 seconds, adjustment started
Dec  9 11:18:37 paltus chronyd[1761]: System clock wrong by -1.120279 seconds, adjustment started
Dec  9 11:48:30 paltus chronyd[1761]: System clock wrong by -0.576473 seconds, adjustment started
Dec  9 11:50:38 paltus chronyd[1761]: System clock wrong by 0.728826 seconds, adjustment started
Dec  9 11:52:46 paltus chronyd[1761]: System clock wrong by 0.561940 seconds, adjustment started
Dec  9 11:57:02 paltus chronyd[1761]: System clock wrong by -0.786041 seconds, adjustment started
Dec  9 12:18:22 paltus chronyd[1761]: System clock wrong by -0.559688 seconds, adjustment started
Dec  9 12:20:30 paltus chronyd[1761]: System clock wrong by 0.712789 seconds, adjustment started
Dec  9 12:26:55 paltus chronyd[1761]: System clock wrong by -0.511155 seconds, adjustment started
Dec  9 12:40:47 paltus chronyd[1761]: System clock wrong by 0.721966 seconds, adjustment started
Dec  9 12:47:11 paltus chronyd[1761]: System clock wrong by -1.053725 seconds, adjustment started
Dec  9 13:01:03 paltus chronyd[1761]: System clock wrong by 0.921670 seconds, adjustment started
Dec  9 13:21:20 paltus chronyd[1761]: System clock wrong by 0.948468 seconds, adjustment started
...

As far as I can guess, the problem is that the clock is
different on different CPUs (or, rather, cores), so when
the process (chronyd) gets moved from one core to another,
it loses its minds and behaves like the above, unable to
"understand" what's going on, and the result is that it
speeds up the clock on a core where it should slow the
clock down and vise versa, resulting it a mess.

The machine is an IBM xSeries 3200, single CPU dual-core
PentiumD processor:

vendor_id	: GenuineIntel
cpu family	: 15
model		: 6
model name	: Intel(R) Pentium(R) D CPU 2.80GHz
stepping	: 4
cpu MHz		: 2800.000
cache size	: 2048 KB

The system says that TSC clocksource is unstable and switches
to jiffies:

Clocksource tsc unstable (delta = 70021228 ns)
Time: jiffies clocksource has been installed.

There are two things I noticed so far.

First, this problem is quite strongly related to cpufreq.
Namely, without cpufreq scaling (with the default "performance"
governor), everything works correctly.  After switching to
"ondemand" governor, chronyd goes mad as shown above.  And
after switching back to "performance", after some time (needed
for the madness to slowly stop), the problem vanishes again.
By the way, I'm not exactly sure but it seems that the above
message (about TSC being unstable) is triggered right when
switching to "ondemand" governor.

And second, I've only seen the whole problem with 32bits
kernel.  x86-64 kernel (with the exact same config) does
not expose it in any way, be it cpufreq or not.

This last point is the very interesting one, because it
suggests that the bug is in kernel somewhere, more, that
it's in i386 part of it.

Sadly I've only a limited testing abilities on this machine,
as it's our main production (file) server (so I can only
test something on it during weekends and late evenings).

Similar (but at much smaller extent/scale) behavior has been
observed on several machines with AMD X2-64 processors as well,
also related to cpufreq, but this time both 64 and 32 bits
kernels behaves the same, and it seems this is a different
issue.

Any recommendations for me to try to debug it further, please?..

Thanks.

/mjt

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: clock jumps on dualcore PentiumD with cpufreq
  2007-12-09 12:14 clock jumps on dualcore PentiumD with cpufreq Michael Tokarev
@ 2007-12-09 14:51 ` Rafael J. Wysocki
  0 siblings, 0 replies; 2+ messages in thread
From: Rafael J. Wysocki @ 2007-12-09 14:51 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Linux-kernel, Dave Jones, Ingo Molnar, Thomas Gleixner

Added some CCs.

On Sunday, 9 of December 2007, Michael Tokarev wrote:
> This isn't a new issue, but so far no solution(s)
> has been found, it seems.  And with kernel development
> going on, the issue becomes worse.
> 
> Up to 2.6.20 or so (I don't remember exactly, but if
> I recall correctly the issue first appeared when new
> timer code has been merged), there was no issues at
> all.
> 
> Starting with 2.6.21 (at least), I see frequent messages
> from chronyd (http://chrony.sunsite.dk/), which was running
> on this machine since the day one, messages telling me
> that system clock has been reset - up to +/- several
> seconds.
> 
> On 2.6.22 there were only a few such messages per day,
> and the time delta was less than a second - in between
> 0.2-0.8 second usually.  On 2.6.23, it become much,
> much worse.  Here's a typical log from today:
> 
> ...
> Dec  9 08:10:50 paltus chronyd[1761]: System clock wrong by 0.664195 seconds, adjustment started
> Dec  9 08:31:06 paltus chronyd[1761]: System clock wrong by 0.666263 seconds, adjustment started
> Dec  9 09:10:35 paltus chronyd[1761]: System clock wrong by 0.830534 seconds, adjustment started
> Dec  9 09:47:56 paltus chronyd[1761]: System clock wrong by -0.632430 seconds, adjustment started
> Dec  9 10:11:24 paltus chronyd[1761]: System clock wrong by 0.557093 seconds, adjustment started
> Dec  9 10:30:36 paltus chronyd[1761]: System clock wrong by 0.931442 seconds, adjustment started
> Dec  9 10:31:40 paltus chronyd[1761]: System clock wrong by 0.918518 seconds, adjustment started
> Dec  9 10:32:44 paltus chronyd[1761]: System clock wrong by 0.919524 seconds, adjustment started
> Dec  9 10:33:48 paltus chronyd[1761]: System clock wrong by -1.783922 seconds, adjustment started
> Dec  9 10:50:53 paltus chronyd[1761]: System clock wrong by 0.815781 seconds, adjustment started
> Dec  9 11:18:37 paltus chronyd[1761]: System clock wrong by -1.120279 seconds, adjustment started
> Dec  9 11:48:30 paltus chronyd[1761]: System clock wrong by -0.576473 seconds, adjustment started
> Dec  9 11:50:38 paltus chronyd[1761]: System clock wrong by 0.728826 seconds, adjustment started
> Dec  9 11:52:46 paltus chronyd[1761]: System clock wrong by 0.561940 seconds, adjustment started
> Dec  9 11:57:02 paltus chronyd[1761]: System clock wrong by -0.786041 seconds, adjustment started
> Dec  9 12:18:22 paltus chronyd[1761]: System clock wrong by -0.559688 seconds, adjustment started
> Dec  9 12:20:30 paltus chronyd[1761]: System clock wrong by 0.712789 seconds, adjustment started
> Dec  9 12:26:55 paltus chronyd[1761]: System clock wrong by -0.511155 seconds, adjustment started
> Dec  9 12:40:47 paltus chronyd[1761]: System clock wrong by 0.721966 seconds, adjustment started
> Dec  9 12:47:11 paltus chronyd[1761]: System clock wrong by -1.053725 seconds, adjustment started
> Dec  9 13:01:03 paltus chronyd[1761]: System clock wrong by 0.921670 seconds, adjustment started
> Dec  9 13:21:20 paltus chronyd[1761]: System clock wrong by 0.948468 seconds, adjustment started
> ...
> 
> As far as I can guess, the problem is that the clock is
> different on different CPUs (or, rather, cores), so when
> the process (chronyd) gets moved from one core to another,
> it loses its minds and behaves like the above, unable to
> "understand" what's going on, and the result is that it
> speeds up the clock on a core where it should slow the
> clock down and vise versa, resulting it a mess.
> 
> The machine is an IBM xSeries 3200, single CPU dual-core
> PentiumD processor:
> 
> vendor_id	: GenuineIntel
> cpu family	: 15
> model		: 6
> model name	: Intel(R) Pentium(R) D CPU 2.80GHz
> stepping	: 4
> cpu MHz		: 2800.000
> cache size	: 2048 KB
> 
> The system says that TSC clocksource is unstable and switches
> to jiffies:
> 
> Clocksource tsc unstable (delta = 70021228 ns)
> Time: jiffies clocksource has been installed.
> 
> There are two things I noticed so far.
> 
> First, this problem is quite strongly related to cpufreq.
> Namely, without cpufreq scaling (with the default "performance"
> governor), everything works correctly.  After switching to
> "ondemand" governor, chronyd goes mad as shown above.  And
> after switching back to "performance", after some time (needed
> for the madness to slowly stop), the problem vanishes again.
> By the way, I'm not exactly sure but it seems that the above
> message (about TSC being unstable) is triggered right when
> switching to "ondemand" governor.
> 
> And second, I've only seen the whole problem with 32bits
> kernel.  x86-64 kernel (with the exact same config) does
> not expose it in any way, be it cpufreq or not.
> 
> This last point is the very interesting one, because it
> suggests that the bug is in kernel somewhere, more, that
> it's in i386 part of it.
> 
> Sadly I've only a limited testing abilities on this machine,
> as it's our main production (file) server (so I can only
> test something on it during weekends and late evenings).
> 
> Similar (but at much smaller extent/scale) behavior has been
> observed on several machines with AMD X2-64 processors as well,
> also related to cpufreq, but this time both 64 and 32 bits
> kernels behaves the same, and it seems this is a different
> issue.
> 
> Any recommendations for me to try to debug it further, please?..
> 
> Thanks.
> 
> /mjt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2007-12-09 14:33 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-09 12:14 clock jumps on dualcore PentiumD with cpufreq Michael Tokarev
2007-12-09 14:51 ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).