linux-kernel.vger.kernel.org archive mirror
* System time warping around real time problem - please help
@ 2003-03-25 16:32 Fionn Behrens
  2003-03-25 17:07 ` Richard B. Johnson
  0 siblings, 1 reply; 19+ messages in thread
From: Fionn Behrens @ 2003-03-25 16:32 UTC (permalink / raw)
  To: linux-kernel


Hello all,


I have got an increasingly annoying problem with our fairly new (fall
'02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
The only kernel patch applied is Alan Cox's ptrace patch.

To say it right away: the system is not overclocked or anything and
never was. It has decent cooling and is used as a combined workstation
and server.

I can't say exactly when it started, but the system clock tends to begin
jumping around real time in an erratic manner, usually after about 12-48
hours of uptime. The maximum time jump is about 5 seconds back or forth
so the time is always "about" right.
To give you an example to visualize, you can watch asclock in X and see
the second clock-hand jumping like 3 seconds backwards, then 5 seconds
forth, 2 back and 1 forth or so within 2 or 3 seconds.
For a demonstration I wrote the following short example in python:

from time import time

t = 0
while 1:
  n = time()
  # report every reading that is earlier than the previous one
  if t > n: print("%.2f > %.2f" % (t, n))
  t = n

Running this loop returned the following lines:

1048608745.61 > 1048608745.60
1048608745.63 > 1048608745.62
1048608745.65 > 1048608745.64
1048608748.23 > 1048608745.67
1048608748.27 > 1048608745.71
1048608748.30 > 1048608745.74
1048608748.34 > 1048608745.78
1048608748.42 > 1048608745.86
1048608748.47 > 1048608745.91
1048608748.52 > 1048608745.96
[----cut----]

So you see the time() on this system is constantly overtaking itself and
jumping back. It almost looks like two parallel time()s are there and it
switches back and forth between them.

I recompiled the kernel, upgraded the BIOS to the latest version
available, disabled ntp and tried a few more things I don't recall
right now - no success. Due to the erratic timer, working on the
machine is no fun. Software crashes regularly, naturally: no programmer
expects system timers to go back in time.

I am pretty desperate and I'd appreciate any hints on what to check.
I'll gladly provide any system detail you might need for a proper
analysis, on request, by email or on freenode (Fionn).

Thank you in advance,
   			F. Behrens (Not a subscriber of this list)
-- 
I believe we have been called by history to lead the world.
                                                 G.W. Bush, 2002-03-01

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: System time warping around real time problem - please help
  2003-03-25 16:32 System time warping around real time problem - please help Fionn Behrens
@ 2003-03-25 17:07 ` Richard B. Johnson
  2003-03-25 17:17   ` Tim Schmielau
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Richard B. Johnson @ 2003-03-25 17:07 UTC (permalink / raw)
  To: Fionn Behrens; +Cc: linux-kernel

On Tue, 25 Mar 2003, Fionn Behrens wrote:

>
> Hello all,
>
>
> I have got an increasingly annoying problem with our fairly new (fall
> '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
> The only kernel patch applied is Alan Cox's ptrace patch.
>

I am using the exact same kernel (a lot of folks are). There
is no such jumping on my system.
Try this program:

#include <stdio.h>
#include <time.h>
int main(void) {
   time_t x,y;
   (void)time(&x);
   (void)time(&y);
   for(;;) {
       (void)time(&x);
       /* report any reading earlier than the previous one */
       if(x < y)
           printf("Prev %ld New %ld\n", (long)y, (long)x);
       y = x;
   }
   return 0;
}

If this shows time jumping around you have one of the following:

(1)	Bad timer channel 0 chip (PIT).
(2)	Some daemon trying to sync time with another system.
(3)	You are traveling too close to the speed of light.

Now, your script shows time in fractional seconds.

> 1048608745.61 > 1048608745.60

You can modify the program to do this:


#include <stdio.h>
#include <sys/time.h>
int main(void) {
   struct timeval tv;
   double x, y;
   (void)gettimeofday(&tv, NULL);
   /* time in microseconds since the epoch */
   x = (double) tv.tv_sec * 1e6;
   x += (double) tv.tv_usec;
   y = x;
   for(;;) {
       (void)gettimeofday(&tv, NULL);
       x = (double) tv.tv_sec * 1e6;
       x += (double) tv.tv_usec;
       /* report any reading earlier than the previous one */
       if(x < y)
           printf("Prev %f New %f\n", y, x);
       y = x;
   }
   return 0;
}

There should be no jumping around -- and there isn't on
any system I've tested this on.

> Software crashes are regularly - naturally. No programmer expects system
> timers going back in time.
>

Hmmm, software should never crash. Even if the timers jump backwards
as you say, they should eventually time out. If you have crashes, this
may point to other hardware problems as well.
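
As a minimal sketch of that point (in Python, like the original test
loop), a time-out check can treat a backward clock step as zero elapsed
time instead of a negative interval. The helper names elapsed_clamped
and timed_out are hypothetical, chosen only for this illustration:

```python
def elapsed_clamped(start, now):
    """Seconds elapsed since `start`; a backward clock step counts as zero."""
    return max(0.0, now - start)


def timed_out(start, now, timeout):
    """True once at least `timeout` seconds have passed since `start`."""
    return elapsed_clamped(start, now) >= timeout
```

With this, a reading that lies before the start time only delays the
time-out; it never produces a negative interval.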


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.



* Re: System time warping around real time problem - please help
  2003-03-25 17:07 ` Richard B. Johnson
@ 2003-03-25 17:17   ` Tim Schmielau
  2003-03-25 18:12   ` Fionn Behrens
  2003-03-25 21:16   ` Fionn Behrens
  2 siblings, 0 replies; 19+ messages in thread
From: Tim Schmielau @ 2003-03-25 17:17 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Fionn Behrens, linux-kernel

On Tue, 25 Mar 2003, Richard B. Johnson wrote:

> On Tue, 25 Mar 2003, Fionn Behrens wrote:
>
> > I have got an increasingly annoying problem with our fairly new (fall
> > '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
> > The only kernel patch applied is Alan Cox's ptrace patch.
> >
>
[...]
> If this shows time jumping around you have one of either:
>
> (1)	Bad timer channel 0 chip (PIT).
> (2)	Some daemon trying to sync time with another system.
> (3)	You are traveling too close to the speed of light.

(4) Unsync'ed TSCs?

See help text for CONFIG_X86_TSC_DISABLE. Never had this problem
myself, though.



* Re: System time warping around real time problem - please help
  2003-03-25 17:07 ` Richard B. Johnson
  2003-03-25 17:17   ` Tim Schmielau
@ 2003-03-25 18:12   ` Fionn Behrens
  2003-03-25 18:29     ` Richard B. Johnson
  2003-03-25 21:16   ` Fionn Behrens
  2 siblings, 1 reply; 19+ messages in thread
From: Fionn Behrens @ 2003-03-25 18:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: root

On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
> On Tue, 25 Mar 2003, Fionn Behrens wrote:

> > I have got an increasingly annoying problem with our fairly new (fall
> > '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.

> I am using the exact same kernel (a lot of folks are). There
> is no such jumping on my system.
> Try this program:

[... prg1.c ...]

> If this shows time jumping around you have one of either:
> 
> (1)	Bad timer channel 0 chip (PIT).
> (2)	Some daemon trying to sync time with another system.
> (3)	You are traveling too close to the speed of light.

It just exits immediately with exit code 1. (*shrug*)

> Now, your script shows time in fractional seconds.
> 
> > 1048608745.61 > 1048608745.60
> 
> You can modify the program to do this:

[... prg2.c ...]

> There should be no jumping around -- and there isn't on
> any system I've tested this on.

When I run this code it begins to put out Prev N New M lines.

Prev 1048615862810879.000000 New 1048615862759879.000000
Prev 1048615862870879.000000 New 1048615862819878.000000
Prev 1048615862900879.000000 New 1048615862849902.000000
Prev 1048615862960882.000000 New 1048615862909875.000000
[-------- cut --------]

After a few seconds of run time random processes on my machine begin to
crash, or I get kernel oopses and kernel freezes. Looks very much like
heavy use of gettimeofday() causes random writes in system memory.

> > Software crashes are regularly - naturally. No programmer expects system
> > timers going back in time.

> Hmmm, software should never crash. Even if the timers jump backwards
> as you say, they should eventually time-out. If you have crashes, this
> may point to other hardware problems as well.

E.g. which type of hardware problem?

Thanks a million for your help so far, it is great to experience how
fast people are responding!

I'll evaluate that other suggestion about TSC_DISABLE now and will get
back to you as soon as I can tell you more.

Kind regards,
		F. Behrens


* Re: System time warping around real time problem - please help
  2003-03-25 18:12   ` Fionn Behrens
@ 2003-03-25 18:29     ` Richard B. Johnson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard B. Johnson @ 2003-03-25 18:29 UTC (permalink / raw)
  To: poster; +Cc: Linux kernel

On Tue, 25 Mar 2003, Fionn Behrens wrote:

> On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
> > On Tue, 25 Mar 2003, Fionn Behrens wrote:
>
> > > I have got an increasingly annoying problem with our fairly new (fall
> > > '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
>
> > I am using the exact same kernel (a lot of folks are). There
> > is no such jumping on my system.
> > Try this program:
>
> [... prg1.c ...]
>
> > If this shows time jumping around you have one of either:
> >
> > (1)	Bad timer channel 0 chip (PIT).
> > (2)	Some daemon trying to sync time with another system.
> > (3)	You are traveling too close to the speed of light.
>
> It just exits immediately with exit code 1. (*shrug*)
>

Hmmm. Note that the for(;;) { } provides no exit path.
So, you probably have some bad RAM or your CPU is too
hot (broken fan??), or something like that.


> > Now, your script shows time in fractional seconds.
> >
> > > 1048608745.61 > 1048608745.60
> >
> > You can modify the program to do this:
>
> [... prg2.c ...]
>
> > There should be no jumping around -- and there isn't on
> > any system I've tested this on.
>
> When I run this code it begins to put out Prev N New M lines.
>
> Prev 1048615862810879.000000 New 1048615862759879.000000
> Prev 1048615862870879.000000 New 1048615862819878.000000
> Prev 1048615862900879.000000 New 1048615862849902.000000
> Prev 1048615862960882.000000 New 1048615862909875.000000
> [-------- cut --------]
>
> After a few seconds of run time random processes on my machine begin to
> crash, or I get kernel oopses and kernel freezes. Looks very much like
> heavy use of gettimeofday() causes random writes in system memory.
>

Looks very much like you have a real bad hardware problem.


>
> E.g. which type of hardware problem?
>

Look inside and see if your CPU fan has stopped. Also move your RAM
sticks around after wiping any dirt off the contacts. Since the
machine used to work last fall, it's probably just a fan or RAM
problem.


> Thanks a million for your help so far, it is great to experience how
> fast people are responding!
>
> I'll evaluate that other suggestion about TSC_DISABLE now and will get
> back to you as soon as I can tell you more.
>

I doubt that this will help you, but it's worth trying.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.



* Re: System time warping around real time problem - please help
  2003-03-25 17:07 ` Richard B. Johnson
  2003-03-25 17:17   ` Tim Schmielau
  2003-03-25 18:12   ` Fionn Behrens
@ 2003-03-25 21:16   ` Fionn Behrens
  2003-03-25 22:14     ` george anzinger
  2 siblings, 1 reply; 19+ messages in thread
From: Fionn Behrens @ 2003-03-25 21:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: root, tim

Richard B. Johnson wrote:

> On Tue, 25 Mar 2003, Fionn Behrens wrote: 
> > On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
> > > On Tue, 25 Mar 2003, Fionn Behrens wrote:
> >
> > > > I have got an increasingly annoying problem with our fairly new
> > > > (fall '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 
> > > > 2.4.20.
> >
> > > I am using the exact same kernel (a lot of folks are). There
> > > is no such jumping on my system.
> > > Try this program:
> >
> > [... prg1.c ...]
> >
> > > If this shows time jumping around you have one of either:
> > >
> > > (1) Bad timer channel 0 chip (PIT).
> > > (2) Some daemon trying to sync time with another system.
> > > (3) You are traveling too close to the speed of light.
> >
> > It just exits immediately with exit code 1. (*shrug*)

> Hmmm. Note that the for(;;) { } provides no exit path.

I did notice that, and investigated the issue using ddd. Funnily enough,
the program runs fine in ddd until X crashes. But in the shell it still
behaves as if it were nothing but exit(1);
 
> So, you probably have some bad RAM or your CPU is too 
> hot (broken fan??), or something like that. 

None of the above. The system is liquid cooled and subject to continuous
thermal monitoring. The RAM is 1GB Infineon ECC. Before the weekend I
had the machine running overnight with memtest86 - 14 hours, all tests
activated. Not a single error.
I also tried an endless kernel compile loop the other day and the
machine compiled about 100 kernels in approx two hours without a hitch.

> > [... prg2.c ...]
> >
> > When I run this code it begins to put out Prev N New M lines.

> > Prev 1048615862810879.000000 New 1048615862759879.000000

> > After a few seconds of run time random processes on my machine begin
> > to crash, or I get kernel oopses and kernel freezes. Looks very 
> > much like heavy use of gettimeofday() causes random writes in system
> > memory.

> Looks very much like you have a real bad hardware problem. 

Just what is the question. With the notsc option activated, the system
has not yet shown the warp symptoms, but as I noted at the beginning,
it may well take a day or two for that to happen.

Still, running the first (in ddd) or second test program - despite
the current absence of any error message - causes random processes to
crash until the program is stopped (by a crashed terminal, X or
kernel, that is).

Oddly enough, the system runs pretty stably for at least days of normal
use as long as the clock symptoms don't show up (and you don't run those
test programs). That means it has not crashed much recently; I have only
rebooted it because of the jumping clock annoyance, which -
among other things - results in sluggish UI components and frequent
short freezes in ssh connections.

> > E.g. which type of hardware problem?
> Since the machine used to work last fall, It's probably just a
> FAN or RAM  problems.

I'll swap the RAM sticks around for now but I suspect it's something
else. I still fail to grasp how calls to gettimeofday() are able
to cause random writes to memory...

Summary:
       - No apparent hardware issue.
       - System runs stable as long as you don't for (;;) gettimeofday();
       - notsc being evaluated. I will get back to you later.
         Does not resolve the odd test software crash, though. 

Kind regards,
		Fionn

P.S.: Please keep sending me a Cc:, I grabbed this one from the archive


* Re: System time warping around real time problem - please help
  2003-03-25 21:16   ` Fionn Behrens
@ 2003-03-25 22:14     ` george anzinger
  2003-03-25 22:55       ` Fionn Behrens
  0 siblings, 1 reply; 19+ messages in thread
From: george anzinger @ 2003-03-25 22:14 UTC (permalink / raw)
  To: Fionn Behrens; +Cc: linux-kernel, root, tim

Fionn Behrens wrote:
> Richard B. Johnson wrote:
> 
> 
>>On Tue, 25 Mar 2003, Fionn Behrens wrote: 
>>
>>>On Tue, 2003-03-25 at 18:07, Richard B. Johnson wrote:
>>>
>>>>On Tue, 25 Mar 2003, Fionn Behrens wrote:
>>>
>>>>>I have got an increasingly annoying problem with our fairly new
>>>>>(fall '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 
>>>>>2.4.20.
>>>
>>>>I am using the exact same kernel (a lot of folks are). There
>>>>is no such jumping on my system.
>>>>Try this program:
>>>
>>>[... prg1.c ...]
>>>
>>>
>>>>If this shows time jumping around you have one of either:
>>>>
>>>>(1) Bad timer channel 0 chip (PIT).
>>>>(2) Some daemon trying to sync time with another system.
>>>>(3) You are traveling too close to the speed of light.
>>>
>>>It just exits immediately with exit code 1. (*shrug*)
> 
> 
>>Hmmm. Note that the for(;;) { } provides no exit path.
> 
> 
> I noticed that well and investigated the issue using ddd. Funnily enough
> the program runs well in ddd until X crashes. But in the shell it still
> behaves like it would be nothing but exit(1);
>  
> 
>>So, you probably have some bad RAM or your CPU is too 
>>hot (broken fan??), or something like that. 
> 
> 
> None of the above. The system is liquid cooled and subject to continuous
> thermal monitoring. The RAM is 1GB Infineon ECC. Before the weekend I
> had the machine running overnight with memtest86 - 14 hours, all tests
> activated. Not a single error.
> I also tried an endless kernel compile loop the other day and the
> machine compiled about 100 kernels in approx two hours without a hitch.
> 
> 
>>>[... prg2.c ...]
>>>
>>>When I run this code it begins to put out Prev N New M lines.
> 
> 
>>>Prev 1048615862810879.000000 New 1048615862759879.000000
> 
> 
>>>After a few seconds of run time random processes on my machine begin
>>>to crash, or I get kernel oopses and kernel freezes. Looks very 
>>>much like heavy use of gettimeofday() causes random writes in system
>>>memory.
> 
> 
>>Looks very much like you have a real bad hardware problem. 
> 
> 
> Just what, that is the question. After having activated the notsc
> feature the system has not yet exposed the warp symptoms but as I noted
> in the beginning it may well take a day or two for that to happen.
> 
> Yet still, running the first (in ddd) or second test programs - despite
> the current absence of any error message - causes random processes to
> crash until the program is being stopped (by a crashed terminal, X or
> kernel, that is).
> 
> Oddly enough, the system runs pretty stable for at least days of normal
> use as long as the clock symptoms dont show up (and you dont run those
> test programs). Which means it has not crashed a lot recently, just
> being rebooted by me because of the jumping clock annoyance which -
> among others - results in sluggishly behaving UI components and frequent
> short connection freezes in ssh connections.
> 
> 
>>>E.g. which type of hardware problem?
>>
>>Since the machine used to work last fall, It's probably just a
>>FAN or RAM  problems.
> 
> 
> I'll swap the RAM sticks around for now but I suspect its something
> else. I just still fail to grasp  how calls to gettimeofday() are able
> to cause random writes to memory...
> 
> Summary:
>        - No apparent hardware issue.
>        - System runs stable as long as you dont for (;;) gettimeofday();
>        - notsc being evaluated. I will get back to you later.
>          Does not resolve the odd test software crash, though. 
> 
> Kind regards,
> 		Fionn
> 
> P.S.: Please keep sending me a Cc:, I grabbed this one from the archive
> -
This all sounds very much like the TSCs are drifting WRT each other. 
Is it possible that you have some power management code (or hardware) 
that is slowing one cpu and not the other?
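
The effect can be modelled without real hardware. The sketch below is a
simplified simulation (in Python, like the original test loop), assuming
two per-CPU cycle counters that tick at the same rate but are offset by
a fixed number of cycles, and a process that alternates between the CPUs
on every reading. The function name count_backward_steps and all
parameter values are illustrative assumptions, not measurements from the
machine in this thread:

```python
def count_backward_steps(samples, offset_cycles,
                         hz=1_000_000_000, step_cycles=1000):
    """Count readings that are earlier than the previous reading.

    CPU 1's simulated counter lags CPU 0's by `offset_cycles`; the
    reader switches CPU on every sample, as a migrating process might.
    """
    backwards = 0
    prev = None
    for i in range(samples):
        cpu0 = i * step_cycles            # simulated counter on CPU 0
        cpu1 = cpu0 - offset_cycles       # simulated counter on CPU 1
        cycles = cpu0 if i % 2 == 0 else cpu1
        now = cycles / hz                 # convert cycles to seconds
        if prev is not None and now < prev:
            backwards += 1
        prev = now
    return backwards
```

In this model an offset smaller than the interval between two readings
goes unnoticed, while a larger one makes every switch to the lagging
CPU look like a step back in time.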

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml



* Re: System time warping around real time problem - please help
  2003-03-25 22:14     ` george anzinger
@ 2003-03-25 22:55       ` Fionn Behrens
  2003-03-26  0:13         ` Alan Cox
  0 siblings, 1 reply; 19+ messages in thread
From: Fionn Behrens @ 2003-03-25 22:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: george anzinger

On Tue, 2003-03-25 at 23:14, george anzinger wrote:
> Fionn Behrens wrote:

> > Summary:
> >        - No apparent hardware issue.
> >        - System runs stable as long as you dont for (;;) gettimeofday();
> >        - notsc being evaluated. I will get back to you later.
> >          Does not resolve the odd test software crash, though.

> This all sounds very much like the TSCs are drifting WRT each other. 
> Is it possible that you have some power management code (or hardware) 
> that is slowing one cpu and not the other?

Well, I still don't really know what TSCs actually are (or what TSC
stands for).

The only suspect in that case would be the amd76x_pm.o kernel module,
which I am admittedly using. It saves about 90 watts of power when the
machine is idle...

I'll check what happens when the system boots without amd76x_pm.
Will report back tomorrow.

Thanks to all for keeping the suggestions going!

Regards,
	F. Behrens


* Re: System time warping around real time problem - please help
  2003-03-25 22:55       ` Fionn Behrens
@ 2003-03-26  0:13         ` Alan Cox
  2003-03-26  2:28           ` george anzinger
                             ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Alan Cox @ 2003-03-26  0:13 UTC (permalink / raw)
  To: Fionn Behrens; +Cc: Linux Kernel Mailing List, george anzinger

On Tue, 2003-03-25 at 22:55, Fionn Behrens wrote:
> > This all sounds very much like the TSCs are drifting WRT each other. 
> > Is it possible that you have some power management code (or hardware) 
> > that is slowing one cpu and not the other?
> 
> Well, I still don't really know what TSCs actually are (or what TSC
> stands for).
> 
> The only suspect in that case would be the amd76x_pm.o kernel module
> which I am admittedly using. It saves about 90Watts of power when the
> machine is idle...

If you are using amd76x_pm, boot with "notsc"; ditto, for that matter,
on dual Athlons with APM or ACPI in some cases. In fact I wish people
would stop using the TSC for clock timing altogether. It simply doesn't
work on a lot of modern systems.




* Re: System time warping around real time problem - please help
  2003-03-26  0:13         ` Alan Cox
@ 2003-03-26  2:28           ` george anzinger
  2003-03-26 14:38             ` Alan Cox
  2003-03-26  3:11           ` Chris Friesen
  2003-03-26 10:48           ` Fionn Behrens
  2 siblings, 1 reply; 19+ messages in thread
From: george anzinger @ 2003-03-26  2:28 UTC (permalink / raw)
  To: Alan Cox; +Cc: Fionn Behrens, Linux Kernel Mailing List

Alan Cox wrote:
> On Tue, 2003-03-25 at 22:55, Fionn Behrens wrote:
> 
>>>This all sounds very much like the TSCs are drifting WRT each other. 
>>>Is it possible that you have some power management code (or hardware) 
>>>that is slowing one cpu and not the other?
>>
>>Well, I still don't really know what TSCs actually are (or what TSC
>>stands for).

Stands for Time Stamp Counter.  It is a special CPU register that
basically counts CPU cycles.  Sometimes (incorrectly, methinks) it is
affected by power management code, which slows the CPU by changing the
CPU frequency.
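
A small worked example of why a frequency change matters, assuming the
time is derived by dividing the cycle count by a rate that was
calibrated once at boot. The function name apparent_seconds and the
numbers are illustrative only:

```python
def apparent_seconds(real_seconds, actual_hz, calibrated_hz):
    """Elapsed time as seen by a clock that assumes `calibrated_hz`."""
    cycles = real_seconds * actual_hz     # cycles actually counted
    return cycles / calibrated_hz         # converted with the stale rate
```

For a CPU calibrated at 1 GHz but slowed to 500 MHz, 10 real seconds
show up as only 5 seconds of counted time.
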
>>
>>The only suspect in that case would be the amd76x_pm.o kernel module
>>which I am admittedly using. It saves about 90Watts of power when the
>>machine is idle...
> 
> 
> If you are using amd76x_pm boot with "notsc", ditto for that matter
> on dual athlons with APM or ACPI in some cases. In fact I wish people
> would stop using the tsc for clock timing altogether. It simply doesn't
> work on a lot of modern systems
> 
I agree; however, what is really needed is not available in x86
machines, i.e. a CPU register that has a fixed and stable count rate.
An I/O register is second best because of the long time it takes to
read it.
> 
> 

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml



* Re: System time warping around real time problem - please help
  2003-03-26  0:13         ` Alan Cox
  2003-03-26  2:28           ` george anzinger
@ 2003-03-26  3:11           ` Chris Friesen
  2003-03-26 14:35             ` Alan Cox
  2003-03-26 10:48           ` Fionn Behrens
  2 siblings, 1 reply; 19+ messages in thread
From: Chris Friesen @ 2003-03-26  3:11 UTC (permalink / raw)
  To: Alan Cox; +Cc: Fionn Behrens, Linux Kernel Mailing List, george anzinger

Alan Cox wrote:

> If you are using amd76x_pm boot with "notsc", ditto for that matter
> on dual athlons with APM or ACPI in some cases. In fact I wish people
> would stop using the tsc for clock timing altogether. It simply doesn't
> work on a lot of modern systems

But it's awfully nice for low-impact high-resolution timestamps.

Maybe someday hardware manufacturers will give us a monotonic GHz+ clock that is 
synced across all cpus and is cheap to read...

Chris


-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com



* Re: System time warping around real time problem - please help
  2003-03-26  0:13         ` Alan Cox
  2003-03-26  2:28           ` george anzinger
  2003-03-26  3:11           ` Chris Friesen
@ 2003-03-26 10:48           ` Fionn Behrens
  2 siblings, 0 replies; 19+ messages in thread
From: Fionn Behrens @ 2003-03-26 10:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alan Cox

On Wed, 2003-03-26 at 01:13, Alan Cox wrote:
> On Tue, 2003-03-25 at 22:55, Fionn Behrens wrote:
> > > This all sounds very much like the TSCs are drifting WRT each other. 
> > > Is it possible that you have some power management code (or hardware) 
> > > that is slowing one cpu and not the other?
> > 
> > The only suspect in that case would be the amd76x_pm.o kernel module
> > which I am admittedly using. It saves about 90Watts of power when the
> > machine is idle...
> 
> If you are using amd76x_pm boot with "notsc", ditto for that matter
> on dual athlons with APM or ACPI in some cases.

I booted without amd76x_pm today and the problems are gone. I tried
notsc yesterday and dmesg said TSC had been deactivated on both CPUs. No
libc6 problems - Debian is using the i386 version by default.
Oddly enough, the system still crashed on those two for (;;) time(); test
loops posted earlier in this thread. So the only (unsatisfying) solution
I see for now is to keep the CPUs glowing hot for the sake of stability.

Any idea what else could cause the crashes in the absence of TSC usage?

As a still unresolved side note, I am unable to run the first
test program as my default user (it immediately exits with return value 1).
Run as root or as the system test user, the program behaves as
expected (including the crash with amd76x_pm). ldd shows no difference,
and the same shell is used.

With kind regards,
		F. Behrens


* Re: System time warping around real time problem - please help
  2003-03-26  3:11           ` Chris Friesen
@ 2003-03-26 14:35             ` Alan Cox
  0 siblings, 0 replies; 19+ messages in thread
From: Alan Cox @ 2003-03-26 14:35 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Fionn Behrens, Linux Kernel Mailing List, george anzinger

On Wed, 2003-03-26 at 03:11, Chris Friesen wrote:
> But its awfully nice for low-impact high-resolution timestamps.
> 
> Maybe someday hardware manufacturers will give us a monotonic GHz+ clock that is 
> synced across all cpus and is cheap to read...

x86-64 has HPET



* Re: System time warping around real time problem - please help
  2003-03-26  2:28           ` george anzinger
@ 2003-03-26 14:38             ` Alan Cox
  2003-03-26 16:12               ` george anzinger
  0 siblings, 1 reply; 19+ messages in thread
From: Alan Cox @ 2003-03-26 14:38 UTC (permalink / raw)
  To: george anzinger; +Cc: Fionn Behrens, Linux Kernel Mailing List

On Wed, 2003-03-26 at 02:28, george anzinger wrote:
> Stands for Time Stamp Counter.  It is a special cpu register that 
> basically counts cpu cycles.  Some times (incorrectly me thinks) it is 
> affected by power management code which slows the cpu by changing the 
> cpu frequency.

Not incorrectly. It counts CPU clocks; it's designed for profiling and
the like. There is no guarantee in any Intel MP standard that the clocks
are synched up.



* Re: System time warping around real time problem - please help
  2003-03-26 14:38             ` Alan Cox
@ 2003-03-26 16:12               ` george anzinger
  2003-03-26 17:06                 ` Richard B. Johnson
  0 siblings, 1 reply; 19+ messages in thread
From: george anzinger @ 2003-03-26 16:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: Fionn Behrens, Linux Kernel Mailing List, Grover, Andrew

Alan Cox wrote:
> On Wed, 2003-03-26 at 02:28, george anzinger wrote:
> 
>>Stands for Time Stamp Counter.  It is a special cpu register that 
>>basically counts cpu cycles.  Some times (incorrectly me thinks) it is 
>>affected by power management code which slows the cpu by changing the 
>>cpu frequency.
> 
> 
> Not incorrectly. It counts cpu clocks, its designed for profiling and
> the like. There is no guarantee in any Intel MP standard that the clocks
> are synched up.


> 
I seem to recall a different notion of correctness from Andy Grover...
but memory may deceive :(


As for sync, I would think it is a motherboard issue.

But as you say, Intel should put in a usable counter.  The HPET seems
to have the capabilities; however, I suspect that it is a slow
read.  Any idea how many cycles it takes to do a memory mapped I/O access?
-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml



* Re: System time warping around real time problem - please help
  2003-03-26 16:12               ` george anzinger
@ 2003-03-26 17:06                 ` Richard B. Johnson
  2003-03-26 18:12                   ` george anzinger
  0 siblings, 1 reply; 19+ messages in thread
From: Richard B. Johnson @ 2003-03-26 17:06 UTC (permalink / raw)
  To: george anzinger
  Cc: Alan Cox, Fionn Behrens, Linux Kernel Mailing List, Grover, Andrew

On Wed, 26 Mar 2003, george anzinger wrote:

> Alan Cox wrote:
> > On Wed, 2003-03-26 at 02:28, george anzinger wrote:
> >
> >>Stands for Time Stamp Counter.  It is a special cpu register that
> >>basically counts cpu cycles.  Some times (incorrectly me thinks) it is
> >>affected by power management code which slows the cpu by changing the
> >>cpu frequency.
> >
> >
> > Not incorrectly. It counts cpu clocks, its designed for profiling and
> > the like. There is no guarantee in any Intel MP standard that the clocks
> > are synched up.
>
>
> >
> I seem to recall a different notion of correctness from Andy Grover...
>   but memory may deceive :(
>
>
> As for sync, I would think it is a mother board issue.
>
> But as you say, Intel should put in a usable counter.  The HPET seems
> like it has the capabilities, however, I suspect that it is a slow
> read.  Any idea how many cycles it takes to do a memory mapped I/O access?
> --

It depends how the read is made. A direct read of non-cached RAM
like this:

		movl	(MY_IODEV), %eax

... takes 4 CPU clocks if the device doesn't insert any wait-states.
With fast CPUs, such is impossible: you are I/O bound by the front-side
bus speed. A good guess of the time to read is 4/(bus MHz) microseconds,
because there are 4 bus cycles for a read or write.
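
As a worked instance of that estimate, here is a tiny sketch (in Python,
like the original test loop). The 100 MHz bus in the note below is an
example value, not taken from any machine in this thread, and the helper
name mmio_read_ns is hypothetical:

```python
def mmio_read_ns(bus_mhz, bus_cycles=4):
    """Estimated read time in nanoseconds: bus cycles / bus frequency."""
    # bus_cycles / bus_mhz is in microseconds; scale by 1000 for ns.
    return bus_cycles * 1000.0 / bus_mhz
```

At 100 MHz this gives 40 ns per read, and about 30 ns at 133 MHz,
before any wait-states.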

If the 'C' compiler decides to do indexed addressing, where the
address gets calculated, the read times are greater. For instance,

		movl	(%ebx), %eax

... takes 8 clocks even ignoring the fact that the virtual address
needs to be put into register ebx. If the address is on certain
boundaries (not necessarily the same for different CPUs), the
reads can be slower.

Slow reads don't really hurt. In fact, they make sure that subsequent
reads will always return positive time. It's just a bias in the
time that affects everybody the same way. What hurts is trying to
synchronize to some external clock. In my opinion, this is not
the correct way to get the time. `rdtsc` returns a long long
in two registers. This should be saved as "reference time"
every time the system clock is set. Setting the system clock
means saving (only) the time_t object, in seconds, at the
time one saves the rdtsc time. This time_t object is never
changed otherwise. The PIT only generates interrupts. It is
not used for time. When somebody needs the time, it is calculated
from the present `rdtsc`, the saved long long value, and the
time_t time at which that value was saved. This guarantees
that all time is positive and no CPU cycles are wasted trying
to read anything.

The number of CPU cycles per second is calculated once upon
startup just as it is now. If you shut down, or slow the
CPU for power-saving, you just recalculate the CPU cycles
and reset the time from CMOS. Any time read while the machine is
in 'slow' mode is still correct.

time_t  set_time;
long long cpu_cycles_sec;
long long rd_tsc_at_set_time;

To read time:

current_time = ((rdtsc() - rd_tsc_at_set_time)/ cpu_cycles_sec) +
               set_time;

To set time:

set_time = get_time_from_CMOS();
rd_tsc_at_set_time = rdtsc();

... That's all you need.
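
The outline above could be written as compilable C roughly as follows. This is only a sketch: the struct and function names are hypothetical, and the TSC value is passed in as a parameter so the arithmetic can be exercised anywhere. It also assumes a single counter running at a constant rate, which is the assumption questioned elsewhere in this thread for SMP and power-managed boxes.

```c
#include <stdint.h>
#include <time.h>

/* Reference point captured whenever the system clock is set. */
struct tsc_clock {
    time_t   set_time;         /* wall-clock seconds at the reference point */
    uint64_t tsc_at_set_time;  /* TSC value sampled at the same moment */
    uint64_t cycles_per_sec;   /* calibrated once at startup */
};

/* "To set time": store the wall-clock seconds together with the TSC. */
static void tsc_clock_set(struct tsc_clock *c, time_t now, uint64_t tsc,
                          uint64_t cycles_per_sec)
{
    c->set_time = now;
    c->tsc_at_set_time = tsc;
    c->cycles_per_sec = cycles_per_sec;
}

/* "To read time": elapsed cycles scaled to whole seconds, added to the
 * saved reference. The unsigned subtraction stays correct across a wrap. */
static time_t tsc_clock_read(const struct tsc_clock *c, uint64_t tsc)
{
    uint64_t elapsed = tsc - c->tsc_at_set_time;
    return c->set_time + (time_t)(elapsed / c->cycles_per_sec);
}
```

On x86 with GCC or Clang, the tsc argument could come from __rdtsc() in <x86intrin.h>. The result never goes backwards as long as every reader sees the same, steadily increasing counter.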


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


* Re: System time warping around real time problem - please help
  2003-03-26 17:06                 ` Richard B. Johnson
@ 2003-03-26 18:12                   ` george anzinger
  0 siblings, 0 replies; 19+ messages in thread
From: george anzinger @ 2003-03-26 18:12 UTC (permalink / raw)
  To: root; +Cc: Alan Cox, Fionn Behrens, Linux Kernel Mailing List, Grover, Andrew

Richard B. Johnson wrote:
> On Wed, 26 Mar 2003, george anzinger wrote:
> 
> 
>>Alan Cox wrote:
>>
>>>On Wed, 2003-03-26 at 02:28, george anzinger wrote:
>>>
>>>
>>>>Stands for Time Stamp Counter.  It is a special cpu register that
>>>>basically counts cpu cycles.  Some times (incorrectly me thinks) it is
>>>>affected by power management code which slows the cpu by changing the
>>>>cpu frequency.
>>>
>>>
>>>Not incorrectly. It counts cpu clocks, its designed for profiling and
>>>the like. There is no guarantee in any Intel MP standard that the clocks
>>>are synched up.
>>
>>
>>I seem to recall a different notion of correctness from Andy Grover...
>>  but memory may deceive :(
>>
>>
>>As for sync, I would think it is a mother board issue.
>>
>>But as you say, Intel should put in a usable counter.  The HPET seems
>>like it has the capabilities, however, I suspect that it is a slow
>>read.  Any idea how many cycles it takes to do a memory mapped I/O access?
>>--
> 
> 
> It depends how the read is made. A direct read of non-cached RAM
> like this:
> 
> 		movl	(MY_IODEV), %eax
> 
> ... takes 4 CPU clocks if the device doesn't insert any wait-states.
> With fast CPUs, such is impossible. You are I/O bound by the front-side
> bus speed. A good guess of the time to read is 4/(bus MHz) because
> there are 4 bus-cycles for a read or write.
> 
> If the 'C' compiler decides to do indexed addressing, where the
> address gets calculated, the read times are greater. For instance,
> 
> 		movl	(%ebx), %eax
> 
> ... takes 8 clocks even ignoring the fact that the virtual address
> needs to be put into register ebx. If the address is on certain
> boundaries (not necessarily the same for different CPUs), the
> reads can be slower.
> 
> Slow reads don't really hurt. 

gettimeofday() is small enough that a couple of extra cycles DO show 
up.  And, as cpus get faster and faster WRT the front-side bus, this 
can easily be the majority of the time it takes to do gettimeofday().

> In fact, they make sure that subsequent
> reads will always return positive time. It's just a bias in the
> time that affects everybody the same way. What hurts is trying to
> synchronize to some external clock. In my opinion, this is not
> the correct way to get the time. `rdtsc` returns a long long
> in two registers. This should be saved as "reference time"
> every time the system clock is set. Setting the system clock
> means saving (only) the time_t object, in seconds, at the
> time one saves the rdtsc time. This time_t object is never
> changed otherwise. The PIT only generates interrupts. It is
> not used for time. When somebody needs the time, it is calculated
> from the present `rdtsc`, the saved long long value, and the
> time_t time at which that value was saved. This guarantees
> that all time is positive and no CPU cycles are wasted trying
> to read anything.
> 
> The number of CPU cycles per second is calculated once upon
> startup just as it is now. If you shut down, or slow the
> CPU for power-saving, you just recalculate the CPU cycles
> and reset the time from CMOS. Any time read while the machine is
> in 'slow' mode is still correct.

I did something close to this in the high-res-timers patch.  There are 
several problems WRT TSC:

First, and the reason this thread started, there are SMP boxes with pm 
code that causes the TSC to run at different speeds on the various cpus.

Second, I am led to believe that there are boxes that adjust the cpu 
speed without telling software about it.

Third, there are boxes that just don't have TSCs (yeah, I know they 
are old...).

Because of all of this, I have a configure option to use the ACPI pm 
counter.  The downside of this is a) it is slow (an I/O instruction) 
and b) the resolution is much lower than that of the TSC.
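
To put a number on the resolution point, here is a hedged C sketch of turning ACPI pm counter readings into nanoseconds. The 3.579545 MHz rate is the one given in the ACPI specification; the 24-bit counter width is an assumption (some chipsets provide 32 bits), and the names are invented for this example rather than taken from the high-res-timers patch.

```c
#include <stdint.h>

#define PMTMR_HZ    3579545u     /* ACPI pm timer frequency in Hz */
#define PMTMR_MASK  0x00FFFFFFu  /* assumption: 24-bit counter variant */

/* Ticks elapsed between two raw readings, allowing for one wrap of the
 * 24-bit counter. */
static uint32_t pmtmr_delta(uint32_t earlier, uint32_t later)
{
    return (later - earlier) & PMTMR_MASK;
}

/* Convert a tick count to nanoseconds (rounded down). */
static uint64_t pmtmr_ticks_to_ns(uint32_t ticks)
{
    return (uint64_t)ticks * 1000000000ull / PMTMR_HZ;
}
```

One tick is about 279 ns, whereas a TSC on a GHz-class cpu advances in steps of under a nanosecond. The raw readings would come from an I/O port read, which is the slow part mentioned above and is left out of the sketch.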
> 
> time_t  set_time;
> long long cpu_cycles_sec;
> long long rd_tsc_at_set_time;
> 
> To read time:
> 
> current_time = ((rdtsc() - rd_tsc_at_set_time)/ cpu_cycles_sec) +
>                set_time;
> 
> To set time:
> 
> set_time = get_time_from_CMOS();
> rd_tsc_at_set_time = rdtsc();
> 
> ... That's all you need.

In my code, since the time needs to be "fresh" each tick, I keep a 
counter up to date each tick.  This allows me to speed up the code by 
only reading the low half of the TSC, and at the same time, avoid some 
64-bit math.
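
A minimal sketch of that idea in C follows. It is not the code from the high-res-timers patch: the struct, field, and function names are hypothetical, and the TSC low word is passed in as a parameter so the arithmetic can be checked without reading the hardware.

```c
#include <stdint.h>

/* State refreshed on every timer tick. */
struct tick_base {
    uint64_t usec_at_tick;     /* running time in microseconds at the last tick */
    uint32_t tsc_low_at_tick;  /* low 32 bits of the TSC sampled at that tick */
    uint32_t cycles_per_usec;  /* calibrated at boot */
};

/* Called from the tick interrupt: advance the counter and resample. */
static void on_tick(struct tick_base *b, uint32_t tsc_low,
                    uint32_t usec_per_tick)
{
    b->usec_at_tick += usec_per_tick;
    b->tsc_low_at_tick = tsc_low;
}

/* Current time: the per-tick counter plus the cycles since that tick.
 * The 32-bit unsigned subtraction is correct modulo 2^32, so only the
 * low half of the TSC is needed and the divide is a 32-bit one. */
static uint64_t read_usec(const struct tick_base *b, uint32_t tsc_low)
{
    uint32_t delta = tsc_low - b->tsc_low_at_tick;
    return b->usec_at_tick + delta / b->cycles_per_usec;
}
```

This only holds while ticks arrive more often than the low word wraps (2^32 cycles is about two seconds at 2 GHz), which a regular timer interrupt satisfies with a wide margin.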

The other thing this way of doing things needs is code to discipline 
the interrupt source (the PIT in this case) so that the interrupts 
come "reasonably" close to the 1/Hz boundary.  Without this the timers 
have too much jitter.



-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


* Re: System time warping around real time problem - please help
@ 2003-04-03 13:22 Fionn Behrens
  0 siblings, 0 replies; 19+ messages in thread
From: Fionn Behrens @ 2003-04-03 13:22 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Alan Cox, george anzinger


Ok, I had the system running about a week with "notsc" AND no power
management. No system crashes so far. Nevertheless, I keep getting some
kernel oopses like this from time to time. The call trace suggests that
there is still an issue with timing.

Apr  3 15:09:51 rtfm kernel:  printing eip:
Apr  3 15:09:51 rtfm kernel: 49199fd0
Apr  3 15:09:51 rtfm kernel: *pde = 00000000
Apr  3 15:09:51 rtfm kernel: Oops: 0000
Apr  3 15:09:51 rtfm kernel: CPU:    0
Apr  3 15:09:51 rtfm kernel: EIP:    0010:[<49199fd0>]    Tainted: P 
Apr  3 15:09:51 rtfm kernel: EFLAGS: 00210287
Apr  3 15:09:51 rtfm kernel: eax: 3e8c329f   ebx: cfd15fac   ecx: 054f7f1e   edx: 000e213f
Apr  3 15:09:51 rtfm kernel: esi: bffffa50   edi: 00000000   ebp: bffffa58   esp: cfd15f9c
Apr  3 15:09:51 rtfm kernel: ds: 0018   es: 0018   ss: 0018
Apr  3 15:09:51 rtfm kernel: Process lmule (pid: 27969, stackpage=cfd15000)
Apr  3 15:09:51 rtfm kernel: Stack: c0122b4b bffffa50 cfd15fac 00000008 3e8c329f 000e213f cfd14000 bffffa50
Apr  3 15:09:51 rtfm kernel:        bffffab0 c01091ff bffffa50 00000000 408584d4 bffffa50 bffffab0 bffffa58
Apr  3 15:09:51 rtfm kernel:        0000004e 0000002b 0000002b 0000004e 40655501 00000023 00200287 bffffa1c
Apr  3 15:09:51 rtfm kernel: Call Trace:    [sys_gettimeofday+59/128] [system_call+51/56]
Apr  3 15:09:51 rtfm kernel: 
Apr  3 15:09:51 rtfm kernel: Code:  Bad EIP value.


Do you have any more ideas regarding this issue? I'd hate trying to send
the board in for a check...

Regards,
	Fionn (not subscribed to lkml)

* Re: System time warping around real time problem - please help
       [not found] <20030325164014$031c@gated-at.bofh.it>
@ 2003-03-26  9:31 ` Kay Diederichs
  0 siblings, 0 replies; 19+ messages in thread
From: Kay Diederichs @ 2003-03-26  9:31 UTC (permalink / raw)
  To: linux; +Cc: Fionn Behrens

Fionn,

I had similar problems, and reported them on this list on 12/04/2002.
The cause is the amd76x_pm module, which causes the TSCs of the CPUs
to become unsynchronized.

One way around this is to disable TSC altogether; in my case this 
required installing a glibc compiled for i386 (instead of i686), which 
slows some things down, and using the 'notsc' boot option.

However, programs that use rdtsc (in my case Intel's Fortran Compiler, 
ifc) then fail. As I develop programs using ifc, I therefore have not 
been able to use amd76x_pm - I wish there were a better solution, and 
wonder why this is not a problem with e.g. dual-processor Xeons.

Kay



Fionn Behrens wrote:
> Hello all,
> 
> 
> I have got an increasingly annoying problem with our fairly new (fall
> '02) Dual Athlon2k+ Gigabyte 7dpxdw linux system running 2.4.20.
> The only kernel patch applied is Alan Cox's ptrace patch.
> 
> To say it right away: the system is not overclocked or anything and
> never was. It has decent cooling and is used as a combined workstation
> and server.
> 
> I cant say exactly when it started but the system clock tends to begin
> jumping around real time in an erratic manner, usually after about 12-48
> hours of uptime. The maximum time jump is about 5 seconds back or forth
> so the time is always "about" right.
> To give you an example to visualize, you can watch asclock in X and see
> the second clock-hand jumping like 3 seconds backwards, then 5 seconds
> forth, 2 back and 1 forth or so within 2 or 3 seconds.
> For a demonstration I wrote the following short example in python:
> 
> t = 0
> while 1:
>   n = time()
>   if t > n: print t, ">", n
>   t = n
> 
> Running this loop returned the following lines:
> 
> 1048608745.61 > 1048608745.60
> 1048608745.63 > 1048608745.62
> 1048608745.65 > 1048608745.64
> 1048608748.23 > 1048608745.67
> 1048608748.27 > 1048608745.71
> 1048608748.30 > 1048608745.74
> 1048608748.34 > 1048608745.78
> 1048608748.42 > 1048608745.86
> 1048608748.47 > 1048608745.91
> 1048608748.52 > 1048608745.96
> [----cut----]
> 
> So you see the time() on this system is constantly overtaking itself and
> jumping back. It almost looks like two parallel time()s are there and it
> switches back and forth between them.
> 
> I recompiled the kernel, I upgraded the BIOS to the latest version
> available, I disabled ntp and tried some more I dont recall yet - no
> success. Due to the erratic timer, working on the machine is no fun.
> Software crashes are regularly - naturally. No programmer expects system
> timers going back in time.
> 
> I am pretty desperate and I'd appreciate any hints on what to check.
> I'll glady present any system detail that you might miss for a proper
> analysis on request per email or on freenode (Fionn).
> 
> Thank you in advance,
>    			F. Behrens (Not a subscriber of this list)

-- 
Kay Diederichs         http://strucbio.biologie.uni-konstanz.de/~kay
email: Kay.Diederichs @ uni-konstanz.de  Tel +49 7531 88 4049 Fax 3183
When replying to my email, please remove the blanks before and after the 
"@" !
Fakultaet fuer Biologie, Universitaet Konstanz, Box M656, D-78457 Konstanz


end of thread, other threads:[~2003-04-03 13:10 UTC | newest]

Thread overview: 19+ messages
2003-03-25 16:32 System time warping around real time problem - please help Fionn Behrens
2003-03-25 17:07 ` Richard B. Johnson
2003-03-25 17:17   ` Tim Schmielau
2003-03-25 18:12   ` Fionn Behrens
2003-03-25 18:29     ` Richard B. Johnson
2003-03-25 21:16   ` Fionn Behrens
2003-03-25 22:14     ` george anzinger
2003-03-25 22:55       ` Fionn Behrens
2003-03-26  0:13         ` Alan Cox
2003-03-26  2:28           ` george anzinger
2003-03-26 14:38             ` Alan Cox
2003-03-26 16:12               ` george anzinger
2003-03-26 17:06                 ` Richard B. Johnson
2003-03-26 18:12                   ` george anzinger
2003-03-26  3:11           ` Chris Friesen
2003-03-26 14:35             ` Alan Cox
2003-03-26 10:48           ` Fionn Behrens
     [not found] <20030325164014$031c@gated-at.bofh.it>
2003-03-26  9:31 ` Kay Diederichs
2003-04-03 13:22 Fionn Behrens
